breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	f9d71d50d1	Add exclude region marking in Structure step Users can now draw rectangles on the document image in the Structure Detection step to mark areas (e.g. header graphics, alphabet strips) that should be excluded from OCR results during grid building. - Backend: PUT/DELETE endpoints for exclude regions stored in structure_result - Backend: _build_grid_core() filters all words inside user-defined exclude regions - Frontend: Interactive rectangle drawing with visual overlay and delete buttons - Preserve exclude regions when re-running structure detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 09:08:30 +01:00
Benjamin Admin	c09838e91c	Fix spine shadow false positives: require dark valley, brightness rise, trim convolution edges The _detect_spine_shadow function was triggering on normal text content because shadow_range > 20 was too low and convolution edge artifacts created artificially low values. Now requires: range > 40, darkest < 180, narrow valley (not text plateau), and brightness rise toward page content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 08:23:50 +01:00
Benjamin Admin	3fd6523872	Cut at spine center (darkest point) instead of shadow edge Refactor left/right shadow detection into shared _detect_spine_shadow() that finds the darkest column (= book spine center) via argmin of smoothed brightness. Both sides now cut at the spine center, ensuring equal page sizes in double-page scans regardless of shadow position. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 07:54:33 +01:00
Benjamin Admin	e56391b0c3	Add right-edge spine shadow detection for book scans Mirror the left-edge shadow detection for the right side: analyze brightness gradient in the right 25% to find scanner gray strips from book spines. Cuts at the last bright column before the shadow dip. Fixes cropping of book scans where the next page bleeds in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-19 07:41:13 +01:00
Benjamin Admin	a3e2a7f994	Add GT button to OCR overlay, prominent category picker, track pipeline - Ground Truth button on last step of Pipeline/Kombi modes in ocr-overlay - Prominent category picker in active session info bar (pulses when unset) - GT badge shown when session has ground truth reference - Backend: auto-detect pipeline from ocr_engine, store in GT snapshot - Pipeline info shown in GT session list and regression reports - Also pass pipeline param from ocr-pipeline StepGroundTruth Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 14:49:02 +01:00
Benjamin Admin	f655db30e4	Add Ground Truth regression test system for OCR pipeline Extract _build_grid_core() from build_grid() endpoint for reuse. New ocr_pipeline_regression.py with endpoints to mark sessions as ground truth, list them, and run regression comparisons after code changes. Frontend button in StepGroundTruth.tsx to mark/update GT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 13:46:48 +01:00
Benjamin Admin	c894a0feeb	Improve IPA continuation row detection with phonetic heuristics Strip IPA brackets that fix_cell_phonetics may have added for short dictionary words (e.g. "si" → "[si]") before checking if the row is a garbled phonetic continuation. Detect phonetic text by presence of ':' (length marks), leading apostrophe (stress marks), or absence of any word with ≥3 letters. Fixes Row 39 ("si: [si] — So: - si:n") not being removed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 12:08:21 +01:00
Benjamin Admin	8ef4c089cf	Remove IPA continuation rows and support hyphenated word lookup - grid_editor_api: After IPA correction, detect rows containing only garbled phonetics in the English column (no German translation, no IPA brackets inserted). These are wrap-around lines where printed IPA extends to the line below the headword. Remove them since the headword row already has correct IPA. - cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form as fallback (e.g. "second-hand" → "secondhand") for dictionary lookup, fixing IPA insertion for compound words. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 12:05:38 +01:00
Benjamin Admin	821e5481c2	Only apply IPA correction on vocabulary tables (≥3 columns) Single-column German text pages were getting IPA inserted for words that happen to exist in the English dictionary ("die" → [dˈaɪ], "Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3 columns, which is the minimum for a vocabulary table layout (English \| article \| German). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:50:03 +01:00
Benjamin Admin	b98ea33a3a	Strip garbled OCR phonetics after IPA insertion _insert_missing_ipa now removes garbled phonetic text (e.g. "skea", "sku:l", "'sizaz") that follows the inserted IPA bracket. Keeps delimiters (–, -), uppercase words (German), and known English words. Fixes: "scare [skˈɛə] skea" → "scare [skˈɛə]" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:15:14 +01:00
Benjamin Admin	f139d0903e	Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid - _merge_inline_marker_columns: skip merge when ≥50% of words are alphabetic (preserves "to", "in", "der" columns) - Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &") - IPA phonetics: map longest-avg-text column to column_en so fix_cell_phonetics runs in the grid editor - ocr_pipeline_overlays: add missing split_page_into_zones import Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 11:08:23 +01:00
Benjamin Admin	962bbbe9f6	Remove scattered debris rows and disable spanning header detection - Add Rule 3 to junk-row filter: rows where no word is longer than 2 chars are removed as scattered OCR debris from illustrations - Fully disable spanning-header detection which falsely flagged IPA transcriptions and vocabulary entries as spanning headers - First-row heuristic remains for genuine header detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 10:47:17 +01:00
Benjamin Admin	9da45c2a59	Fix false header detection and add decorative margin/footer filters - Remove all_colored spanning header heuristic that falsely flagged colored vocabulary entries (Scotland, secondary school) as headers - Add _filter_decorative_margin: removes vertical A-Z alphabet strips along page margins (single-char words in a compact vertical strip) - Add _filter_footer_words: removes page numbers in bottom 5% of page - Tighten spanning header rule: require ≥3 columns spanned + ≤3 words Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 10:38:20 +01:00
Benjamin Admin	64447ad352	Raise color sat_threshold from 50 to 55 to avoid scanner blue artifacts Black text has median_sat ~6-7, green text ~63-65. At threshold 50, scanner blue tints (median_sat ~50-54) on words like "Wasser" were falsely classified as blue. Threshold 55 has good margin on both sides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 09:13:09 +01:00
Benjamin Admin	00cbf266cb	Add oversized-stub filter for large page numbers/marks in grid rows Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median are removed as non-content elements (e.g. red page number "( 9"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 09:05:07 +01:00
Benjamin Admin	f9bad7beaa	Filter phantom rows from recovered color artifacts and low-conf OCR noise - Apply recovered-artifact filter to ALL zones (was box-zones only) - Filter any recovered word with text ≤ 2 chars (not just !?•·) - Add post-grid junk-row removal: rows where all word_boxes have conf < 50 and text ≤ 3 chars are dropped as OCR noise Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 09:00:43 +01:00
Benjamin Admin	143e41ec76	add: ocr_pipeline_overlays.py for overlay rendering functions Extracted 4 overlay functions (_get_structure_overlay, _get_columns_overlay, _get_rows_overlay, _get_words_overlay) that were missing from the initial split. Provides render_overlay() dispatcher used by sessions module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:46:49 +01:00
Benjamin Admin	ec287fd12e	refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules Each module is under 1050 lines: - ocr_pipeline_common.py (354) - shared state, cache, models, helpers - ocr_pipeline_sessions.py (483) - session CRUD, image serving, doc-type - ocr_pipeline_geometry.py (1025) - deskew, dewarp, structure, columns - ocr_pipeline_rows.py (348) - row detection, box-overlay helper - ocr_pipeline_words.py (876) - word detection (SSE), paddle-direct - ocr_pipeline_ocr_merge.py (615) - merge helpers, kombi endpoints - ocr_pipeline_postprocess.py (929) - LLM review, reconstruction, export - ocr_pipeline_auto.py (705) - auto-mode orchestrator, reprocess ocr_pipeline_api.py is now a 61-line thin wrapper that re-exports router, _cache, and test-imported symbols for backward compatibility. No changes needed in main.py or tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:42:00 +01:00
Benjamin Admin	98f7f7d7d5	fix: NameError in paddle_kombi/rapid_kombi cache update The previous commit added `cached["word_result"]` but `cached` was not defined in these functions. Changed to safely check `_cache` dict first. Also includes sat_threshold fix (70→50) for green text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:12:01 +01:00
Benjamin Admin	a19bca6060	fix: lower color sat_threshold from 70 to 50 for green text detection Green text words like "Insel" and "Internet" had median_sat=65, just below the threshold of 70, causing them to be classified as black. Black text has median_sat=6-7, so threshold=50 provides clear separation (6-7 vs 63-65) without false positives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 08:00:35 +01:00
Benjamin Admin	5359a4cc2b	fix: cache word_result in paddle_kombi/rapid_kombi for detect-structure Both kombi OCR functions wrote word_result to DB but not to the in-memory cache. When detect-structure ran next, it found no words and passed an empty list to graphic detection, making all word-overlap heuristics ineffective. This caused green text words to be wrongly classified as graphic regions. Also adds a fallback in detect-structure to use raw OCR word lists if cell word_boxes are empty. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 07:29:02 +01:00
Benjamin Admin	a25214126d	fix: merge overlapping OCR words with different text (Stick/Stück) Two issues in paddle-kombi word merge: 1. Overlap threshold too strict: PaddleOCR "Stick" and Tesseract "Stück" overlap at 48.6%, just below the 50% threshold. Both words ended up in the result, overlapping on the same position. Fix: lower threshold from 50% to 40%. 2. Text selection blind to confidence: always took PaddleOCR text even when Tesseract had higher confidence and correct text. Fix: when texts differ due to spatial-only match, prefer the engine with higher confidence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 07:00:57 +01:00
Benjamin Admin	19b93f7762	fix: conservative column detection + smart graphic word filter Column detection: - Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in flowing text where random gaps < 35% of rows) - Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3 - Vocabulary worksheets unaffected (columns appear in >80% of rows) Graphic word filter: - Only remove words with OCR confidence < 50 inside graphic regions - High-confidence words are real text, not image artifacts - Prevents legitimate colored text from being discarded Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 18:19:25 +01:00
Benjamin Admin	a079ffe8e9	fix: robust colored-text detection in graphic filter The 25x25 dilation kernel merges nearby green words into large regions, so pixel-overlap with OCR word boxes drops below 50%. Previous density checks alone weren't sufficient. New multi-layered approach: - Count OCR word CENTROIDS inside each colored region - ≥2 centroids → definitely text (images don't produce multiple words) - 1 centroid + 10%+ pixel overlap → likely text - Lower pixel overlap threshold from 50% to 40% - Raise density+height thresholds for text-line detection - Use INFO logging to diagnose remaining false positives Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 18:09:16 +01:00
Benjamin Admin	6e1d715d0d	fix: prevent colored text from being falsely detected as graphics Add color pixel density checks to cv_graphic_detect.py Pass 1: - density < 20% → skip (text strokes are thin, images are filled) - density < 30% + height < 4% page → skip (colored text line) This fixes green headings (Insel, Internet, Inuit) being removed as graphic regions, which also caused word reordering in lines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 17:30:35 +01:00
Benjamin Admin	d66efdecf5	fix: NameError in detect_page_splits - 'gaps' var removed in rewrite Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 17:01:34 +01:00
Benjamin Admin	d36972b464	fix: detect spine by brightness, not ink density The previous algorithm used binary ink projection and found false splits at normal text column gaps. The spine of a book on a scanner has a characteristic DARK gray strip (scanner bed) flanked by bright white paper on both sides. New approach: column-mean brightness with heavy smoothing, looking for a dark valley (< 88% of paper brightness) in the center region that has bright paper on both sides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 16:52:29 +01:00
Benjamin Admin	f30e526917	fix: merge nearby spine gaps + handle multi-page crop in frontend Backend: merge gaps within 5% of image width, because the spine area may have thin ink strips splitting one physical gap into multiple detected gaps. Only use gaps >= 2% width as split points. Frontend: StepCrop now handles multi_page crop responses without crashing on missing original_size/cropped_size fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 16:44:32 +01:00
Benjamin Admin	438a4495c7	fix: swap 90°/270° rotation direction in orientation detection Tesseract OSD 'rotate' returns the clockwise correction needed, but the code was applying counterclockwise for 90° and clockwise for 270°, exactly reversed. This caused pages scanned sideways to be flipped upside down instead of corrected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 16:39:15 +01:00
Benjamin Admin	902de027f4	feat: auto-detect multi-page spreads and split into sub-sessions When a book scan (double-page spread) is detected during the crop step, the system automatically: 1. Detects vertical center gaps (spine area) via ink density projection 2. Splits into N page sub-sessions (reusing existing sub-session mechanism) 3. Individually crops each page (removing its own borders) 4. Returns sub-session IDs for downstream pipeline processing Detection: landscape images (w > h * 1.15), vertical gap < 15% peak density in center region (25-75%), gap width >= 0.8% of image width. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 16:34:06 +01:00
Benjamin Admin	b1cdb2531c	feat: CSS Grid editor with OCR-measured column widths and row heights Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px) to build-grid response for faithful grid reconstruction. Frontend: rewrite GridTable from HTML <table> to CSS Grid layout. Column widths are now proportional to the OCR-measured x_min/x_max positions. Row heights use the average content row height from the scan. Column and row resize via drag handles (Excel-like). Font: add Noto Sans (supports IPA characters) via next/font/google. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 13:48:47 +01:00
Benjamin Admin	ab30e8b17a	feat: apply IPA phonetic correction in build-grid combo mode fix_cell_phonetics was only called in the OCR pipeline endpoints (/words, /cells) but not in the combo mode (build-grid / ocr-overlay). Garbled IPA like [teist] is now corrected to [teɪst] using the IPA dictionary, same as in the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 12:53:58 +01:00
Benjamin Admin	b0e1fbc8d6	feat: box zone artifact filter, spanning headers, parenthesis fix 1. Filter recovered single-char artifacts (!, ?, •) from box zones where they are decorative noise, not real text markers 2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that stretch across multiple columns with colored text. Merge their cells into a single spanning cell in column 0. 3. Fix missing opening parentheses: when cell text has ")" but no matching "(", prepend "(" to the text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 11:31:55 +01:00
Benjamin Admin	872b47f691	fix: filter words and color recoveries inside graphic/image regions - Load structure_result from session to get detected graphic bounds - Exclude OCR words whose center falls inside a graphic region - Exclude recovered colored text inside graphic regions - Reject color recovery regions wider than 4x median word height Fixes garbage characters (!, ?, •) in box zones and false OCR detections (N, ?) in image areas. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 11:20:07 +01:00
Benjamin Admin	bbf0a5720e	fix: require both horizontal AND vertical overlap for word dedup Previous version only checked X overlap, causing false positives for short words like "=" and "I" that appear at similar X positions in different rows. Now requires >=50% overlap in both dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:57:44 +01:00
Benjamin Admin	29d3c1caf5	fix: deduplicate overlapping words after Paddle+Tesseract merge PaddleOCR can return overlapping phrases (e.g. "von jm." and "jm. =") that produce duplicate words after splitting. Added _deduplicate_words() post-merge pass that removes words with same text at overlapping positions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:47:42 +01:00
Benjamin Admin	aae8a96aa2	fix: sort word_boxes in reading order (Y-grouped, then X-sorted) Words on the same visual line can have slightly different top values (1-6px). Sorting by (top, left) produced wrong word order in the frontend display. Now uses _group_words_into_lines to group by Y proximity first, then sort by X within each line. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:41:30 +01:00
Benjamin Admin	2b73d9beec	fix: increase color recovery occupancy padding to prevent gap artifacts Colored-pixel fragments in narrow inter-word gaps were being recovered as false characters (e.g., "!" between "lend" and "sb."), disrupting word order. Use adaptive padding based on median word height instead of fixed 4px. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:28:56 +01:00
Benjamin Admin	324f39a9cc	fix: merge inline marker columns + improve ghost edge detection 1. Add _merge_inline_marker_columns(): narrow columns (<80px) with avg word length <=2 chars (bullets, numbering) are merged into the adjacent text column. Fixes box zones getting 2 columns when bullet points are just indentation markers. 2. Improve ghost filter: check word edges (left/right/top/bottom) against border bands instead of center-only. Catches = at x=947 whose left edge touches the box border. 3. Add = and + to _GRID_GHOST_CHARS for border artifact detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 10:10:07 +01:00
Benjamin Admin	febd0a2f84	fix: border ghost filter + row overlap fix for box zones 1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts like \| sitting on box borders before row/column clustering. The tall \| (h=55) was inflating row 0's y_max, causing row overlap. 2. Fix _assign_word_to_row() to prefer closest y_center when rows overlap, instead of always returning the first matching row. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 09:54:50 +01:00
Benjamin Admin	43b1f8be58	diag: increase zone logging threshold to 60 words Box zones have 40-60 words, need to capture their diagnostics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 09:49:19 +01:00
Benjamin Admin	43dec5dd91	diag: add row-clustering logging for small/box zones Logs word positions, median height, Y tolerance, and resulting rows for zones with <= 30 words to diagnose row merging issues. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 09:45:29 +01:00
Benjamin Admin	92a52a3199	fix: apply column union when total_cols >= max (not just >) Zone 4 found 4 columns incl. page_ref, union also yields 4. The strict > check prevented union from applying to Zone 0. Changed to >= so all content zones get the merged column set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 00:14:59 +01:00
Benjamin Admin	427fecdce0	fix: union column detection across all content zones Instead of propagating columns from the largest content zone only (which missed narrow columns like page_ref), collect column split points from ALL content zones and merge them. This way a column found in any zone (e.g. page_ref at x=132 in the zone below boxes) is available everywhere. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 23:02:33 +01:00
Benjamin Admin	9fb3229270	fix: lower tertiary gap threshold for narrow margin column detection [CI: go-lint, python-lint, nodejs-lint skipped; test-go-school passed (26s); test-go-edu-search passed (27s); test-python-klausur failed (1m54s); test-python-agent-core passed (16s); test-nodejs-website passed (16s)] Reduce gap threshold from max(40, 5%) to max(30, 2%) so page_ref columns (e.g. p.55/p.57) at ~56px gap are detected as tertiary columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 22:56:03 +01:00
Benjamin Admin	91625a2646	fix: add tertiary tier for narrow margin columns (page refs, markers) [CI: go-lint, python-lint, nodejs-lint skipped; test-go-school passed (29s); test-go-edu-search passed (27s); test-python-klausur failed (1m57s); test-python-agent-core passed (16s); test-nodejs-website passed (17s)] Page references (p.55, p.57) and marker columns (!) appear in very few rows (< 12% coverage) but sit at the far left/right margin with a clear gap to the main content. Add a third detection tier that catches these narrow margin columns when they have >= 2 distinct rows and are within 15% of the content edge with >= 40px gap to the nearest main column. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 22:40:40 +01:00
Benjamin Admin	02ae6249ca	fix: propagate columns from largest content zone instead of global detection [CI: go-lint, python-lint, nodejs-lint skipped; test-go-school passed (30s); test-go-edu-search passed (31s); test-python-klausur failed (2m5s); test-python-agent-core passed (20s); test-nodejs-website passed (21s)] Global column detection diluted narrow sub-columns (page refs, markers) because they appeared in too few rows relative to the total. Instead, detect columns per zone independently, then propagate the best columns (from the content zone with the most words) to smaller content zones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 22:30:15 +01:00
Benjamin Admin	cf995f2d52	fix: global column detection across content zones in Kombi grid builder [CI: go-lint, python-lint, nodejs-lint skipped; test-go-school passed (30s); test-go-edu-search passed (29s); test-python-klausur failed (2m3s); test-python-agent-core passed (18s); test-nodejs-website passed (26s)] Content zones (above/between/below boxes) now share the same column structure: columns are detected once from ALL content-zone words, then applied to each content zone. Box zones still detect columns independently. This fixes the issue where narrow columns (page refs like p.55) were not detected in small content zones above boxes, even though the same column existed in the larger content zone below the box. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 22:04:17 +01:00
Benjamin Admin	0340204c1f	feat: box-aware column detection — exclude box content from global columns [CI: go-lint, python-lint, nodejs-lint skipped; test-go-school passed (29s); test-go-edu-search passed (28s); test-python-klausur failed (2m4s); test-python-agent-core passed (18s); test-nodejs-website passed (19s)] - Enrich column geometries with original full-page words (box-filtered) so _detect_sub_columns() finds narrow sub-columns across box boundaries - Add inline marker guard: bullet points (1., 2., •) are not split into sub-columns (minimum gap check: 1.2× word height or 20px) - Add box_rects parameter to build_grid_from_words() so that words inside boxes are excluded from X-gap column clustering - Pass box rects from zones to words_first grid builder - Add 9 tests for box-aware column detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 18:42:46 +01:00
Benjamin Admin	729ebff63c	feat: add border ghost filter + graphic detection tests + structure overlay - Add _filter_border_ghost_words() to remove OCR artefacts from box borders (vertical + horizontal edge detection, column cleanup, re-indexing) - Add 20 tests for border ghost filter (basic filtering + column cleanup) - Add 24 tests for cv_graphic_detect (color detection, word overlap, boxes) - Clean up cv_graphic_detect.py logging (per-candidate → DEBUG) - Add structure overlay layer to StepReconstruction (boxes + graphics toggle) - Show border_ghosts_removed badge in StepStructureDetection - Update MkDocs with structure detection documentation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 18:28:53 +01:00
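
The column-detection commits above (91625a2646, 9fb3229270, 427fecdce0) describe a tertiary tier for narrow margin columns and a union of column split points across content zones. The following is a minimal sketch of those rules as the commit messages state them, not the repository's actual code. Assumptions: the percentage in `max(30, 2%)` is taken relative to the content width, "within 15% of the content edge" is measured from the candidate's outer edge, and the names `Candidate`, `gap_threshold`, `is_tertiary_column`, `union_split_points`, and the 10px merge `tolerance` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical column candidate; field names are illustrative only."""
    x_min: float         # left edge of the candidate's words, in px
    x_max: float         # right edge of the candidate's words, in px
    distinct_rows: int   # number of distinct rows the candidate appears in
    gap_to_main: float   # horizontal gap to the nearest main column, in px


def gap_threshold(content_width: float, min_px: float = 30.0,
                  ratio: float = 0.02) -> float:
    """The max(30, 2%) rule; assumes the percentage refers to content width."""
    return max(min_px, ratio * content_width)


def is_tertiary_column(c: Candidate, content_left: float, content_right: float,
                       edge_ratio: float = 0.15, min_rows: int = 2) -> bool:
    """Accept a narrow margin column: >= 2 rows, near an edge, clear gap."""
    width = content_right - content_left
    near_left = (c.x_min - content_left) <= edge_ratio * width
    near_right = (content_right - c.x_max) <= edge_ratio * width
    return (c.distinct_rows >= min_rows
            and (near_left or near_right)
            and c.gap_to_main >= gap_threshold(width))


def union_split_points(zones: list[list[float]],
                       tolerance: float = 10.0) -> list[float]:
    """Merge split points from all zones; points closer than `tolerance`
    (an assumed value) collapse into the first one seen."""
    merged: list[float] = []
    for x in sorted(x for zone in zones for x in zone):
        if not merged or x - merged[-1] > tolerance:
            merged.append(x)
    return merged


# A page_ref column with a 56px gap on an assumed 1200px-wide content area.
page_ref = Candidate(x_min=100, x_max=140, distinct_rows=2, gap_to_main=56)
detected = is_tertiary_column(page_ref, content_left=100, content_right=1300)
```

Under these assumptions, the old rule `max(40, 5%)` gives a 60px threshold on a 1200px-wide content area, which rejects the ~56px gap, while `max(30, 2%)` gives 30px and accepts it. The union helper shows how a split point found in only one zone (such as x=132) stays available to every content zone.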

284 Commits