breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	be86a7d14d	fix: preserve pipe syllable dividers + detect alphabet sidebar columns 1. Pipe divider fix: Changed OCR char-confusion regex so \| between letters (Ka\|me\|rad) is NOT converted to I. Only standalone/ word-boundary pipes are converted (\|ch → Ich, \| want → I want). 2. Alphabet sidebar detection improvements: - _filter_decorative_margin() now considers 2-char words (OCR reads "Aa", "Bb" from sidebars), lowered min strip from 8→6 - _filter_border_strip_words() lowered decorative threshold from 50%→45% - New step 4f: grid-level thin-edge-column filter as safety net — removes edge columns with <35% fill rate and >60% short text Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:52:11 +01:00
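The pipe rule from be86a7d14d can be sketched as follows. This is a minimal illustration, not the repository's regex: the function name `fix_pipe_confusion` and the exact letter class are assumptions. It keeps a pipe that follows a letter (syllable divider) and converts a pipe at a word boundary to "I".

```python
import re

# Letters that may flank a syllable divider (ASCII plus German umlauts and ß).
# The exact character class is an assumption for this sketch.
_LETTER = "A-Za-zÄÖÜäöüß"

# A pipe that is NOT preceded by a letter and is followed by a letter,
# whitespace, or end of string is treated as a misread capital "I".
_PIPE_AS_I = re.compile(rf"(?<![{_LETTER}])\|(?=[{_LETTER}]|\s|$)")


def fix_pipe_confusion(text: str) -> str:
    """Convert word-boundary pipes to "I"; keep pipes that follow a letter."""
    return _PIPE_AS_I.sub("I", text)


print(fix_pipe_confusion("Ka|me|rad"))  # syllable dividers are kept
print(fix_pipe_confusion("|ch"))        # leading pipe becomes "I"
print(fix_pipe_confusion("| want"))     # standalone pipe becomes "I"
```

The negative lookbehind is what protects dividers: any pipe directly after a letter is left alone, so only pipes that start a token are rewritten.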
Benjamin Admin	19a5f69272	fix: make Grid Editor vertically scrollable so all rows are visible The right panel (grid area) had no vertical overflow handling, causing the last ~5 rows to be clipped and invisible. Added overflow-y-auto with max-height 80vh, and removed overflow-hidden from the GridTable wrapper that was cutting off content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:33:52 +01:00
Benjamin Admin	ea09fc75df	fix: resolve circular import with lazy import for _build_reference_snapshot Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:18:21 +01:00
Benjamin Admin	410d36f3de	feat: save automatic grid snapshot before manual edits for GT comparison - build-grid now saves the automatic OCR result as ground_truth.auto_grid_snapshot - mark-ground-truth includes a correction_diff comparing auto vs corrected - New endpoint GET /correction-diff returns detailed diff with per-col_type accuracy breakdown (english, german, ipa, etc.) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 13:16:44 +01:00
Benjamin Admin	72ce4420cb	fix: advance uiStep past skipped orientation for page-split sub-sessions Page-split sub-sessions (current_step=2) had orientation marked as skipped but uiStep remained at 0 (orientation step), causing StepOrientation to render for a sub-session that has no orientation data. Now advances to uiStep=1 (deskew) when orientation is skipped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 12:59:36 +01:00
Benjamin Admin	63dfb4d06f	fix: replace reset useEffects with key prop for step component remount The reset useEffects in StepOrientation/Deskew/Dewarp/Crop were clearing orientationResult when sessionId changed (e.g. during handleOrientationComplete), causing the right side of ImageCompareView to show nothing. Using key={sessionId} on the step components instead forces React to remount with fresh state when switching sessions, without interfering with the upload/orientation flow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 12:20:50 +01:00
Benjamin Admin	08a91ba2be	Fix sub-session tab switching: reset step state on sessionId change Step components (Deskew, Dewarp, Crop, Orientation) had local state guards that prevented reloading when sessionId changed via sub-session tab clicks. Added useEffect reset hooks that clear all local state when sessionId changes, allowing the component to properly reload the new session's data. Also renamed "Box N" to "Seite N" in BoxSessionTabs per user feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 12:04:23 +01:00
Benjamin Admin	49a36364a8	Add double-page split support to OCR Overlay (Kombi 7 Schritte) The page-split detection was only implemented in the regular pipeline page but not in the OCR Overlay page where the user actually tests with Kombi mode. Now the overlay page has full sub-session support: - openSession: handles sub_sessions, parent_session_id, skip logic for page-split vs crop-based sub-sessions, preserves current mode - handleOrientationComplete: async, fetches API to detect sub-sessions - BoxSessionTabs: shown between stepper and step content - handleNext: returns to parent after sub-session completion - handleSessionChange/handleBoxSessionsCreated: session switching Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 11:48:26 +01:00
Benjamin Admin	14fd8e0b1e	Fix page-split: fetch sub-sessions from API instead of React state handleOrientationComplete was checking subSessions from React state, but due to batching the state was still empty when the user clicked "Seiten verarbeiten". Now fetches session data directly from the API to reliably detect sub-sessions and auto-open the first one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 11:22:15 +01:00
Benjamin Admin	247b79674d	Add double-page spread detection to frontend pipeline After orientation detection, the frontend now automatically calls the page-split endpoint. When a double-page book spread is detected, two sub-sessions are created and each goes through the full pipeline (deskew/dewarp/crop) independently — essential because each page of a spread tilts differently due to the spine. Frontend changes: - StepOrientation: calls POST /page-split after orientation, shows split info ("Doppelseite erkannt"), notifies parent of sub-sessions - page.tsx: distinguishes page-split sub-sessions (current_step < 5) from crop-based sub-sessions (current_step >= 5). Page-split subs only skip orientation, not deskew/dewarp/crop. - page.tsx: handleOrientationComplete opens first sub-session when page-split was detected Backend changes (orientation_crop_api.py): - page-split endpoint falls back to original image when orientation rotated a landscape spread to portrait - start_step parameter: 1 if split from original, 2 if from oriented Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 11:09:44 +01:00
Benjamin Admin	40815dafd1	feat(ocr-pipeline): add page-split endpoint for double-page book spreads Each page of a double-page scan tilts differently due to the book spine. The new POST /page-split endpoint detects spreads after orientation and creates sub-sessions that go through the full pipeline (deskew, dewarp, crop, etc.) individually, so each page gets its own deskew correction. Also fixes border-strip filter incorrectly removing German translation words by adding a decorative-strip validation check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 10:53:06 +01:00
Benjamin Admin	2a21127f01	fix(ocr-pipeline): improve page crop spine detection and cell assignment 1. page_crop: Score all dark runs by center-proximity × darkness × narrowness instead of picking the widest. Fixes ad810209 where a wide dark area at 35% was chosen over the actual spine at 50%. 2. cv_words_first: Replace x-center-only word→column assignment with overlap-based three-pass strategy (overlap → midpoint-range → nearest). Fixes truncated German translations like "Schal" instead of "Schal - die Schals" in session 079cd0d9. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 09:23:30 +01:00
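The three-pass word→column assignment named in 2a21127f01 (overlap → midpoint-range → nearest) might look roughly like this. It is a sketch under stated assumptions, not the code in cv_words_first: the `Column` type, the function name `assign_column`, and the precise meaning of "midpoint-range" (the band between the midpoints of neighbouring column gaps) are all hypothetical, and columns are assumed to be sorted left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    left: float
    right: float

    @property
    def center(self) -> float:
        return (self.left + self.right) / 2


def assign_column(word_left: float, word_right: float, columns: List[Column]) -> int:
    """Return the index of the column a word belongs to (columns sorted by x)."""
    if not columns:
        raise ValueError("at least one column is required")

    # Pass 1: pick the column with the largest positive horizontal overlap.
    overlaps = [min(word_right, c.right) - max(word_left, c.left) for c in columns]
    best = max(range(len(columns)), key=lambda i: overlaps[i])
    if overlaps[best] > 0:
        return best

    # Pass 2: the word centre falls into a band bounded by the midpoints of
    # the gaps to the neighbouring columns (outer bands end at column edges).
    center = (word_left + word_right) / 2
    for i, c in enumerate(columns):
        lo = (columns[i - 1].right + c.left) / 2 if i > 0 else c.left
        hi = (c.right + columns[i + 1].left) / 2 if i < len(columns) - 1 else c.right
        if lo <= center < hi:
            return i

    # Pass 3: fall back to the column whose centre is nearest.
    return min(range(len(columns)), key=lambda i: abs(center - columns[i].center))


cols = [Column(0, 100), Column(120, 300)]
print(assign_column(90, 180, cols))   # overlaps both, more with the second
print(assign_column(101, 109, cols))  # sits in the gap, closer to the first
print(assign_column(310, 330, cols))  # right of all columns, nearest wins
```

Using overlap first means a long phrase that starts near a column boundary is judged by where most of its width lies rather than by its centre alone.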
Benjamin Admin	9d34c5201e	feat(grid-editor): add manual cell color control via right-click menu Users can now right-click any cell to set text color (red, green, blue, orange, purple, black) or remove the color bar without changing text. A "reset" option restores the OCR-detected color. This enables accurate Ground Truth marking when OCR assigns colors to wrong cells. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 08:51:18 +01:00
Benjamin Admin	d54814fa70	feat: color bar respects edits + column pattern auto-correction - Color bar (red/colored indicator) now only shows when word_boxes text still matches the cell text — editing the cell hides stale colors - New "Auto-Korrektur" button: detects dominant prefix+number patterns per column (e.g. p.70, p.71) and completes partial entries (.65 → p.65) — requires 3+ matching entries before correcting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 08:38:11 +01:00
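The "Auto-Korrektur" idea from d54814fa70 can be illustrated with a short sketch. The real feature lives in the frontend; the Python below only mirrors the described behaviour, and the function name `autocorrect_column`, the two regexes, and the `min_support` parameter are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List

# Complete entry: alphabetic prefix, dot, number (e.g. "p.70").
_FULL = re.compile(r"^([A-Za-z]+\.)(\d+)$")
# Partial entry: the prefix letters are missing (e.g. ".65").
_PARTIAL = re.compile(r"^\.(\d+)$")


def autocorrect_column(values: List[str], min_support: int = 3) -> List[str]:
    """Complete partial entries using the column's dominant prefix."""
    prefixes = Counter(
        m.group(1) for v in values if (m := _FULL.match(v.strip()))
    )
    if not prefixes:
        return list(values)
    prefix, count = prefixes.most_common(1)[0]
    # Do nothing unless enough complete entries agree on the prefix.
    if count < min_support:
        return list(values)
    corrected = []
    for v in values:
        m = _PARTIAL.match(v.strip())
        corrected.append(f"{prefix}{m.group(1)}" if m else v)
    return corrected


print(autocorrect_column(["p.70", "p.71", "p.72", ".65", "Hund"]))
print(autocorrect_column(["p.70", "p.71", ".65"]))  # too little support
```

The support threshold keeps a couple of coincidental matches from rewriting a column that does not actually follow the pattern.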
Benjamin Admin	d6f4944bcc	fix: remove maxHeight limit on grid editor — shows all rows Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 08:24:50 +01:00
Benjamin Admin	ee0d9c881e	fix: column resize handle now accessible above add/delete buttons Resize handle: wider (9px), z-40 (above z-30 buttons). Add-column button moved to bottom-right corner to avoid overlap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 08:20:04 +01:00
Benjamin Admin	65f4ce1947	feat: ImageLayoutEditor, arrow-key nav, multi-select bold, wider columns - New ImageLayoutEditor: SVG overlay on original scan with draggable column dividers, horizontal guidelines (margins/header/footer), double-click to add columns, x-button to delete - GridTable: MIN_COL_WIDTH 40→80px for better readability - Arrow up/down keys navigate between rows in the grid editor - Ctrl+Click for multi-cell selection, Ctrl+B to toggle bold on selection - getAdjacentCell works for cells that don't exist yet (new rows/cols) - deleteColumn now merges x-boundaries correctly - Session restore fix: grid_editor_result/structure_result in session GET - Footer row 3-state cycle, auto-create cells for empty footer rows - Grid save/build/GT-mark now advance current_step=11 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 07:45:39 +01:00
Benjamin Admin	4e668660a7	feat: add Woerterbuch category + column add/delete in grid editor - New document category "Woerterbuch" (frontend type + backend validation) - Column delete: hover column header → red "x" button (with confirmation) - Column add: hover column header → "+" button inserts after that column - Both operations support undo/redo, update cell IDs and summary - Available in both GridEditor and StepGridReview (Kombi last step) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 16:27:12 +01:00
Benjamin Admin	7a6eadde8b	feat: integrate Ground Truth review into Kombi Pipeline last step - New StepGridReview component: split-view (scan image left, grid right), confidence stats, row-accept buttons, zoom controls - Kombi Pipeline case 6 now uses StepGridReview instead of plain GridEditor - Kombi step label changed to "Review & GT" - Ground Truth queue page simplified to overview/navigation only (links to Kombi pipeline for actual review work) - Deep-link support: /ai/ocr-overlay?session=xxx&mode=kombi Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 15:04:23 +01:00
Benjamin Admin	4e809c3860	fix: ground-truth crash on col_type + remove AIToolsSidebarResponsive from model-management - Ground-truth: zone.columns use 'label' not 'col_type' — calling .replace() on undefined crashed the page after grid data loaded - Model-management: same AIToolsSidebarResponsive wrapper bug as the other pages — does not render children Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 10:14:02 +01:00
Benjamin Admin	dccbb909bc	fix: remove AIToolsSidebarResponsive wrapper from ground-truth and regression pages AIToolsSidebarResponsive does not accept children — it renders only a sidebar nav. Using it as a wrapper caused page content to never render. Replaced with plain div, matching the pattern used by ocr-pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 09:57:52 +01:00
Benjamin Admin	be7f5f1872	feat: Sprint 2 — TrOCR ONNX, PP-DocLayout, Model Management D2: TrOCR ONNX export script (printed + handwritten, int8 quantization) D3: PP-DocLayout ONNX export script (download or Docker-based conversion) B3: Model Management admin page (PyTorch vs ONNX status, benchmarks, config) A4: TrOCR ONNX service with runtime routing (auto/pytorch/onnx via TROCR_BACKEND) A5: PP-DocLayout ONNX detection with OpenCV fallback (via GRAPHIC_DETECT_BACKEND) B4: Structure Detection UI toggle (OpenCV vs PP-DocLayout) with class color coding C3: TrOCR-ONNX.md documentation C4: OCR-Pipeline.md ONNX section added C5: mkdocs.yml nav updated, optimum added to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 09:53:02 +01:00
Benjamin Admin	c695b659fb	fix: PagePurpose props on ground-truth and regression pages Both pages passed `moduleId` which is not a valid prop for PagePurpose. The component expects explicit title/purpose/audience — calling audience.join() on undefined caused the client-side crash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 09:43:36 +01:00
Benjamin Admin	a1e079b911	feat: Sprint 1 — IPA hardening, regression framework, ground-truth review Track A (Backend): - Compound word IPA decomposition (schoolbag→school+bag) - Trailing garbled IPA fragment removal after brackets (R21 fix) - Regression runner with DB persistence, history endpoints - Page crop determinism verified with tests Track B (Frontend): - OCR Regression dashboard (/ai/ocr-regression) - Ground Truth Review workflow (/ai/ocr-ground-truth) with split-view, confidence highlighting, inline edit, batch mark, progress tracking Track C (Docs): - OCR-Pipeline.md v5.0 (Steps 5e-5h) - Regression testing guide - mkdocs.yml nav update Track D (Infra): - TrOCR baseline benchmark script - run-regression.sh shell script - Migration 008: regression_runs table Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 09:21:27 +01:00
Benjamin Admin	f5d5d6c59c	docs: add Vision, Roadmap, and Hardware strategy to MkDocs Add three new Projekt documentation pages covering product vision (offline-first desktop app for teachers), 6-phase development roadmap, and 3-tier hardware strategy with distribution plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 08:54:22 +01:00
Benjamin Admin	4a44ad7986	fix: hard-filter OCR words inside detected graphic regions Run detect_graphic_elements() in the grid pipeline after image loading and remove ALL words whose centroids fall inside detected graphic regions, regardless of confidence. Previously only low-confidence words (conf < 50) were removed, letting artifacts like "Tr", "Su" survive. Changes: - grid_editor_api.py: Import and call detect_graphic_elements() at Step 3a, passing only significant words (len >= 3) to avoid short artifacts fooling the text-vs-graphic heuristic. Hard-filter all words in graphic regions. - cv_graphic_detect.py: Lower density threshold from 20% to 5% for large regions (>100x80px) — photos/illustrations have low color saturation. Raise page-spanning limit from 50% to 60% width/height. Tested: 5 ground-truth sessions pass regression (079cd0d9, d8533a2c, 2838c7a7, 4233d7e3, 5997b635). Session 5997 now detects 2 graphic regions and removes 29 artifact words including "Tr" and "Su". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 10:18:23 +01:00
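The centroid-based hard filter described in 4a44ad7986 could be sketched as below. The dictionary keys (`left`, `top`, `width`, `height` for words; `x`, `y`, `w`, `h` for regions) and the function name `drop_words_in_regions` are assumptions for illustration; the output format of detect_graphic_elements() is not shown in the log.

```python
from typing import Dict, List


def _centroid_inside(word: Dict, region: Dict) -> bool:
    """True if the word's bounding-box centre lies inside the region."""
    cx = word["left"] + word["width"] / 2
    cy = word["top"] + word["height"] / 2
    return (region["x"] <= cx <= region["x"] + region["w"]
            and region["y"] <= cy <= region["y"] + region["h"])


def drop_words_in_regions(words: List[Dict], regions: List[Dict]) -> List[Dict]:
    """Remove every word centred in a graphic region, regardless of confidence."""
    return [w for w in words
            if not any(_centroid_inside(w, r) for r in regions)]


words = [
    {"text": "Tr", "left": 410, "top": 210, "width": 20, "height": 12, "conf": 91},
    {"text": "Zuschuss", "left": 50, "top": 210, "width": 90, "height": 14, "conf": 96},
]
regions = [{"x": 400, "y": 200, "w": 150, "h": 120}]
print([w["text"] for w in drop_words_in_regions(words, regions)])
```

Note that the high-confidence "Tr" is still dropped: confidence plays no role once the centroid is inside a region.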
Benjamin Admin	7b3319be2e	fix: merge syllable-split word_boxes + keep dictionary guide words OCR splits words at syllable marks into overlapping word_boxes (e.g. "zu" + "tiefst" with 52% x-overlap). Step 5i previously removed the lower-confidence box, losing the prefix. Now: when both boxes are alphabetic text with 20-75% overlap, MERGE them into one word_box ("zutiefst") instead of removing. Also relaxed artifact cell filter: 2-char alphabetic text like "Zw" (dictionary guide word) is no longer removed. Only non-alphabetic short text like "a=" is filtered. Results for session 5997: "tiefst"→"zutiefst", "zu"→"zuständig", "Zu die Zuschüsse"→"Zuschuss, die Zuschüsse", "Zw" restored. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 08:21:00 +01:00
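A minimal sketch of the merge rule from 7b3319be2e follows. The box fields, the function names, and in particular the definition of the overlap ratio (overlap width divided by the narrower box's width) are assumptions; the log only states the 20-75% window and the alphabetic requirement.

```python
from typing import Dict, Optional


def x_overlap_ratio(a: Dict, b: Dict) -> float:
    """Horizontal overlap relative to the narrower box (assumed definition)."""
    overlap = min(a["left"] + a["width"], b["left"] + b["width"]) - max(a["left"], b["left"])
    narrower = min(a["width"], b["width"])
    return max(overlap, 0) / narrower if narrower > 0 else 0.0


def merge_if_syllable_split(a: Dict, b: Dict,
                            lo: float = 0.20, hi: float = 0.75) -> Optional[Dict]:
    """Merge two alphabetic boxes whose x-overlap lies within [lo, hi]."""
    if not (a["text"].isalpha() and b["text"].isalpha()):
        return None
    if not lo <= x_overlap_ratio(a, b) <= hi:
        return None
    first, second = sorted((a, b), key=lambda w: w["left"])
    left = first["left"]
    right = max(first["left"] + first["width"], second["left"] + second["width"])
    return {"text": first["text"] + second["text"], "left": left, "width": right - left}


zu = {"text": "zu", "left": 100, "width": 25}
tiefst = {"text": "tiefst", "left": 112, "width": 60}
print(x_overlap_ratio(zu, tiefst))          # 13 px overlap on a 25 px box
print(merge_if_syllable_split(zu, tiefst))  # merged text "zutiefst"
```

Boxes outside the window, or with non-alphabetic text such as "a=", return `None` and would be left to the existing removal rules.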
Benjamin Admin	882b177fc3	fix: remove image-area artifacts + fix heading false positive for dictionary entries Three fixes for dictionary page session 5997: 1. Heading detection: column_1 cells with article words (die/der/das) now count as content cells, preventing "die Zuschrift, die Zuschriften" from being falsely merged into a spanning heading cell. 2. Step 5j-pre: new artifact cell filter removes short garbled text from OCR on image areas (e.g. "7 EN", "Tr", "\\", "PEE", "a="). Cells survive earlier filters because their rows have real content in other columns. Also cleans up empty rows after removal. 3. Footer "PEE" auto-fixed: artifact filter removes the noise cell, empty row gets cleaned up, footer detection no longer sees it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 07:59:24 +01:00
Benjamin Admin	1fae39dbb8	fix: lower secondary column threshold + strip pipe chars from word_boxes Dictionary pages have 2 dictionary columns, each with article + headword sub-columns. The right article column (die/der at x≈626) had only 14.3% row coverage — below the 20% secondary threshold. Lowered to 12% so dictionary article columns qualify. Also strip pipe characters from individual word_box text (not just cell text) to remove OCR syllable separation marks (e.g. "zu\|trau\|en" → "zutrauen"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 07:44:03 +01:00
Benjamin Admin	46c8c28d34	fix: border strip pre-filter + 3-column detection for vocabulary tables The border strip filter (Step 4e) used the LARGEST x-gap which incorrectly removed base words along with edge artifacts. Now uses a two-stage approach: 1. _filter_border_strip_words() pre-filters raw words BEFORE column detection, scanning from the page edge inward to find the FIRST significant gap (>30px) 2. Step 4e runs as fallback only when pre-filter didn't apply Session 4233 now correctly detects 3 columns (base word \| oder \| synonyms) instead of 2. Threshold raised from 15% to 20% to handle pages with many edge artifacts. All 4 ground-truth sessions pass regression. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 21:01:43 +01:00
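The edge-inward pre-filter from 46c8c28d34 might be approximated like this. It is a sketch only: it handles the left page edge alone, the box fields and the name `filter_left_border_strip` are hypothetical, and the real `_filter_border_strip_words()` may differ in how it measures gaps and cluster size.

```python
from typing import Dict, List


def filter_left_border_strip(words: List[Dict], min_gap: float = 30,
                             max_fraction: float = 0.20) -> List[Dict]:
    """Drop a small word cluster at the left edge if the FIRST gap wider than
    min_gap separates it from the main content."""
    if len(words) < 2:
        return list(words)
    ordered = sorted(words, key=lambda w: w["left"])
    reach = ordered[0]["left"] + ordered[0]["width"]  # right-most x seen so far
    for i in range(1, len(ordered)):
        gap = ordered[i]["left"] - reach
        if gap > min_gap:
            # Only the first significant gap is considered.
            if i / len(ordered) <= max_fraction:
                return ordered[i:]
            return list(words)
        reach = max(reach, ordered[i]["left"] + ordered[i]["width"])
    return list(words)


strip = [{"text": t, "left": x, "width": 15} for t, x in (("A", 20), ("b", 40))]
content = [{"text": f"w{i}", "left": 179 + (i % 2) * 51, "width": 60} for i in range(10)]
kept = filter_left_border_strip(strip + content)
print(len(kept), min(w["left"] for w in kept))
```

Scanning from the edge and stopping at the first significant gap is what prevents a larger gap further inside the page (for example between two content columns) from being mistaken for the border.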
Benjamin Admin	4000110501	fix: extend tiny symbol filter to all non-black colors, raise area to 200 Step 5i rule (a) only caught blue tiny symbols. Graphic fragments from page illustrations (e.g. orange quote mark from man illustration) were missed. Now filters any non-black colored word_box with area < 200 and confidence < 85. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 18:05:31 +01:00
Benjamin Admin	2acf8696bf	fix: correct border strip test data to avoid false internal gaps Content word_boxes in test used x-spacing (i%3)100 which created internal gaps larger than the border-to-content gap. Changed to (i%2)51 so content words overlap and the border gap remains dominant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 17:24:33 +01:00
Benjamin Admin	c0e1118870	feat: detect and remove page-border decoration strip artifacts (Step 4e) Textbooks with decorative alphabet strips along page edges produce OCR artifacts (scattered colored letters at x<150 while real content starts at x>=179). Step 4e detects a significant x-gap (>30px) between a small cluster (<15% of total word_boxes) near the page edge and the main content, then removes the border-strip word_boxes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 17:20:45 +01:00
Benjamin Admin	f31a7175a2	fix: normalize word_box order to reading order for frontend display (Step 5j) The frontend renders colored cells from the word_boxes array order, not from cell.text. After post-processing steps (5i bullet removal etc), word_boxes could remain in their original insertion order instead of left-to-right reading order. Step 5j now explicitly sorts word_boxes using _group_words_into_lines before the result is built. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 19:21:37 +01:00
Benjamin Admin	bacbfd88f1	Fix word ordering in cell text rebuild (Steps 4c, 4d, 5i) Cell text was rebuilt using naive (top, left) sorting after removing word_boxes in Steps 4c/4d/5i. This produced wrong word order when words on the same visual line had slightly different top values (1-6px). Now uses _words_to_reading_order_text() which groups words into visual lines by y-tolerance before sorting by x within each line, matching the initial cell text construction in _build_cells. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:45:33 +01:00
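The difference between naive (top, left) sorting and line-grouped reading order, as described in bacbfd88f1, can be shown with a small sketch. The function `reading_order_text` and its `y_tol` default are illustrative assumptions and are not the repository's `_words_to_reading_order_text()`.

```python
from typing import Dict, List


def reading_order_text(words: List[Dict], y_tol: float = 8) -> str:
    """Group words into visual lines by y-tolerance, then sort each line by x."""
    lines: List[List[Dict]] = []
    for w in sorted(words, key=lambda w: w["top"]):
        # Compare against the first word of the current line (the line anchor).
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return " ".join(
        w["text"] for line in lines for w in sorted(line, key=lambda w: w["left"])
    )


words = [
    {"text": "Schal", "top": 100, "left": 60},
    {"text": "die", "top": 103, "left": 10},   # same visual line, 3 px lower
    {"text": "Schals", "top": 130, "left": 10},
]
naive = " ".join(w["text"] for w in sorted(words, key=lambda w: (w["top"], w["left"])))
print(naive)                      # "Schal die Schals": wrong order on line one
print(reading_order_text(words))  # "die Schal Schals"
```

A few pixels of vertical jitter are enough to flip the naive order, which is why grouping into lines has to happen before the x-sort.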
Benjamin Admin	2c63beff04	Fix bullet overlap disambiguation + raise red threshold to 90 Step 5i: For word_boxes with >90% x-overlap and different text, use IPA dictionary to decide which to keep (e.g. "tightly" in dict, "fighily" not). Red threshold raised from 80 to 90 to catch remaining scanner artifacts like "tight" and "5" that were still misclassified as red. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:21:00 +01:00
Benjamin Admin	82433b4bad	Step 5i: Remove blue bullet/artifact and overlapping duplicate word_boxes Dictionary pages have small blue square bullets before entries that OCR reads as text artifacts. Three detection rules: a) Tiny blue symbols (area < 150, conf < 85): catches ©, e, * etc. b) X-overlapping word_boxes (>40%): remove lower confidence one c) Duplicate blue text with gap < 6px: remove one copy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 18:17:07 +01:00
Benjamin Admin	d889a6959e	Fix red false-positive in color detection for scanned black text Scanner artifacts on black text produce slight warm tint (hue ~0, sat ~60) that was misclassified as red. Now requires median_sat >= 80 specifically for red classification, since genuine red text always has high saturation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 17:18:44 +01:00
Benjamin Admin	bc1804ad18	Fix vsplit side-by-side rendering: invalid TypeScript type annotation Changed `typeof grid.zones[][]` to `GridZone[][]` which was causing a silent build error, preventing the vsplit zone grouping logic from being compiled into the production bundle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 17:09:52 +01:00
Benjamin Admin	45b83560fd	Vertical zone split: detect divider lines and create independent sub-zones Pages with two side-by-side vocabulary columns separated by a vertical black line are now split into independent sub-zones before row/column detection. Each sub-zone gets its own rows, preventing misalignment from different heading rhythms. - _detect_vertical_dividers(): finds pipe word_boxes at consistent x positions spanning >50% of zone height - _split_zone_at_vertical_dividers(): creates left/right PageZone objects with layout_hint and vsplit_group metadata - Column union skips vsplit zones (independent column sets) - Frontend renders vsplit zones side by side via flex layout - PageZone gets layout_hint + vsplit_group fields Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 16:38:12 +01:00
Benjamin Admin	e4fa634a63	Fix GridTable: show cell.text when it diverges from word_boxes Post-processing steps like 5h (slash-IPA conversion) modify cell.text but not individual word_boxes. The colored per-word display showed stale word_box text instead of the corrected cell text. Now falls back to the plain input when texts don't match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 15:05:10 +01:00
Benjamin Admin	76ba83eecb	Tighten tertiary column detection: require 4+ rows and 5% coverage Prevents false narrow columns from text overflow at page edges. Session 355f3c84 had a 3-row/4% tertiary cluster creating a spurious third column from right-column text overflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:50:03 +01:00
Benjamin Admin	04092a0a66	Fix Step 5h: reject grammar patterns in slash-IPA, convert trailing variants - Reject /.../ matches containing spaces, parens, or commas (e.g. sb/sth up) - Second pass converts trailing /ipa2/ after [ipa1] (double pronunciation) - Validate standalone /ipa/ at start against same reject pattern Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:40:28 +01:00
Benjamin Admin	7fafd297e7	Step 5h: convert slash-delimited IPA to bracket notation with dict lookup Dictionary-style pages print IPA between slashes (e.g. tiger /'taiga/). Step 5h detects these patterns, looks up the headword in the IPA dictionary for proper Unicode IPA, and falls back to OCR text when not found. Converts /ipa/ to [ipa] bracket notation matching the rest of the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:36:08 +01:00
Benjamin Admin	7ac09b5941	Filter pipe-character word_boxes from OCR column divider artifacts Step 4d removes "|" and "||" word_boxes that OCR produces when reading physical vertical divider lines between columns. Also strips stray pipe chars from cell text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 12:09:50 +01:00
Benjamin Admin	1f7989cfc2	Fix grammar bracket detection: split on spaces too, not just slashes _is_grammar_bracket_content now splits "no pl" into ["no", "pl"] instead of treating it as single token "no pl". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:45:35 +01:00
Benjamin Admin	ef5aed6a98	Preserve grammar annotations (pl), (no pl) and skip articles in IPA Two fixes: 1. Add pl, sg, no, also, ae, be etc. to _GRAMMAR_BRACKET_WORDS so annotations like (pl) and (no pl) are not replaced with IPA. 2. Skip articles (the, a, an) in fix_ipa_continuation_cell — they never get IPA in vocabulary books. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:42:44 +01:00
Benjamin Admin	7dc00e737a	Add footer row label (F) in grid editor, matching header (H) style Footer rows (e.g. page numbers) now show "F" in amber below the row number, mirroring the blue "H" label for headers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 11:01:14 +01:00
Benjamin Admin	a579c31ddb	Fix IPA continuation: skip words with inline IPA, recover emptied cells Three fixes: 1. fix_ipa_continuation_cell: when headword has inline IPA like "beat [bˈiːt] , beat, beaten", only generate IPA for uncovered words (beaten), not words already shown (beat). When bracket is at end like "the Highlands [ˈhaɪləndz]", return inline IPA directly. 2. Step 5d: recover garbled IPA from word_boxes when Step 5c emptied the cell text (e.g. "[n, nn]" → ""). 3. Added 2 tests for inline IPA behavior (35 total). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:31:54 +01:00
Benjamin Admin	0f9c0d2ad0	Keep footer rows in table, mark with is_footer + col_type=footer Footer rows like "two hundred and twelve" are no longer removed from the grid. Instead they stay in cells/rows and get tagged so the frontend can render them differently. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 09:08:25 +01:00
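
The Step 5h commits above (7fafd297e7, 04092a0a66) describe converting slash-delimited IPA to bracket notation, rejecting matches that contain spaces, parentheses, or commas. A minimal sketch of that rule follows. It is not the repository's code: the function name `convert_slash_ipa`, both regexes, and the plain-dict `ipa_dict` lookup are assumptions made for illustration, and the second pass for trailing variants is not covered.

```python
import re

# Hypothetical patterns, not taken from the repository.
_SLASH_SPAN = re.compile(r"/([^/]+)/")   # anything between two slashes
_REJECT = re.compile(r"[\s(),]")         # spaces, parens, commas => grammar, not IPA


def convert_slash_ipa(text, ipa_dict=None):
    """Rewrite /ipa/ as [ipa], preferring a dictionary entry for the headword."""
    ipa_dict = ipa_dict or {}

    def repl(match):
        inner = match.group(1)
        if _REJECT.search(inner):
            # Looks like a grammar pattern (e.g. "sb/sth up/..."), leave untouched.
            return match.group(0)
        # Assume the headword is the last word before the slash span.
        before = text[:match.start()].split()
        headword = before[-1].lower() if before else ""
        # Fall back to the OCR text when the headword is not in the dictionary.
        return "[" + ipa_dict.get(headword, inner) + "]"

    return _SLASH_SPAN.sub(repl, text)
```

With a dictionary entry for the headword, the looked-up IPA replaces the OCR text; without one, the OCR text is kept and only the delimiters change.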

469 Commits