breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	4561320e0d	Fix SmartSpellChecker: preserve leading non-alpha text like "(=". The tokenizer regex only matches alphabetic characters, so text before the first word match (like "(= " in "(= I won...") was silently dropped when reassembling the corrected text. Now preserves text[:first_match_start] as a leading prefix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 23:41:33 +02:00
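A minimal sketch of the prefix-preserving reassembly described in 4561320e0d. The regex `_WORD_RE`, the function `correct_text`, and the `fix_word` callback are hypothetical names for illustration; the actual SmartSpellChecker tokenizer may differ.

```python
import re

# Assumed tokenizer: matches alphabetic runs only (hypothetical pattern).
_WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def correct_text(text, fix_word):
    """Apply fix_word to each alphabetic token and keep all other text."""
    matches = list(_WORD_RE.finditer(text))
    if not matches:
        return text
    # Leading prefix: everything before the first word match, e.g. "(= ".
    parts = [text[:matches[0].start()]]
    for i, m in enumerate(matches):
        parts.append(fix_word(m.group()))
        # Separator text up to the next match (or to the end of the string).
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append(text[m.end():next_start])
    return "".join(parts)
```

With an identity `fix_word`, the input string comes back unchanged, including the leading "(= ".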
Benjamin Admin	596864431b	Rule (a2): switch from allow-list to block-list for symbol removal. Instead of keeping only specific symbols (_KEEP_SYMBOLS), now only removes explicitly decorative symbols (_REMOVE_SYMBOLS: > < ~ \ ^ etc). All other punctuation (= ( ) ; : - etc.) is preserved by default. This is more robust: any new symbol used in textbooks will be kept unless it's in the small block-list of known decorative artifacts. Fixes: (= token still being removed on page 5 despite being in the allow-list (possibly due to Unicode variants or whitespace). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 23:34:21 +02:00
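The block-list idea from 596864431b can be sketched as follows. This is an illustration under assumptions: the name `_REMOVE_SYMBOLS` comes from the commit message, but its contents here are limited to the five symbols the message lists, and `keep_word_box` is a hypothetical helper.

```python
# Only the symbols named in the commit message; the real list may be longer.
_REMOVE_SYMBOLS = set("><~\\^")


def keep_word_box(text):
    """Return False only for boxes made up entirely of decorative symbols."""
    stripped = text.strip()
    if any(ch.isalnum() for ch in stripped):
        return True  # boxes with letters or digits are never removed by this rule
    # Symbol-only box: drop it only if every character is block-listed.
    return not all(ch in _REMOVE_SYMBOLS for ch in stripped)
```

Under this sketch, any symbol not in the block-list (such as "=" or "(=") is kept by default.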
Benjamin Admin	c8027eb7f9	Fix: preserve = ; : - and other meaningful symbols in word_boxes. Rule (a2) in Step 5i removed word_boxes with no letters/digits as "graphic OCR artifacts". This incorrectly removed = signs used as definition markers in textbooks ("film = 1. Film; 2. filmen"). Added exception list _KEEP_SYMBOLS for meaningful punctuation: = (= =) ; : - – — / + • · ( ) & * → ← ↔ The root cause: PaddleOCR returns "film = 1. Film; 2. filmen" as one block, which gets split into word_boxes ["film", "=", "1.", ...]. The "=" word_box had no alphanumeric chars and was removed as artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 23:18:35 +02:00
Benjamin Admin	ba0f659d1e	Preserve = and (= tokens in grid build and cell text cleanup. = signs are used as definition markers in textbooks ("film = 1. Film"). They were incorrectly removed by two filters: 1. grid_build_core.py Step 5j-pre: _PURE_JUNK_RE matched "=" as artifact noise. Now exempts =, (=, ;, :, - and similar meaningful punctuation tokens. 2. cv_ocr_engines.py _is_noise_tail_token: "pure non-alpha" check removed trailing = tokens. Now exempts meaningful punctuation. Fixes: "film = 1. Film; 2. filmen" losing the = sign, "(= I won and he lost.)" losing the (=. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 23:04:27 +02:00
Benjamin Admin	50bfd6e902	Fix gutter repair: don't suggest corrections for words with parentheses. Words like "probieren)" or "Englisch)" were incorrectly flagged as gutter OCR errors because the closing parenthesis wasn't stripped before dictionary lookup. The spellchecker then suggested "probierend" (replacing ) with d, edit distance 1). Two fixes: 1. Strip trailing/leading parentheses in _try_spell_fix before checking if the bare word is valid; skip correction if it is. 2. Add )( to the rstrip characters in the analysis phase so "probieren)" becomes "probieren" for the known-word check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 22:38:22 +02:00
Benjamin Admin	0599c72cc1	Fix IPA continuation: don't replace normal text with IPA. Text like "Betonung auf der 1. Silbe: profit ['profit]" was incorrectly detected as garbled IPA and replaced with generated IPA transcription of the previous row's example sentence. Added guard: if the cell text contains >=3 recognizable words (3+ letter alpha tokens), it's normal text, not garbled IPA. Garbled IPA is typically short and has no real dictionary words. Fixes: Row 13 C3 showing IPA instead of pronunciation hint text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 22:28:58 +02:00
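The guard from 0599c72cc1 could look roughly like this. The function name `looks_like_normal_text` and the token regex are assumptions; only the rule itself (at least three alphabetic tokens of 3+ letters) comes from the commit message.

```python
import re


def looks_like_normal_text(cell_text, min_words=3):
    """True when the cell holds >= min_words alphabetic tokens of 3+ letters."""
    # [^\W\d_] matches any Unicode letter (word chars minus digits/underscore).
    tokens = re.findall(r"[^\W\d_]{3,}", cell_text)
    return len(tokens) >= min_words
```

A cell such as "Betonung auf der 1. Silbe: profit ['profit]" passes the guard and would be left alone, while a short bracketed fragment does not.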
Benjamin Admin	3d3c2b30db	Add tests for unified_grid and cv_box_layout. test_unified_grid.py (10 tests): - Dominant row height calculation (regular, gaps filtered, single row) - Box classification (full-width, partial left/right, text line count) - Unified grid building (content-only, box integration, cell tagging) test_box_layout.py (13 tests): - Layout classification (header_only, flowing, bullet_list) - Line grouping by y-proximity - Flowing layout indent grouping (bullet + continuations → \n) - Row/column field completeness for GridTable compatibility Total: 66 tests passing (43 smart_spell + 13 box_layout + 10 unified) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 18:18:52 +02:00
Benjamin Admin	17f0fdb2ed	Refactor: extract _build_grid_core into grid_build_core.py + clean StepAnsicht. grid_editor_api.py: 2411 → 474 lines - Extracted _build_grid_core() (1892 lines) into grid_build_core.py - API file now only contains endpoints (build, save, get, gutter, box, unified) StepAnsicht.tsx: 212 → 112 lines - Removed useGridEditor imports (not needed for read-only spreadsheet) - Removed unified grid fetch/build (not used with multi-sheet approach) - Removed Spreadsheet/Grid toggle (only spreadsheet mode now) - Simple: fetch grid-editor data → pass to SpreadsheetView Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 08:54:55 +02:00
Benjamin Admin	c1a903537b	Unified Grid: merge all zones into single Excel-like grid. Backend (unified_grid.py): - build_unified_grid(): merges content + box zones into one zone - Dominant row height from median of content row spacings - Full-width boxes: rows integrated directly - Partial-width boxes: extra rows inserted when box has more text lines than standard rows fit (e.g., 7 lines in 5-row height) - Box-origin cells tagged with source_zone_type + box_region metadata Backend (grid_editor_api.py): - POST /sessions/{id}/build-unified-grid → persists as unified_grid_result - GET /sessions/{id}/unified-grid → retrieve persisted result Frontend: - GridEditorCell: added source_zone_type, box_region fields - GridTable: box-origin cells get tinted background + left border - StepAnsicht: split-view with original image (left) + editable unified GridTable (right). Auto-builds on first load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 23:37:55 +02:00
Benjamin Admin	b5900f1aff	Bullet indentation detection: group continuation lines into bullets. Flowing/bullet_list layout now analyzes left-edge indentation: - Lines at minimum indent = bullet start / main level - Lines indented >15px more = continuation (belongs to previous bullet) - Continuation lines merged with \n into parent bullet cell - Missing bullet markers (•) auto-added when pattern is clear Example: 7 OCR lines → 3 items (1 header + 2 bullets × 3 lines each) "German leihen" header, then two bullet groups with indented examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 16:57:16 +02:00
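The indentation rule in b5900f1aff can be sketched like this. `group_bullet_lines` and its input shape (a list of `(left_px, text)` tuples in reading order) are hypothetical; the sketch covers only the grouping step and omits the automatic insertion of missing bullet markers.

```python
def group_bullet_lines(lines, indent_tol=15):
    """Merge lines indented more than indent_tol px into the previous item."""
    if not lines:
        return []
    min_left = min(left for left, _ in lines)
    items = []
    for left, text in lines:
        if left - min_left > indent_tol and items:
            # Continuation line: append to the parent item with a newline.
            items[-1] += "\n" + text
        else:
            # Line at (or near) the minimum indent starts a new item.
            items.append(text)
    return items
```

Seven lines shaped like the commit's example (one header, then two bullets with two indented continuation lines each) collapse into three items.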
Benjamin Admin	baac98f837	Filter false-positive boxes in header/footer margins. Boxes whose vertical center falls within top/bottom 7% of image height are filtered out (page numbers, unit headers, running footers). At typical scan resolutions, 7% ≈ 2.5cm margin. Fixes: "Box 1" containing just "3" from "Unit 3" page header being incorrectly treated as an embedded box. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:38:53 +02:00
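A sketch of the margin filter from baac98f837, assuming each box is a dict with hypothetical `y` (top edge) and `h` (height) keys in pixels; the real box structure is not shown in the log.

```python
def filter_margin_boxes(boxes, image_height, margin_frac=0.07):
    """Keep only boxes whose vertical center lies outside the top/bottom margins."""
    top_limit = margin_frac * image_height
    bottom_limit = (1 - margin_frac) * image_height
    kept = []
    for box in boxes:
        center_y = box["y"] + box["h"] / 2
        if top_limit <= center_y <= bottom_limit:
            kept.append(box)
    return kept
```

On a 1000 px tall image, a box centered at y=40 (a page-header fragment) is dropped, while one centered at y=450 is kept.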
Benjamin Admin	496d34d822	Fix box empty rows: add x_min_px/x_max_px to flowing/header columns. GridTable calculates column widths from col.x_max_px - col.x_min_px. Flowing and header_only layouts were missing these fields, producing NaN widths which collapsed the CSS grid layout and showed empty rows with only row numbers visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:01:11 +02:00
Benjamin Admin	7b3e8c576d	Fix NameError: span_cells removed but still referenced in log. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:20:11 +02:00
Benjamin Admin	868f99f109	Fix colspan text + box row fields for GridTable compatibility. Colspan: use original word-block text instead of split cell texts. Prevents "euros a nd cents" from split_cross_column_words. Box rows: add is_header field (was undefined, causing GridTable rendering issues). Add y_min_px/y_max_px to header_only rows. These missing fields caused empty rows with only row numbers visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 12:08:49 +02:00
Benjamin Admin	dc25f243a4	Fix colspan: use original words before split_cross_column_words. _split_cross_column_words was destroying the colspan information by cutting word-blocks at column boundaries BEFORE _detect_colspan_cells could analyze them. Now passes original (pre-split) words to colspan detection while using split words for cell building. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:58:32 +02:00
Benjamin Admin	c62ff7cd31	Generic colspan detection for merged cells in grids and boxes. New _detect_colspan_cells() in grid_editor_helpers.py: - Runs after _build_cells() for every zone (content + box) - Detects word-blocks that extend across column boundaries - Merges affected cells into spanning_header with colspan=N - Uses column midpoints to determine which columns are covered - Works for full-page scans and box zones equally Also fixes box flowing/bullet_list row height fields (y_min_px/y_max_px). Removed duplicate spanning logic from cv_box_layout.py; it now uses the generic _detect_colspan_cells from grid_editor_helpers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 11:38:03 +02:00
Benjamin Admin	5d91698c3b	Fix box grid: row height fields + spanning cell detection. Box 3 empty rows: flowing/bullet_list rows were missing y_min_px/y_max_px fields that GridTable uses for row height calculation. Added _px and _pct variants. Box 2 spanning cells: rows with fewer word-blocks than columns (e.g., "In Britain..." spanning 2 columns) are now detected and merged into spanning_header cells. GridTable already renders spanning_header cells across the full row width. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 09:46:43 +02:00
Benjamin Admin	5fa5767c9a	Fix box column detection: use low gap_threshold for small zones. PaddleOCR returns multi-word blocks (whole phrases), so ALL inter-word gaps in small zones (boxes, ≤60 words) are column boundaries. Previous 3x-median approach produced thresholds too high to detect real columns. New approach for small zones: gap_threshold = max(median_h * 1.0, 25). This correctly detects 4 columns in "Pounds and euros" box where gaps range from 50-297px and word height is ~31px. Also includes SmartSpellChecker fixes from previous commits: - Frequency-based scoring, IPA protection, slash→l, rare-word threshold Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 07:55:29 +02:00
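The small-zone formula in 5fa5767c9a, gap_threshold = max(median_h * 1.0, 25), can be illustrated as below. The function names are hypothetical and the sample heights and gaps are illustrative values within the ranges the commit mentions, not data from the repository.

```python
from statistics import median


def small_zone_gap_threshold(word_heights):
    """Gap threshold for small zones: median word height, but at least 25 px."""
    return max(median(word_heights) * 1.0, 25)


def column_gaps(gaps, threshold):
    """Return the gaps that are wide enough to count as column boundaries."""
    return [g for g in gaps if g > threshold]
```

With word heights around 31 px, the threshold is 31, so gaps in the 50-297 px range all qualify as column boundaries.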
Benjamin Admin	693803fb7c	SmartSpellChecker: frequency scoring, IPA protection, slash→l fix. Major improvements: - Frequency-based boundary repair: always tries repair, uses word frequency product to decide (Pound sand→Pounds and: 2000x better) - IPA bracket protection: words inside [brackets] are never modified, even when brackets land in tokenizer separators - Slash→l substitution: "p/" → "pl" for italic l misread as slash - Abbreviation guard uses rare-word threshold (freq < 1e-6) instead of binary known/unknown, which prevents "Can I" → "Ca nI" while still fixing "ats th." → "at sth." - Tokenizer includes / character for slash-word detection 43 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 07:36:39 +02:00
Benjamin Admin	31089df36f	SmartSpellChecker: frequency-based boundary repair for valid word pairs. Previously, boundary repair was skipped when both words were valid dictionary words (e.g., "Pound sand", "wit hit", "done euro"). Now uses word-frequency scoring (product of bigram frequencies) to decide if the repair produces a more common word pair. Threshold: repair accepted when new pair is >5x more frequent, or when repair produces a known abbreviation. New fixes: Pound sand→Pounds and (2000x), wit hit→with it (100000x), done euro→one euro (7x). 43 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 07:00:22 +02:00
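A rough sketch of the >5x acceptance rule from 31089df36f, assuming the pair score is the product of the two word frequencies. `should_repair` and the `freq` dict are hypothetical, the frequency values used with it are made up for illustration, and the known-abbreviation shortcut from the commit is left out.

```python
def should_repair(orig_pair, new_pair, freq, min_ratio=5.0):
    """Accept a boundary repair when the new pair scores > min_ratio x the old one."""

    def score(pair):
        # Product of per-word frequencies; unknown words score 0.
        return freq.get(pair[0], 0.0) * freq.get(pair[1], 0.0)

    old, new = score(orig_pair), score(new_pair)
    if old == 0.0:
        # Original pair contains an unknown word: any known pair is better.
        return new > 0.0
    return new > min_ratio * old
```

With a toy table in which "and" is far more frequent than "sand", the repair ("pound", "sand") to ("pounds", "and") is accepted, while a repair that produces rarer words is rejected.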
Benjamin Admin	7b294f9150	Cap gap_threshold at 25% of zone_w for column detection. In small zones (boxes), intra-phrase gaps inflate the median gap, causing gap_threshold to become too large to detect real column boundaries. Cap at 25% of zone width to prevent this. Example: Box "Pounds and euros" has 4 columns at x≈148,534,751,1137 but gap_threshold was 531 (larger than the column gaps themselves). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 23:58:15 +02:00
Benjamin Admin	058eadb0e4	Fix build-box-grids: use structure_result boxes + raw OCR words. - Source boxes from structure_result (Step 7) instead of grid zones - Use raw_paddle_words (top/left/width/height) instead of grid cells - Create new box zones from all detected boxes (not just existing zones) - Sort zones by y-position for correct reading order - Include box background color metadata Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:50:28 +02:00
Benjamin Admin	5da9a550bf	Add Box-Grid-Review step (Step 11) to OCR pipeline. New pipeline step between Gutter Repair and Ground Truth that processes embedded boxes (grammar tips, exercises) independently from the main grid. Backend: - cv_box_layout.py: classify_box_layout() detects flowing/columnar/bullet_list/header_only layout types per box - build_box_zone_grid(): layout-aware grid building (single-column for flowing text, independent columns for tabular content) - POST /sessions/{id}/build-box-grids endpoint with SmartSpellChecker - Layout type overridable per box via request body Frontend: - StepBoxGridReview.tsx: shows each box with cropped image + editable GridTable. Layout type dropdown per box. Auto-builds on first load. - Auto-skip when no boxes detected on page - Pipeline steps updated: 13 steps (0-12), Ground Truth moved to 12 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 17:26:06 +02:00
Benjamin Admin	52637778b9	SmartSpellChecker: boundary repair + context split + abbreviation awareness. New features: - Boundary repair: "ats th." → "at sth." (shifted OCR word boundaries) Tries shifting 1-2 chars between adjacent words, accepts if result includes a known abbreviation or produces better dictionary matches - Context split: "anew book" → "a new book" (ambiguous word merges) Explicit allow/deny list for article+word patterns (alive, alone, etc.) - Abbreviation awareness: 120+ known abbreviations (sth, sb, adj, etc.) are now recognized as valid words, preventing false corrections - Quality gate: boundary repairs only accepted when result scores higher than original (known words + abbreviations) 40 tests passing, all edge cases covered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 15:41:17 +02:00
Benjamin Admin	f6372b8c69	Integrate SmartSpellChecker into build-grid finalization. SmartSpellChecker now runs during grid build (not just LLM review), so corrections are visible immediately in the grid editor. Language detection per column: - EN column detected via IPA signals (existing logic) - All other columns assumed German for vocab tables - Auto-detection for single/two-column layouts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 14:54:01 +02:00
Benjamin Admin	909d0729f6	Add SmartSpellChecker + refactor vocab-worksheet page.tsx. SmartSpellChecker (klausur-service): - Language-aware OCR post-correction without LLMs - Dual-dictionary heuristic for EN/DE language detection - Context-based a/I disambiguation via bigram lookup - Multi-digit substitution (sch00l→school) - Cross-language guard (don't false-correct DE words in EN column) - Umlaut correction (Schuler→Schüler, uber→über) - Integrated into spell_review_entries_sync() pipeline - 31 tests, 9ms/100 corrections Vocab-worksheet refactoring (studio-v2): - Split 2337-line page.tsx into 14 files - Custom hook useVocabWorksheet.ts (all state + logic) - 9 components in components/ directory - types.ts, constants.ts for shared definitions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 12:25:01 +02:00
Benjamin Admin	0f17eb3cd9	Fix IPA:Aus - strip all brackets before skipping IPA block. When ipa_mode=none, the entire IPA processing block was skipped, including the bracket-stripping logic. Now strips ALL square brackets from content columns BEFORE the skip, so IPA:Aus actually removes all IPA from the display. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:05:22 +02:00
Benjamin Admin	a6c5f56003	Fix IPA strip: match all square brackets, not just Unicode IPA. OCR text contains ASCII IPA approximations like [kompa'tifn] instead of Unicode [kˈɒmpətɪʃən]. The strip regex required Unicode IPA chars inside brackets and missed the ASCII ones. Now strips all [bracket] content from excluded columns since square brackets in vocab columns are always IPA. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:53:16 +02:00
Benjamin Admin	584e07eb21	Strip English IPA when mode excludes EN (nur DE / Aus). English IPA from the original OCR scan (e.g. [ˈgrænˌdæd]) was always shown because fix_cell_phonetics only ADDS/CORRECTS but never removes. Now strips IPA brackets containing Unicode IPA chars from the EN column when ipa_mode is "de" or "none". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:49:22 +02:00
Benjamin Admin	ad78e26143	Fix word-split: handle IPA brackets, contractions, and tiebreaker. 1. Strip IPA brackets [ipa] before attempting word split, so "makeadecision[dɪsˈɪʒən]" is processed as "makeadecision" 2. Handle contractions: "solet's" → split "solet" → "so let" + "'s" 3. DP tiebreaker: prefer longer first word when scores are equal ("task is" over "ta skis") Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:13:02 +02:00
Benjamin Admin	4f4e6c31fa	Fix word-split tiebreaker: prefer longer first word. "taskis" was split as "ta skis" instead of "task is" because both have the same DP score. Changed comparison from > to >= so that later candidates (with longer first words) win ties. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:05:14 +02:00
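The effect of changing > to >= in 4f4e6c31fa can be reproduced with a small dynamic-programming splitter. This is a sketch under assumptions: `split_merged`, the `vocab` set, and the squared-length scoring are illustrative choices, not the repository's actual scoring.

```python
def split_merged(token, vocab):
    """Split a merged token into vocab words; ties go to the later split point."""
    n = len(token)
    # best[i] = (score, start of last word) for the prefix token[:i], or None.
    best = [None] * (n + 1)
    best[0] = (0, 0)
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is None or token[j:i] not in vocab:
                continue
            cand = best[j][0] + (i - j) ** 2  # assumed score: squared word length
            # ">=" lets a later j (longer leading part) win when scores tie.
            if best[i] is None or cand >= best[i][0]:
                best[i] = (cand, j)
    if best[n] is None:
        return None  # no full segmentation found
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(token[j:i])
        i = j
    return " ".join(reversed(words))
```

For "taskis" with a vocabulary containing "ta", "skis", "task" and "is", both segmentations score 20 under this scoring; with >= the later candidate wins and the result is "task is".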
Benjamin Admin	7ffa4c90f9	Lower word-split threshold from 7 to 4 chars. Short merged words like "anew" (a new), "Imadea" (I made a), "makeadecision" (make a decision) were missed because the split threshold was too high. Now processes tokens >= 4 chars. English single-letter words (a, I) are already handled by the DP algorithm which allows them as valid split points. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:59:02 +02:00
Benjamin Admin	656cadbb1e	Remove page-number footers from grid, promote to metadata. Footer rows that are page numbers (digits or written-out like "two hundred and nine") are now removed from the grid entirely and promoted to the page_number metadata field. Non-page-number footer content stays as a visible footer row. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:50:20 +02:00
Benjamin Admin	757c8460c9	Detect written-out page numbers as footer rows. "two hundred and nine" (22 chars) was kept as a content row because the footer detection only accepted text ≤20 chars. Now recognizes written-out number words (English + German) as page numbers regardless of length. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:39:43 +02:00
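One way to recognize written-out page numbers independent of length is sketched below. The word lists, the German compound regex, and the name `is_page_number` are assumptions; the actual vocabulary used by the footer detection is not shown in the log.

```python
import re

EN_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
}
# German numbers are written as one compound, so match a chain of known parts.
DE_PARTS = [
    "null", "eins", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sech",
    "sieben", "sieb", "acht", "neun", "zehn", "elf", "zwölf", "zwanzig",
    "dreißig", "zig", "hundert", "tausend", "und",
]
DE_RE = re.compile("(?:%s)+" % "|".join(sorted(DE_PARTS, key=len, reverse=True)))


def is_page_number(text: str) -> bool:
    """True for digit-only text or text made up purely of number words."""
    cleaned = text.strip().lower()
    if cleaned.isdigit():
        return True
    words = [w for w in re.split(r"[\s-]+", cleaned) if w]
    content = [w for w in words if w not in ("and", "und")]
    if not content:
        return False
    if all(w in EN_WORDS for w in content):
        return True
    return all(DE_RE.fullmatch(w) is not None for w in content)
```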
Benjamin Admin	501de4374a	Keep page references as visible column cells. Step 5g was extracting page refs (p.55, p.70) as zone metadata and removing them from the cell table. Users want to see them as a separate column. Now keeps cells in place while still extracting metadata for the frontend header display. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:27:44 +02:00
Benjamin Admin	774bbc50d3	Add debug logging for empty-column-removal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:45:22 +02:00
Benjamin Admin	9ceee4e07c	Protect page references from junk-row removal. Rows containing only a page reference (p.55, S.12) were removed as "oversized stubs" (Rule 2) when their word-box height exceeded the median. Now skips Rule 2 if any word matches the page-ref pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:40:37 +02:00
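The guard in 9ceee4e07c can be pictured as follows. The regex `PAGE_REF`, the helper names, and the `factor` threshold are hypothetical; the log only states that Rule 2 fires when the word-box height exceeds the median and that page-ref rows are now exempt.

```python
import re

# Assumed page-reference shape: "p." or "S." (German "Seite") followed by digits.
PAGE_REF = re.compile(r"^[pS]\.\s?\d{1,4}$")


def row_has_page_ref(words: list[str]) -> bool:
    return any(PAGE_REF.match(word) for word in words)


def is_oversized_stub(words: list[str], box_height: float,
                      median_height: float, factor: float = 1.5) -> bool:
    """Simplified stand-in for Rule 2: a lone, unusually tall word box."""
    if row_has_page_ref(words):
        return False  # page references are never treated as junk rows
    return len(words) <= 1 and box_height > factor * median_height
```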
Benjamin Admin	f23aaaea51	Fix false header detection: skip continuation lines and mid-column cells. Single-cell rows were incorrectly detected as headings when they were actually continuation lines. Two new guards: 1. Text starting with "(" is a continuation (e.g. "(usw.)", "(TV-Serie)") 2. Single cells beyond the first two content columns are overflow lines, not headings. Real headings appear in the first columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:21:09 +02:00
Benjamin Admin	cde13c9623	Fix IPA stripping digits after headwords (Theme 1 → Theme). _insert_missing_ipa stripped "1" from "Theme 1" because it treated the digit as garbled OCR phonetics. Now treats pure digits/numbering patterns (1, 2., 3)) as delimiters that stop the garble-stripping. Also fixes _has_non_dict_trailing which incorrectly flagged "Theme 1" as having non-dictionary trailing text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:13:45 +02:00
Benjamin Admin	2e42167c73	Remove empty columns from grid zones. Columns with zero cells (e.g. from tertiary detection where the word was assigned to a neighboring column by overlap) are stripped from the final result. Remaining columns and cells are re-indexed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:04:49 +02:00
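A sketch of the strip-and-re-index step, under the assumption that columns and cells are plain dicts with hypothetical `index` and `col_index` keys and that `columns[i]["index"] == i` on input. The real grid data model is not shown in the log.

```python
def drop_empty_columns(columns: list[dict], cells: list[dict]) -> tuple[list[dict], list[dict]]:
    """Remove columns that no cell refers to and renumber the rest from 0."""
    used = sorted({cell["col_index"] for cell in cells})
    remap = {old: new for new, old in enumerate(used)}
    # Copy the dicts so the caller's input is left untouched.
    new_columns = [dict(columns[old], index=remap[old]) for old in used]
    new_cells = [dict(cell, col_index=remap[cell["col_index"]]) for cell in cells]
    return new_columns, new_cells
```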
Benjamin Admin	5eff4cf877	Fix page refs deleted as artifacts + IPA spacing for DE mode. 1. Step 5j-pre wrongly classified "p.43", "p.50" etc as artifacts (mixed digits+letters, <=5 chars). Added exception for page reference patterns (p.XX, S.XX). 2. IPA spacing regex was too narrow (only matched Unicode IPA chars). Now matches any [bracket] content >=2 chars directly after a letter, fixing German IPA like "Opa[oːpa]" → "Opa [oːpa]". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:01:25 +02:00
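The widened spacing rule from point 2 of 5eff4cf877 can be expressed with a single substitution. The regex and the name `fix_ipa_spacing` are an assumed reconstruction from the commit text, not the project's actual pattern.

```python
import re

# A "[" directly after a letter, followed by at least two non-"]" characters and "]".
GLUED_BRACKET = re.compile(r"(?<=[^\W\d_])\[(?=[^\]]{2,}\])")


def fix_ipa_spacing(text: str) -> str:
    """Insert a space between a word and a bracket group glued to it."""
    return GLUED_BRACKET.sub(" [", text)
```

Because the check no longer depends on specific IPA characters, German transcriptions made of ordinary letters are covered as well, while one-character brackets such as "a[1]" stay untouched.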
Benjamin Admin	7f4b8757ff	Fix IPA spacing + add zone debug logging for marker column issue. 1. Ensure space before IPA brackets in cell text: "word[ipa]" → "word [ipa]" Applied as final cleanup in grid-build finalization. 2. Add debug logging for zone-word assignment to diagnose why marker column cells are empty despite correct column detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:51:52 +02:00
Benjamin Admin	7263328edb	Fix marker column detection: remove min-rows requirement. Words to the left of the first detected column boundary must always form their own column, regardless of how few rows they appear in. Previously required 4+ distinct rows for tertiary (margin) columns, which missed page references like p.62, p.63, p.64 (only 3 rows). Now any cluster at the left/right margin with a clear gap to the nearest significant column qualifies as its own column. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:24:25 +02:00
Benjamin Admin	00f7a7154c	Fix left-side gutter detection: find peak instead of scanning from edge. Left-side book fold shadows have a V-shape: brightness dips from the edge toward a peak at ~5-10% of width, then rises again. The previous algorithm scanned from the edge inward and immediately found a low dark fraction (0.13 at x=0), missing the gutter entirely. Now finds the PEAK of the dark fraction profile first, then scans from that peak toward the page center to find the transition point. Works for both V-shaped left gutters and edge-darkening right gutters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 16:52:23 +02:00
Benjamin Admin	9c5e950c99	Fix multi-page PDF upload: include session_id for first page. The frontend expects session_id in the upload response, but multi-page PDFs returned only document_group_id + pages[]. Now includes session_id pointing to the first page for backwards compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 16:26:25 +02:00
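The response shape after 9c5e950c99 might look like the sketch below. The top-level keys `document_group_id`, `pages`, and `session_id` are taken from the commit messages; the per-page fields and the helper name are assumptions.

```python
def build_upload_response(document_group_id: str, page_session_ids: list[str]) -> dict:
    """Build the upload response for a PDF that produced one session per page."""
    pages = [
        {"page": number, "session_id": session_id}  # hypothetical per-page fields
        for number, session_id in enumerate(page_session_ids, start=1)
    ]
    return {
        "document_group_id": document_group_id,
        "pages": pages,
        # Kept for older frontends that only read a single session_id.
        "session_id": page_session_ids[0],
    }
```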
Benjamin Admin	6e494a43ab	Apply merged-word splitting to grid-editor cells. The spell review only runs on vocab entries, but the OCR pipeline's grid-editor cells also contain merged words (e.g. "atmyschool"). Now splits merged words directly in the grid-build finalization step, right before returning the result. Uses the same _try_split_merged_word() dictionary-based DP algorithm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:52:00 +02:00
Benjamin Admin	53b0d77853	Multi-page PDF support: create one session per page. When uploading a PDF with > 1 page to the OCR pipeline, each page now gets its own session (grouped by document_group_id). Previously only page 1 was processed. The response includes a pages array with all session IDs so the frontend can navigate between them. Single-page PDFs and images continue to work as before. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:39:48 +02:00
Benjamin Admin	aed0edbf6d	Fix word split scoring: prefer longer words over short ones. "Comeon" was split as "Com eon" instead of "Come on" because both are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker: "come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:14:23 +02:00
Benjamin Admin	9e2c301723	Add merged-word splitting to OCR spell review. OCR often merges adjacent words when spacing is tight, e.g. "atmyschool" → "at my school", "goodidea" → "good idea". New _try_split_merged_word() uses dynamic programming to find the shortest sequence of dictionary words covering the token. Integrated as step 5 in _spell_fix_token() after general spell correction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:11:16 +02:00
Benjamin Admin	633e301bfd	Add camera gutter detection via vertical continuity analysis. Scanner shadow detection (range > 40, darkest < 180) fails on camera book scans where the gutter shadow is subtle (range ~25, darkest ~214). New _detect_gutter_continuity() detects gutters by their unique property: the shadow runs continuously from top to bottom without interruption. Divides the image into horizontal strips and checks what fraction of strips are darker than the page median at each column. A gutter column has >= 75% of strips darker. The transition point where the smoothed dark fraction drops below 50% marks the crop boundary. Integrated as fallback between scanner shadow and binary projection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 13:58:14 +02:00
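The strip-based continuity measure can be sketched without any imaging library by treating the page as a list of grayscale rows. This is not the project's `_detect_gutter_continuity()`: the function name, `n_strips`, and the list-of-lists image format are assumptions, and smoothing is omitted.

```python
from statistics import median


def dark_fraction_profile(gray: list[list[int]], n_strips: int = 4) -> list[float]:
    """Per column: fraction of horizontal strips whose mean is darker than the page median."""
    height, width = len(gray), len(gray[0])
    page_median = median(value for row in gray for value in row)
    strip_height = height // n_strips  # assumes height >= n_strips
    profile = []
    for x in range(width):
        dark = 0
        for s in range(n_strips):
            rows = gray[s * strip_height:(s + 1) * strip_height]
            strip_mean = sum(row[x] for row in rows) / len(rows)
            if strip_mean < page_median:
                dark += 1
        profile.append(dark / n_strips)
    return profile
```

Columns with a value of 0.75 or higher would count as gutter candidates under the rule in the commit message. Text columns are dark only in some strips, so they score lower than a shadow that runs the full page height.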


427 Commits