breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	04fa01661c	Move IPA/syllable toggles to vocabulary tab toolbar. Dropdowns are now in the vocabulary table header (after processing), not in the worksheet settings (before processing). Changing a mode automatically reprocesses all successful pages with the new settings. Same dropdown options as the OCR pipeline grid editor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:17:14 +02:00
Benjamin Admin	bf9d24e108	Replace IPA/syllable checkboxes with full dropdowns in vocab-worksheet. Vocab worksheet now has the same IPA/syllable mode options as the OCR pipeline grid editor: Auto, nur EN, nur DE, Alle, Aus. Previously it only had on/off checkboxes mapping to auto/none. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:10:22 +02:00
Benjamin Admin	0f17eb3cd9	Fix IPA:Aus: strip all brackets before skipping IPA block. When ipa_mode=none, the entire IPA processing block was skipped, including the bracket-stripping logic. Now strips ALL square brackets from content columns BEFORE the skip, so IPA:Aus actually removes all IPA from the display. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:05:22 +02:00
Benjamin Admin	5244e10728	Fix IPA/syllable race condition: loadGrid no longer depends on buildGrid. loadGrid depended on buildGrid (for 404 fallback), which depended on ipaMode/syllableMode. Every mode change created a new loadGrid ref, triggering StepGridReview's useEffect to load the OLD saved grid, overwriting the freshly rebuilt one. Now loadGrid only depends on sessionId. The 404 fallback builds inline with current modes. Mode changes are handled exclusively by the separate rebuild useEffect. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:59:49 +02:00
Benjamin Admin	a6c5f56003	Fix IPA strip: match all square brackets, not just Unicode IPA. OCR text contains ASCII IPA approximations like [kompa'tifn] instead of Unicode [kˈɒmpətɪʃən]. The strip regex required Unicode IPA chars inside brackets and missed the ASCII ones. Now strips all [bracket] content from excluded columns since square brackets in vocab columns are always IPA. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:53:16 +02:00
Benjamin Admin	584e07eb21	Strip English IPA when mode excludes EN (nur DE / Aus). English IPA from the original OCR scan (e.g. [ˈgrænˌdæd]) was always shown because fix_cell_phonetics only ADDS/CORRECTS but never removes. Now strips IPA brackets containing Unicode IPA chars from the EN column when ipa_mode is "de" or "none". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:49:22 +02:00
Benjamin Admin	54b1c7d7d7	Fix IPA/syllable first-click not working (off-by-one in initialLoadDone). The old guard checked if grid was loaded AND set initialLoadDone in the same pass, then returned without rebuilding. This meant the first user-triggered mode change was always swallowed. Simplified to a mount-skip ref: skip exactly the first useEffect trigger (component mount), rebuild on every subsequent trigger (user changes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:40:57 +02:00
Benjamin Admin	d8a2331038	Fix IPA/syllable mode change requiring double-click. The useEffect for mode changes called buildGrid() which was a useCallback closing over stale ipaMode/syllableMode values due to React's asynchronous state batching. The first click triggered a rebuild with the OLD mode; only the second click used the new one. Now inlines the API call directly in the useEffect, reading ipaMode and syllableMode from the effect's closure which always has the current values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:32:02 +02:00
Benjamin Admin	ad78e26143	Fix word-split: handle IPA brackets, contractions, and tiebreaker. 1. Strip IPA brackets [ipa] before attempting word split, so "makeadecision[dɪsˈɪʒən]" is processed as "makeadecision". 2. Handle contractions: "solet's" → split "solet" → "so let" + "'s". 3. DP tiebreaker: prefer longer first word when scores are equal ("task is" over "ta skis"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:13:02 +02:00
Benjamin Admin	4f4e6c31fa	Fix word-split tiebreaker: prefer longer first word. "taskis" was split as "ta skis" instead of "task is" because both have the same DP score. Changed comparison from > to >= so that later candidates (with longer first words) win ties. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:05:14 +02:00
Benjamin Admin	7ffa4c90f9	Lower word-split threshold from 7 to 4 chars. Short merged words like "anew" (a new), "Imadea" (I made a), "makeadecision" (make a decision) were missed because the split threshold was too high. Now processes tokens >= 4 chars. English single-letter words (a, I) are already handled by the DP algorithm, which allows them as valid split points. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:59:02 +02:00
Benjamin Admin	656cadbb1e	Remove page-number footers from grid, promote to metadata. Footer rows that are page numbers (digits or written-out like "two hundred and nine") are now removed from the grid entirely and promoted to the page_number metadata field. Non-page-number footer content stays as a visible footer row. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:50:20 +02:00
Benjamin Admin	757c8460c9	Detect written-out page numbers as footer rows. "two hundred and nine" (22 chars) was kept as a content row because the footer detection only accepted text ≤20 chars. Now recognizes written-out number words (English + German) as page numbers regardless of length. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:39:43 +02:00
Benjamin Admin	501de4374a	Keep page references as visible column cells. Step 5g was extracting page refs (p.55, p.70) as zone metadata and removing them from the cell table. Users want to see them as a separate column. Now keeps cells in place while still extracting metadata for the frontend header display. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:27:44 +02:00
Benjamin Admin	774bbc50d3	Add debug logging for empty-column-removal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:45:22 +02:00
Benjamin Admin	9ceee4e07c	Protect page references from junk-row removal. Rows containing only a page reference (p.55, S.12) were removed as "oversized stubs" (Rule 2) when their word-box height exceeded the median. Now skips Rule 2 if any word matches the page-ref pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:40:37 +02:00
Benjamin Admin	f23aaaea51	Fix false header detection: skip continuation lines and mid-column cells. Single-cell rows were incorrectly detected as headings when they were actually continuation lines. Two new guards: 1. Text starting with "(" is a continuation (e.g. "(usw.)", "(TV-Serie)"). 2. Single cells beyond the first two content columns are overflow lines, not headings. Real headings appear in the first columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:21:09 +02:00
Benjamin Admin	cde13c9623	Fix IPA stripping digits after headwords (Theme 1 → Theme). _insert_missing_ipa stripped "1" from "Theme 1" because it treated the digit as garbled OCR phonetics. Now treats pure digits/numbering patterns (1, 2., 3)) as delimiters that stop the garble-stripping. Also fixes _has_non_dict_trailing, which incorrectly flagged "Theme 1" as having non-dictionary trailing text. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:13:45 +02:00
Benjamin Admin	2e42167c73	Remove empty columns from grid zones. Columns with zero cells (e.g. from tertiary detection where the word was assigned to a neighboring column by overlap) are stripped from the final result. Remaining columns and cells are re-indexed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:04:49 +02:00
Benjamin Admin	5eff4cf877	Fix page refs deleted as artifacts + IPA spacing for DE mode. 1. Step 5j-pre wrongly classified "p.43", "p.50" etc. as artifacts (mixed digits+letters, <=5 chars). Added exception for page reference patterns (p.XX, S.XX). 2. IPA spacing regex was too narrow (only matched Unicode IPA chars). Now matches any [bracket] content >=2 chars directly after a letter, fixing German IPA like "Opa[oːpa]" → "Opa [oːpa]". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 22:01:25 +02:00
Benjamin Admin	7f4b8757ff	Fix IPA spacing + add zone debug logging for marker column issue. 1. Ensure space before IPA brackets in cell text: "word[ipa]" → "word [ipa]". Applied as final cleanup in grid-build finalization. 2. Add debug logging for zone-word assignment to diagnose why marker column cells are empty despite correct column detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:51:52 +02:00
Benjamin Admin	7263328edb	Fix marker column detection: remove min-rows requirement. Words to the left of the first detected column boundary must always form their own column, regardless of how few rows they appear in. Previously required 4+ distinct rows for tertiary (margin) columns, which missed page references like p.62, p.63, p.64 (only 3 rows). Now any cluster at the left/right margin with a clear gap to the nearest significant column qualifies as its own column. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:24:25 +02:00
Benjamin Admin	8c482ce8dd	Fix Grid Build step: show grid-editor summary instead of word_result. The Grid Build step was showing word_result.grid_shape (from the initial OCR word clustering, often just 1 column) instead of the grid-editor summary (zone-based, with correct column/row/cell counts). Now reads summary.total_rows/total_columns/total_cells from the grid-editor result. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:01:18 +02:00
Benjamin Admin	00f7a7154c	Fix left-side gutter detection: find peak instead of scanning from edge. Left-side book fold shadows have a V-shape: brightness dips from the edge toward a peak at ~5-10% of width, then rises again. The previous algorithm scanned from the edge inward and immediately found a low dark fraction (0.13 at x=0), missing the gutter entirely. Now finds the PEAK of the dark fraction profile first, then scans from that peak toward the page center to find the transition point. Works for both V-shaped left gutters and edge-darkening right gutters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 16:52:23 +02:00
Benjamin Admin	9c5e950c99	Fix multi-page PDF upload: include session_id for first page. The frontend expects session_id in the upload response, but multi-page PDFs returned only document_group_id + pages[]. Now includes session_id pointing to the first page for backwards compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 16:26:25 +02:00
Benjamin Admin	6e494a43ab	Apply merged-word splitting to grid-editor cells. The spell review only runs on vocab entries, but the OCR pipeline's grid-editor cells also contain merged words (e.g. "atmyschool"). Now splits merged words directly in the grid-build finalization step, right before returning the result. Uses the same _try_split_merged_word() dictionary-based DP algorithm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:52:00 +02:00
Benjamin Admin	53b0d77853	Multi-page PDF support: create one session per page. When uploading a PDF with > 1 page to the OCR pipeline, each page now gets its own session (grouped by document_group_id). Previously only page 1 was processed. The response includes a pages array with all session IDs so the frontend can navigate between them. Single-page PDFs and images continue to work as before. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:39:48 +02:00
Benjamin Admin	aed0edbf6d	Fix word split scoring: prefer longer words over short ones. "Comeon" was split as "Com eon" instead of "Come on" because both are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker: "come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:14:23 +02:00
Benjamin Admin	9e2c301723	Add merged-word splitting to OCR spell review. OCR often merges adjacent words when spacing is tight, e.g. "atmyschool" → "at my school", "goodidea" → "good idea". New _try_split_merged_word() uses dynamic programming to find the shortest sequence of dictionary words covering the token. Integrated as step 5 in _spell_fix_token() after general spell correction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:11:16 +02:00
Benjamin Admin	633e301bfd	Add camera gutter detection via vertical continuity analysis. Scanner shadow detection (range > 40, darkest < 180) fails on camera book scans where the gutter shadow is subtle (range ~25, darkest ~214). New _detect_gutter_continuity() detects gutters by their unique property: the shadow runs continuously from top to bottom without interruption. Divides the image into horizontal strips and checks what fraction of strips are darker than the page median at each column. A gutter column has >= 75% of strips darker. The transition point where the smoothed dark fraction drops below 50% marks the crop boundary. Integrated as fallback between scanner shadow and binary projection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 13:58:14 +02:00
Benjamin Admin	9b5e8c6b35	Restructure upload flow: document first, then preview + naming. Step 1 is now document selection (full width). After selecting a file, Step 2 shows a side-by-side layout with document preview (3/5 width, scrollable, with fullscreen modal) and session naming (2/5 width, with start button). Also adds PDF preview via blob URL before upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 12:53:47 +02:00
Benjamin Admin	682b306e51	Use grid-build zones for vocab extraction (4-column detection). The initial build_grid_from_words() under-clusters to 1 column while _build_grid_core() correctly finds 4 columns (marker, EN, DE, example). Now extracts vocab from grid zones directly, with a heuristic to skip narrow marker columns. Falls back to original cells if zones fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:17:40 +02:00
Benjamin Admin	3e3116d2fd	Fix vocab extraction: show all columns for generic layouts. When columns can't be classified as EN/DE, map them by position: col 0 → english, col 1 → german, col 2+ → example. This ensures vocabulary pages are always extracted, even without explicit language classification. Classified pages still use the proper EN/DE/example mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:11:40 +02:00
Benjamin Admin	9a8ce69782	Fix vocab extraction: use original column types for EN/DE classification. The grid-build zones use generic column types, losing the EN/DE classification from build_grid_from_words(). Now extracts improved cells from grid zones but classifies them using the original columns_meta, which has the correct column_en/column_de types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:07:49 +02:00
Benjamin Admin	66f8a7b708	Improve vocab-worksheet UX: better status messages + error details. Change "PDF wird analysiert..." to "PDF wird hochgeladen..." (accurate); switch to pages tab immediately after upload (before thumbnails load); show progressive status: "5 Seiten erkannt. Vorschau wird geladen..."; show backend error detail instead of generic "HTTP 404"; backend returns helpful message when session not in memory after restart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:55:56 +02:00
Benjamin Admin	3b78baf37f	Replace old OCR pipeline with Kombi pipeline + add IPA/syllable toggles. Backend: _run_ocr_pipeline_for_page() now runs the full Kombi pipeline: orientation → deskew → dewarp → content crop → dual-engine OCR (RapidOCR + Tesseract merge) → _build_grid_core() with pipe-autocorrect, word-gap merge, dictionary detection; accepts ipa_mode and syllable_mode query params on process-single-page; pipeline sessions are visible in admin OCR Kombi UI for debugging. Frontend (vocab-worksheet): new "Anzeigeoptionen" section with IPA and syllable toggles; settings are passed to process-single-page as query parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:43:42 +02:00
Benjamin Admin	2828871e42	Show detected page number in session header. Extracts page_number from grid_editor_result when opening a session and displays it as an "S. 233" badge in the SessionHeader, next to the category and GT badges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:20:53 +02:00
Benjamin Admin	5c96def4ec	Skip valid line-break hyphenations in gutter repair. Words ending with "-" where the stem is a known word (e.g. "wunder-" → "wunder" is known) are valid line-break hyphenations, not gutter errors. Gutter problems cause the hyphen to be LOST ("ve" instead of "ver-"), so a visible hyphen + known stem = intentional word-wrap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:14:21 +02:00
Benjamin Admin	611e1ee33d	Add GT badge to grouped sessions and sub-pages in session list. The GT badge was only shown on ungrouped SessionRow items. Now also visible on document group rows (e.g. "GT 1/2") and individual pages within expanded groups. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 23:54:55 +02:00
Benjamin Admin	49d5212f0c	Fix hyphen-join: preserve next row + skip valid hyphenations. Two bugs fixed: (1) Apply no longer removes the continuation word from the next row. "künden" stays in row 31; only the current row is repaired ("ve" → "ver-"). The original line-break layout is preserved. (2) Analysis now skips words that already end with "-" when the direct join with the next row is a known word (valid hyphenation, not an error). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:49:07 +02:00
Benjamin Admin	e6f8e12f44	Show full Grid-Review in Ground Truth step + GT badge in session list. StepGroundTruth now shows the split view (original image + table) so the user can verify the final result before marking as GT; backend session list now returns is_ground_truth flag; SessionList shows amber "GT" badge for marked sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:34:32 +02:00
Benjamin Admin	aabd849e35	Fix hyphen-join: strip trailing punctuation from continuation word. The next-row word "künden," had a trailing comma, causing dictionary lookup to fail for "verkünden,". Now strips .,;:!? before joining. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:25:28 +02:00
Benjamin Admin	d1e7dd1c4a	Fix gutter repair: detect short fragments + show spell alternatives. Lower min word length from 3→2 for hyphen-join candidates so fragments like "ve" (from "ver-künden") are no longer skipped; return all spellchecker candidates instead of just top-1, so user can pick the correct form (e.g. "stammeln" vs "stammelt"); frontend shows clickable alternative buttons for spell_fix suggestions; backend accepts text_overrides in apply endpoint for user-selected alternatives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:09:12 +02:00
Benjamin Admin	71e1b10ac7	Add gutter repair step to OCR Kombi pipeline. New step "Wortkorrektur" between Grid-Review and Ground Truth that detects and fixes words truncated or blurred at the book gutter (binding area) of double-page scans. Uses pyspellchecker (DE+EN) for validation. Two repair strategies: hyphen_join: words split across rows with missing chars (ve + künden → verkünden); spell_fix: garbled trailing chars from gutter blur (stammeli → stammeln). Interactive frontend with per-suggestion accept/reject and batch controls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:50:16 +02:00
Benjamin Admin	21b69e06be	Fix cross-column word assignment by splitting OCR merge artifacts. When OCR merges adjacent words from different columns into one word box (e.g. "sichzie" spanning Col 1+2, "dasZimmer" crossing boundary), the grid builder assigned the entire merged word to one column. New _split_cross_column_words() function splits these at column boundaries using case transitions and spellchecker validation to avoid false positives on real words like "oder", "Kabel", "Zeitung". Regression: 12/12 GT sessions pass with diff=+0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-28 10:54:41 +01:00
Benjamin Admin	0168ab1a67	Remove Hauptseite/Box tabs from Kombi pipeline. Page-split now creates independent sessions that appear directly in the session list. After split, the UI switches to the first child session. BoxSessionTabs, sub-session state, and parent-child tracking removed from Kombi code. Legacy ocr-overlay still uses BoxSessionTabs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 17:43:58 +01:00
Benjamin Admin	925f4356ce	Use spellchecker instead of pyphen for pipe autocorrect validation. pyphen is a pattern-based hyphenator that accepts nonsense strings like "Zeplpelin". Switch to spellchecker (frequency-based word list) which correctly rejects garbled words and can suggest corrections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 16:47:42 +01:00
Benjamin Admin	cc4cb3bc2f	Add pipe auto-correction and graphic artifact filter for grid builder. autocorrect_pipe_artifacts(): strips OCR pipe artifacts from printed syllable dividers, validates with pyphen, tries char-deletion near pipe positions for garbled words (e.g. "Ze|plpe|lin" → "Zeppelin"). Rule (a2): filters isolated non-alphanumeric word boxes (≤2 chars, no letters/digits), catching small icons OCR'd as ">", "<" etc. Both fixes are generic: pyphen-validated, no session-specific logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 16:33:38 +01:00
Benjamin Admin	0685fb12da	Fix Bug 3: recover OCR-lost prefixes via overlap merge + chain merging. When OCR merge expands a prefix word box (e.g. "zer" w=42 → w=104), it heavily overlaps (>75%) with the next fragment ("brech"). The grid builder's overlap filter previously removed the prefix as a duplicate. Fix: when overlap > 75% but both boxes are alphabetic with different text and one is ≤ 4 chars, merge instead of removing. Also enable chain merging via merge_parent tracking so "zer" + "brech" + "lich" → "zerbrechlich" in a single pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 15:49:52 +01:00
Benjamin Admin	96ea23164d	Fix word-gap merge: add missing pronouns to stop words, reduce threshold. Add du/dich/dir/mich/mir/uns/euch/ihm/ihn to _STOP_WORDS to prevent false merges like "du" + "zerlegst" → "duzerlegst"; reduce max_short threshold from 6 to 5 to prevent merging multi-word phrases like "ziehen lassen" → "ziehenlassen". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 15:35:12 +01:00


552 Commits
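
The hyphen-join rules described across commits 71e1b10ac7, d1e7dd1c4a, aabd849e35, 49d5212f0c and 5c96def4ec can be illustrated with a minimal sketch. This is not the repository's code: the function name `suggest_hyphen_join`, the `known` word set (a stand-in for the spellchecker lookup), the single-missing-character search and the "fragment is already a known word" guard are all assumptions made for illustration. Only the minimum fragment length of 2, the stripped punctuation `.,;:!?`, the skip rules for words that already end in "-", and the fact that only the current row is repaired come from the commit messages.

```python
import string

TRAILING_PUNCT = ".,;:!?"   # stripped from the continuation word (aabd849e35)
MIN_FRAGMENT_LEN = 2        # lowered from 3 to 2 (d1e7dd1c4a)
ALPHABET = string.ascii_lowercase + "äöüß"  # assumption: German lowercase letters


def suggest_hyphen_join(word, next_word, known):
    """Return a repaired form of `word` (e.g. "ve" -> "ver-") or None.

    `known` is a set of lowercase dictionary words; it stands in for a
    spellchecker lookup. Only the current-row word is repaired, the
    continuation word in the next row is left untouched (49d5212f0c).
    """
    frag = word.lower()
    cont = next_word.rstrip(TRAILING_PUNCT).lower()

    if frag.endswith("-"):
        # A visible hyphen with a known stem, or a known direct join,
        # is an intentional line break, not a gutter error.
        # Other hyphen-ending words are out of scope for this sketch.
        return None

    if len(frag) < MIN_FRAGMENT_LEN or not cont:
        return None
    if frag in known:
        # Assumption: a complete known word is not treated as a fragment.
        return None

    # Case 1: only the hyphen was lost in the gutter.
    if frag + cont in known:
        return word + "-"

    # Case 2 (assumption): exactly one character plus the hyphen was lost.
    for ch in ALPHABET:
        if frag + ch + cont in known:
            return word + ch + "-"
    return None
```

With `known = {"wunder", "wunderbar", "verkünden"}`, calling `suggest_hyphen_join("ve", "künden,", known)` yields `"ver-"`, while `suggest_hyphen_join("wunder-", "bar", known)` yields `None` because the hyphenation is already valid.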