breakpilot-lehrer

Author	SHA1	Message	Date
Benjamin Admin	aed0edbf6d	Fix word split scoring: prefer longer words over short ones. "Comeon" was split as "Com eon" instead of "Come on" because both are 2-word splits. Now uses sum-of-squared-lengths as tiebreaker: "come"(16) + "on"(4) = 20 > "com"(9) + "eon"(9) = 18. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:14:23 +02:00
Benjamin Admin	9e2c301723	Add merged-word splitting to OCR spell review. OCR often merges adjacent words when spacing is tight, e.g. "atmyschool" → "at my school", "goodidea" → "good idea". New _try_split_merged_word() uses dynamic programming to find the shortest sequence of dictionary words covering the token. Integrated as step 5 in _spell_fix_token() after general spell correction. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 14:11:16 +02:00
Benjamin Admin	633e301bfd	Add camera gutter detection via vertical continuity analysis. Scanner shadow detection (range > 40, darkest < 180) fails on camera book scans where the gutter shadow is subtle (range ~25, darkest ~214). New _detect_gutter_continuity() detects gutters by their unique property: the shadow runs continuously from top to bottom without interruption. Divides the image into horizontal strips and checks what fraction of strips are darker than the page median at each column. A gutter column has >= 75% of strips darker. The transition point where the smoothed dark fraction drops below 50% marks the crop boundary. Integrated as fallback between scanner shadow and binary projection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 13:58:14 +02:00
Benjamin Admin	9b5e8c6b35	Restructure upload flow: document first, then preview + naming. Step 1 is now document selection (full width). After selecting a file, Step 2 shows a side-by-side layout with document preview (3/5 width, scrollable, with fullscreen modal) and session naming (2/5 width, with start button). Also adds PDF preview via blob URL before upload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 12:53:47 +02:00
Benjamin Admin	682b306e51	Use grid-build zones for vocab extraction (4-column detection). The initial build_grid_from_words() under-clusters to 1 column while _build_grid_core() correctly finds 4 columns (marker, EN, DE, example). Now extracts vocab from grid zones directly, with heuristic to skip narrow marker columns. Falls back to original cells if zones fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:17:40 +02:00
Benjamin Admin	3e3116d2fd	Fix vocab extraction: show all columns for generic layouts. When columns can't be classified as EN/DE, map them by position: col 0 → english, col 1 → german, col 2+ → example. This ensures vocabulary pages are always extracted, even without explicit language classification. Classified pages still use the proper EN/DE/example mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:11:40 +02:00
Benjamin Admin	9a8ce69782	Fix vocab extraction: use original column types for EN/DE classification. The grid-build zones use generic column types, losing the EN/DE classification from build_grid_from_words(). Now extracts improved cells from grid zones but classifies them using the original columns_meta which has the correct column_en/column_de types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:07:49 +02:00
Benjamin Admin	66f8a7b708	Improve vocab-worksheet UX: better status messages + error details. - Change "PDF wird analysiert..." to "PDF wird hochgeladen..." (accurate) - Switch to pages tab immediately after upload (before thumbnails load) - Show progressive status: "5 Seiten erkannt. Vorschau wird geladen..." - Show backend error detail instead of generic "HTTP 404" - Backend returns helpful message when session not in memory after restart Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:55:56 +02:00
Benjamin Admin	3b78baf37f	Replace old OCR pipeline with Kombi pipeline + add IPA/syllable toggles. Backend: - _run_ocr_pipeline_for_page() now runs the full Kombi pipeline: orientation → deskew → dewarp → content crop → dual-engine OCR (RapidOCR + Tesseract merge) → _build_grid_core() with pipe-autocorrect, word-gap merge, dictionary detection - Accepts ipa_mode and syllable_mode query params on process-single-page - Pipeline sessions are visible in admin OCR Kombi UI for debugging Frontend (vocab-worksheet): - New "Anzeigeoptionen" section with IPA and syllable toggles - Settings are passed to process-single-page as query parameters Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:43:42 +02:00
Benjamin Admin	2828871e42	Show detected page number in session header. Extracts page_number from grid_editor_result when opening a session and displays it as "S. 233" badge in the SessionHeader, next to the category and GT badges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:20:53 +02:00
Benjamin Admin	5c96def4ec	Skip valid line-break hyphenations in gutter repair. Words ending with "-" where the stem is a known word (e.g. "wunder-" → "wunder" is known) are valid line-break hyphenations, not gutter errors. Gutter problems cause the hyphen to be LOST ("ve" instead of "ver-"), so a visible hyphen + known stem = intentional word-wrap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:14:21 +02:00
Benjamin Admin	611e1ee33d	Add GT badge to grouped sessions and sub-pages in session list. The GT badge was only shown on ungrouped SessionRow items. Now also visible on document group rows (e.g. "GT 1/2") and individual pages within expanded groups. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 23:54:55 +02:00
Benjamin Admin	49d5212f0c	Fix hyphen-join: preserve next row + skip valid hyphenations. Two bugs fixed: - Apply no longer removes the continuation word from the next row. "künden" stays in row 31; only the current row is repaired ("ve" → "ver-"). The original line-break layout is preserved. - Analysis now skips words that already end with "-" when the direct join with the next row is a known word (valid hyphenation, not an error). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:49:07 +02:00
Benjamin Admin	e6f8e12f44	Show full Grid-Review in Ground Truth step + GT badge in session list. - StepGroundTruth now shows the split view (original image + table) so the user can verify the final result before marking as GT - Backend session list now returns is_ground_truth flag - SessionList shows amber "GT" badge for marked sessions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:34:32 +02:00
Benjamin Admin	aabd849e35	Fix hyphen-join: strip trailing punctuation from continuation word. The next-row word "künden," had a trailing comma, causing dictionary lookup to fail for "verkünden,". Now strips .,;:!? before joining. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:25:28 +02:00
Benjamin Admin	d1e7dd1c4a	Fix gutter repair: detect short fragments + show spell alternatives. - Lower min word length from 3→2 for hyphen-join candidates so fragments like "ve" (from "ver-künden") are no longer skipped - Return all spellchecker candidates instead of just top-1, so user can pick the correct form (e.g. "stammeln" vs "stammelt") - Frontend shows clickable alternative buttons for spell_fix suggestions - Backend accepts text_overrides in apply endpoint for user-selected alternatives Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:09:12 +02:00
Benjamin Admin	71e1b10ac7	Add gutter repair step to OCR Kombi pipeline. New step "Wortkorrektur" between Grid-Review and Ground Truth that detects and fixes words truncated or blurred at the book gutter (binding area) of double-page scans. Uses pyspellchecker (DE+EN) for validation. Two repair strategies: - hyphen_join: words split across rows with missing chars (ve + künden → verkünden) - spell_fix: garbled trailing chars from gutter blur (stammeli → stammeln) Interactive frontend with per-suggestion accept/reject and batch controls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:50:16 +02:00
Benjamin Admin	21b69e06be	Fix cross-column word assignment by splitting OCR merge artifacts. When OCR merges adjacent words from different columns into one word box (e.g. "sichzie" spanning Col 1+2, "dasZimmer" crossing boundary), the grid builder assigned the entire merged word to one column. New _split_cross_column_words() function splits these at column boundaries using case transitions and spellchecker validation to avoid false positives on real words like "oder", "Kabel", "Zeitung". Regression: 12/12 GT sessions pass with diff=+0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-28 10:54:41 +01:00
Benjamin Admin	0168ab1a67	Remove Hauptseite/Box tabs from Kombi pipeline. Page-split now creates independent sessions that appear directly in the session list. After split, the UI switches to the first child session. BoxSessionTabs, sub-session state, and parent-child tracking removed from Kombi code. Legacy ocr-overlay still uses BoxSessionTabs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 17:43:58 +01:00
Benjamin Admin	925f4356ce	Use spellchecker instead of pyphen for pipe autocorrect validation. pyphen is a pattern-based hyphenator that accepts nonsense strings like "Zeplpelin". Switch to spellchecker (frequency-based word list) which correctly rejects garbled words and can suggest corrections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 16:47:42 +01:00
Benjamin Admin	cc4cb3bc2f	Add pipe auto-correction and graphic artifact filter for grid builder. - autocorrect_pipe_artifacts(): strips OCR pipe artifacts from printed syllable dividers, validates with pyphen, tries char-deletion near pipe positions for garbled words (e.g. "Ze|plpe|lin" → "Zeppelin") - Rule (a2): filters isolated non-alphanumeric word boxes (≤2 chars, no letters/digits); catches small icons OCR'd as ">", "<" etc. - Both fixes are generic: pyphen-validated, no session-specific logic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 16:33:38 +01:00
Benjamin Admin	0685fb12da	Fix Bug 3: recover OCR-lost prefixes via overlap merge + chain merging. When OCR merge expands a prefix word box (e.g. "zer" w=42 → w=104), it heavily overlaps (>75%) with the next fragment ("brech"). The grid builder's overlap filter previously removed the prefix as a duplicate. Fix: when overlap > 75% but both boxes are alphabetic with different text and one is ≤ 4 chars, merge instead of removing. Also enable chain merging via merge_parent tracking so "zer" + "brech" + "lich" → "zerbrechlich" in a single pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 15:49:52 +01:00
Benjamin Admin	96ea23164d	Fix word-gap merge: add missing pronouns to stop words, reduce threshold. - Add du/dich/dir/mich/mir/uns/euch/ihm/ihn to _STOP_WORDS to prevent false merges like "du" + "zerlegst" → "duzerlegst" - Reduce max_short threshold from 6 to 5 to prevent merging multi-word phrases like "ziehen lassen" → "ziehenlassen" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 15:35:12 +01:00
Benjamin Admin	a8773d5b00	Fix 4 Grid Editor bugs: syllable modes, heading detection, word gaps. 1. Syllable "Original" (auto) mode: only normalize cells that already have | from OCR; don't add new syllable marks via pyphen to words without printed dividers on the original scan. 2. Syllable "Aus" (none) mode: strip residual | chars from OCR text so cells display clean (e.g. "Zel|le" → "Zelle"). 3. Heading detection: add text length guard in single-cell heuristic; words > 4 alpha chars starting lowercase (like "zentral") are regular vocabulary, not section headings. 4. Word-gap merge: new merge_word_gaps_in_zones() step with relaxed threshold (6 chars) fixes OCR splits like "zerknit tert" → "zerknittert". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 15:24:35 +01:00
Benjamin Admin	9f68bd3425	feat: Implement page-split step with auto-detection and sub-session naming StepPageSplit now: - Auto-calls POST /page-split on step entry - Shows oriented image + detection result - If double page: creates sub-sessions named "Title — S. 1/2" - If single page: green badge "keine Trennung noetig" - Manual "Weiter" button (no auto-advance) Also: - StepOrientation wrapper simplified (no page-split in orientation) - StepUpload passes name back via onUploaded(sid, name) - page.tsx: after page-split "Weiter" switches to first sub-session - useKombiPipeline exposes setSessionName Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 17:56:45 +01:00
Benjamin Admin	469f09d1e1	fix: Redesign StepUpload for manual step control StepUpload now has 3 phases: 1. File selection: drop zone / file picker → shows preview 2. Review: title input, category, file info → "Hochladen" button 3. Uploaded: shows session image → "Weiter" button No more auto-advance after upload. User controls every step. openSession() removed from onUploaded callback to prevent step-reset race condition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 17:35:36 +01:00
Benjamin Admin	3bb04b25ab	fix: OCR Kombi upload race condition — openSession was resetting step to 0 openSession mapped dbStep=1 to uiStep=0 (upload), overriding handleNext's advancement to step 1. Fix: sessions always exist post-upload, so always skip past the upload step in openSession. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 17:10:04 +01:00
Benjamin Admin	85fe0a73d6	docs: Add OCR Kombi Pipeline to MkDocs and cross-reference from OCR Pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 16:09:40 +01:00
Benjamin Admin	eaade3cad2	feat: Add "Maschinenbau" (mechanical engineering) industry + extend INDUSTRY_REGULATION_MAP. - New industry "Maschinenbau" with 15 regulations (MACHINERY_REG, BLUE_GUIDE, CRA, etc.) - BDSG added to all DE-relevant industries - National laws (HGB, AO, BGB, UrhG, etc.) mapped per industry - IoT extended: MACHINERY_REG, BLUE_GUIDE, NIS2, DE_ELEKTROG - THEMATIC_GROUPS: product safety (Produktsicherheit) extended with MACHINERY_REG + BLUE_GUIDE Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:59:31 +01:00
Benjamin Admin	d26a9f60ab	Add OCR Kombi Pipeline: modular 11-step architecture with multi-page support. Phase 1 of the clean architecture refactor: Replaces the 751-line ocr-overlay monolith with a modular pipeline. Each step gets its own component file. Frontend: /ai/ocr-kombi route with 11 steps (Upload, Orientation, PageSplit, Deskew, Dewarp, ContentCrop, OCR, Structure, GridBuild, GridReview, GroundTruth). Session list supports document grouping for multi-page uploads. Backend: New ocr_kombi/ module with multi-page PDF upload (splits PDF into N sessions with shared document_group_id). DB migration adds document_group_id and page_number columns. Old /ai/ocr-overlay remains fully functional for A/B testing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 15:55:28 +01:00
Benjamin Admin	d26233b5b3	Add page number display to StepGridReview summary bar. The page_number was only shown in GridEditor.tsx (ocr-overlay) but the OCR pipeline uses StepGridReview.tsx which has its own summary bar. Display the extracted page number (e.g. "S. 233") next to the dictionary detection badge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 11:21:44 +01:00
Benjamin Admin	e019dde01b	Extract page number as metadata instead of silently removing it. _filter_footer_words now returns page number info (text, y_pct, number) instead of just removing footer words. The page number is included in the grid result as `page_number` and displayed in the frontend summary bar as "S. 233". This preserves page numbers for later page concatenation in the customer frontend while still removing them from the grid content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 08:52:09 +01:00
Benjamin Admin	5af5d821a5	Fix 3 grid issues: artifact cells, connector col noise, footer false positive. 1. Add per-cell artifact filter (4b2): removes single-word cells with ≤2 chars and confidence <65 (e.g. "as" from stray OCR marks) 2. Add narrow connector column normalization (4d2): when ≥60% of cells in a column share the same short text (e.g. "oder"), normalize near-match outliers like "oderb" → "oder" 3. Fix footer detection: require short text (≤20 chars) and no commas. Comma-separated lists like "Uhrzeit, Vergangenheit, Zukunft" are content continuations, not page numbers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 08:18:55 +01:00
Benjamin Admin	525de55791	Fix syllable+IPA combination: strip bracket content before IPA guard. The _IPA_RE check in _syllabify_text() skipped entire cells containing any IPA character. After German IPA insertion adds [bɪltʃøn], the check blocked syllabification entirely. Now strips bracket content before checking, so programmatically inserted IPA doesn't prevent syllable divider insertion on the surrounding text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 00:03:10 +01:00
Benjamin Admin	f860eb66e6	Add German IPA support (wiki-pronunciation-dict + epitran). Hybrid approach mirroring English IPA: - Primary: wiki-pronunciation-dict (636k entries, CC-BY-SA, Wiktionary) - Fallback: epitran rule-based G2P (MIT license) IPA modes now use language-appropriate dictionaries: - auto/en: English IPA (Britfone + eng_to_ipa) - de: German IPA (wiki-pronunciation-dict + epitran) - all: EN column gets English IPA, other columns get German IPA - none: disabled Frontend shows CC-BY-SA attribution when German IPA is active. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 22:18:20 +01:00
Benjamin Admin	a73ddce43d	Fix missing PageZone import in grid_editor_helpers.py. The zone merging function used PageZone but the import was only in grid_editor_api.py. Caused NameError on sessions that trigger zone merging (e.g. original_scan_b59a1b1b). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 22:04:21 +01:00
Benjamin Admin	47e83d90bd	Remove IPA:DE option: no German IPA dictionary available. Our IPA system only has English dictionaries (Britfone MIT, eng_to_ipa MIT). The "IPA: nur DE" option was useless at best and misleading. Removed from dropdown, type definition, and API validation. Syllable DE mode stays because pyphen has a German hyphenation dictionary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 21:53:43 +01:00
Benjamin Admin	76cd1ac020	Fix false headers on sparse layouts and IPA corruption on German text CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 33s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m55s Details CI / test-python-agent-core (push) Successful in 14s Details CI / test-nodejs-website (push) Successful in 17s Details 1. Header detection: Add 25% cap to single-cell heading heuristic. On German synonym dicts where most rows naturally have only 1 content cell, the old logic marked 60%+ of rows as headers. 2. IPA de/all mode: Use "column_text" (light processing) for non- English columns instead of "column_en" (full processing). The full path runs _insert_missing_ipa() which splits on whitespace, matches English prefixes ("bildschön" → "bild"), and truncates the rest — destroying German comma-separated synonym lists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 21:49:05 +01:00
Benjamin Admin	256df820cd	Auto-rebuild grid when IPA or syllable mode dropdown changes CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 2m0s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 15s Details The dropdowns only updated state but didn't trigger buildGrid(). Now a useEffect watches ipaMode/syllableMode and rebuilds automatically (skipping the initial mount). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 20:43:20 +01:00
Benjamin Admin	7773c51304	Fix en/de mode edge case on docs without detected English column CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 27s Details CI / test-python-klausur (push) Failing after 1m54s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 17s Details When no IPA signals exist (e.g. German-only dicts), the fallback that guesses en_col_type was incorrectly triggered for en/de modes, causing false IPA and syllable insertions. Now only fires for 'all' mode. Syllable en mode also returns empty set when no EN column found. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 08:37:15 +01:00
Benjamin Admin	83c058e400	Add language-specific IPA and syllable modes (de/en) CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 25s Details CI / test-go-edu-search (push) Successful in 25s Details CI / test-python-klausur (push) Failing after 1m50s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 15s Details Extend ipa_mode and syllable_mode toggles with language options: - auto: smart detection (default) - en: only English headword column - de: only German definition columns - all: all content columns - none: skip entirely Also improve English column auto-detection: use garbled IPA patterns (apostrophes, colons) in addition to bracket patterns. This correctly identifies English dictionary pages where OCR produces garbled ASCII instead of bracket IPA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 08:16:29 +01:00
Benjamin Admin	34680732f8	Add IPA and syllable mode toggles, fix false IPA on German documents CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 24s Details CI / test-go-edu-search (push) Successful in 26s Details CI / test-python-klausur (push) Failing after 2m1s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 15s Details Backend: Remove en_col_type fallback heuristic (longest avg text) that incorrectly identified German columns as English. IPA now only applied when OCR bracket patterns are actually found. Add ipa_mode (auto/all/none) and syllable_mode (auto/all/none) query params to build-grid API. Frontend: Add IPA and Silben dropdown selects to GridToolbar. Modes are passed as query params on rebuild. Auto = current smart detection, All = force for all words, Aus = skip entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 08:04:44 +01:00
Benjamin Admin	c42924a94a	Fix IPA correction persistence and false-positive prefix matching CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-school (push) Successful in 34s Details CI / test-go-edu-search (push) Successful in 24s Details CI / test-python-klausur (push) Failing after 1m57s Details CI / test-python-agent-core (push) Successful in 15s Details CI / test-nodejs-website (push) Successful in 21s Details Step 5i was overwriting IPA-corrected text from Step 5c when reconstructing cells from word_boxes. Added _ipa_corrected flag to preserve corrections. Also tightened merged-token prefix matching (min prefix 4 chars, min suffix 3 chars) to prevent false positives like "sis" being extracted from "si:said". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-25 07:26:32 +01:00
Benjamin Admin	9ea217bdfc	Fix IPA correction for dictionary pages (WIP) - Fix Step 5h: restrict slash-IPA conversion to English headword column only — prevents converting "der/die/das" to "der [dər]das" in German columns (confirmed working) - Fix _text_has_garbled_ipa: detect embedded apostrophes in merged tokens like "Scotland'skotland" where OCR reads ˈ as ' - Fix _insert_missing_ipa: detect dictionary word prefix in merged trailing tokens like "fictionsalans'fIkfn" → extract "fiction" with IPA - Move en_col_type to wider scope for Step 5h access Note: Fixes 1+2 confirmed working in unit tests but not yet applying in the full build-grid pipeline — needs further debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 23:54:14 +01:00
Benjamin Admin	4feec7c7b7	Lower syllable pipe-ratio threshold from 5% to 1% Real dictionary pages have only ~3% OCR-detected pipes because the thin syllable divider lines are hard for OCR to read. The primary false-positive guard (article_col_index check) already blocks synonym dictionaries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 23:17:08 +01:00
Benjamin Admin	ed7fc99fc4	Improve syllable divider insertion for dictionary pages Rewrite cv_syllable_detect.py with pyphen-first approach: - Remove unreliable CV gate (morphological pipe detection) - Strip existing pipes and re-syllabify via pyphen (DE then EN) - Merge pipe-gap spaces where OCR split words at divider positions - Guard merges with function word blacklist and punctuation checks Add false-positive prevention: - Pre-check: skip if <5% of cells have existing | from OCR - Call-site check: require article_col_index (der/die/das column) - Prevents syllabification of synonym dictionaries and word lists Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 19:44:29 +01:00
Benjamin Admin	7fbcae954b	fix: auto-trigger orientation for page-split sessions without result Page-split sessions (start_step=1) have no orientation_result stored. StepOrientation now auto-runs orientation detection when loading an existing session that lacks a result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 17:19:56 +01:00
Benjamin Admin	f931091b57	refactor: independent sessions for page-split + URL-based pipeline navigation Page-split now creates independent sessions (no parent_session_id), parent marked as status='split' and hidden from list. Navigation uses useSearchParams for URL-based step tracking (browser back/forward works). page.tsx reduced from 684 to 443 lines via usePipelineNavigation hook. Box sub-sessions (column detection) remain unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 17:05:33 +01:00
Benjamin Admin	f34340de9c	Fix sub-session completion flow: navigate to next incomplete sub-session Instead of returning to parent (which creates a redirect loop), the handleNext function now finds the next incomplete sub-session and opens it directly. When all sub-sessions are done, returns to session list. Also fixes openSession auto-redirect to prefer the first incomplete sub-session over the most advanced one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 16:33:56 +01:00
Benjamin Admin	55de6c21d2	Fix session resume: auto-open most advanced sub-session on parent click When reopening a parent session that has page-split sub-sessions, the UI was showing the parent's pipeline step (always step 1/Orientation) instead of navigating to the sub-sessions. Now automatically opens the most advanced sub-session, matching the behavior of handleOrientationComplete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 16:04:53 +01:00
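
Note on 76cd1ac020: the 25% cap on the single-cell heading heuristic can be illustrated with a minimal sketch. The function name, the row representation (a list of cell strings per row) and the exact comparison are assumptions made for illustration; the commit message states only the 25% figure and the failure mode on synonym dictionaries.

```python
from typing import List


def single_cell_heading_rows(rows: List[List[str]], max_fraction: float = 0.25) -> List[int]:
    """Return indices of rows that look like headings (exactly one non-empty cell).

    Hypothetical helper: if more than `max_fraction` of all rows qualify,
    the layout is assumed to be naturally sparse (e.g. a synonym dictionary)
    and no row is reported as a heading.
    """
    if not rows:
        return []
    candidates = [
        i for i, row in enumerate(rows)
        if sum(1 for cell in row if cell.strip()) == 1
    ]
    # Cap: too many single-cell rows means the heuristic is not trustworthy here.
    if len(candidates) / len(rows) > max_fraction:
        return []
    return candidates
```

With this cap, a page where one row in four is a single-cell row still gets its heading, while a page where most rows have a single content cell gets none.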

525 Commits
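
Note on ed7fc99fc4 and 4feec7c7b7: the pipe-ratio pre-check described in those messages can be sketched as follows. The function name and the flat list-of-strings input are assumptions; only the thresholds (5%, later lowered to 1%) and the roughly 3% figure for real dictionary pages come from the commit messages.

```python
from typing import Iterable


def passes_pipe_precheck(cells: Iterable[str], min_ratio: float = 0.01) -> bool:
    """Return True if enough non-empty cells already contain an OCR-detected '|'.

    Hypothetical helper: syllable insertion would be skipped when the share
    of cells containing a pipe is below `min_ratio`.
    """
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return False
    with_pipes = sum(1 for c in non_empty if "|" in c)
    return with_pipes / len(non_empty) >= min_ratio
```

A page where 3 of 100 cells contain a pipe passes at the 1% threshold but would have been rejected at the earlier 5% threshold, which matches the motivation given in 4feec7c7b7.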