Commit Graph

484 Commits

Author SHA1 Message Date
Benjamin Admin
34680732f8 Add IPA and syllable mode toggles, fix false IPA on German documents
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
Backend: Remove en_col_type fallback heuristic (longest avg text) that
incorrectly identified German columns as English. IPA now only applied
when OCR bracket patterns are actually found. Add ipa_mode (auto/all/none)
and syllable_mode (auto/all/none) query params to build-grid API.

Frontend: Add IPA and Silben (syllables) dropdown selects to GridToolbar.
Modes are passed as query params on rebuild. Auto = current smart detection,
All = force for all words, Aus (off, sent as "none") = skip entirely.
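
To make the three modes concrete, here is a minimal sketch of how an ipa_mode or syllable_mode value might be validated and applied. Only the values auto/all/none come from the commit message; the function names and the `detected` flag are hypothetical and not taken from the build-grid code.

```python
VALID_MODES = ("auto", "all", "none")


def normalize_mode(value, default="auto"):
    """Return a validated mode string, falling back to the default when missing."""
    if value is None or value == "":
        return default
    mode = value.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {value!r}")
    return mode


def should_apply(mode, detected):
    """Decide whether IPA or syllable processing runs for a cell.

    detected is the outcome of the automatic detection (hypothetical flag).
    """
    if mode == "all":
        return True   # force for all words
    if mode == "none":
        return False  # skip entirely
    return detected   # "auto" defers to the detection result
```

With this shape, a rebuild request that omits the parameter behaves like "auto", so existing callers would keep the current behaviour.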

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 08:04:44 +01:00
Benjamin Admin
c42924a94a Fix IPA correction persistence and false-positive prefix matching
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 21s
Step 5i was overwriting IPA-corrected text from Step 5c when
reconstructing cells from word_boxes. Added _ipa_corrected flag
to preserve corrections. Also tightened merged-token prefix matching
(min prefix 4 chars, min suffix 3 chars) to prevent false positives
like "sis" being extracted from "si:said".
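
The effect of the two length limits can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name, the dictionary argument, and the step of reducing the token to its letters are all hypothetical. Only the limits (prefix of at least 4, suffix of at least 3) and the "sis" example come from the commit message.

```python
def extract_headword(token, dictionary, min_prefix=4, min_suffix=3):
    """Return the longest dictionary word that prefixes the token, or None.

    A match only counts if the prefix has at least min_prefix characters
    and at least min_suffix characters remain after it.
    """
    # Assumption: compare on letters only, case-insensitively
    letters = "".join(ch for ch in token if ch.isalpha()).lower()
    # Try longer prefixes first so a full headword wins over a short one
    for end in range(len(letters) - min_suffix, min_prefix - 1, -1):
        candidate = letters[:end]
        if candidate in dictionary:
            return candidate
    return None
```

With the limits in place, "si:said" yields no match because no prefix of at least 4 letters leaves a suffix of at least 3, while a long merged token that starts with "fiction" still resolves to that headword.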

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 07:26:32 +01:00
Benjamin Admin
9ea217bdfc Fix IPA correction for dictionary pages (WIP)
- Fix Step 5h: restrict slash-IPA conversion to English headword column
  only — prevents converting "der/die/das" to "der [dər]das" in German
  columns (confirmed working)
- Fix _text_has_garbled_ipa: detect embedded apostrophes in merged
  tokens like "Scotland'skotland" where OCR reads ˈ as '
- Fix _insert_missing_ipa: detect dictionary word prefix in merged
  trailing tokens like "fictionsalans'fIkfn" → extract "fiction" with IPA
- Move en_col_type to wider scope for Step 5h access

Note: Fixes 1+2 confirmed working in unit tests but not yet applying
in the full build-grid pipeline — needs further debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 23:54:14 +01:00
Benjamin Admin
4feec7c7b7 Lower syllable pipe-ratio threshold from 5% to 1%
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Real dictionary pages have only ~3% OCR-detected pipes because the thin
syllable divider lines are hard for OCR to read. The primary false-positive
guard (article_col_index check) already blocks synonym dictionaries.
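
A minimal sketch of such a pipe-ratio pre-check, assuming the ratio is computed over non-empty cell texts (the commit does not say which denominator is used; the function names are hypothetical):

```python
def pipe_ratio(cell_texts):
    """Fraction of non-empty cells whose text contains a '|' character."""
    non_empty = [t for t in cell_texts if t and t.strip()]
    if not non_empty:
        return 0.0
    return sum("|" in t for t in non_empty) / len(non_empty)


def passes_pipe_precheck(cell_texts, threshold=0.01):
    """True when enough cells carry OCR-detected pipes to attempt syllabification."""
    return pipe_ratio(cell_texts) >= threshold
```

A page where 3 of 100 cells contain a pipe passes at the 1% threshold but would have been skipped at 5%.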

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 23:17:08 +01:00
Benjamin Admin
ed7fc99fc4 Improve syllable divider insertion for dictionary pages
Rewrite cv_syllable_detect.py with pyphen-first approach:
- Remove unreliable CV gate (morphological pipe detection)
- Strip existing pipes and re-syllabify via pyphen (DE then EN)
- Merge pipe-gap spaces where OCR split words at divider positions
- Guard merges with function word blacklist and punctuation checks
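
The strip-and-re-syllabify step listed above could look roughly like this. The sketch takes the hyphenators as plain callables so it stays independent of any library; the function name and the rule "use the first hyphenator that finds a break" are assumptions, since the commit only states the order DE then EN.

```python
def resyllabify(word, hyphenators, divider="|"):
    """Strip existing dividers, then re-insert them with the first hyphenator
    that finds at least one break.

    hyphenators: callables mapping a plain word to a divided word, tried in
    order (for example German first, then English).
    """
    plain = word.replace(divider, "")
    for insert in hyphenators:
        divided = insert(plain)
        if divider in divided:
            return divided
    return plain  # no hyphenator found a break
```

With pyphen installed, a German hyphenator could be supplied as `lambda w: pyphen.Pyphen(lang="de_DE").inserted(w, hyphen="|")`, followed by an English one built the same way.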

Add false-positive prevention:
- Pre-check: skip if <5% of cells have existing | from OCR
- Call-site check: require article_col_index (der/die/das column)
- Prevents syllabification of synonym dictionaries and word lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 19:44:29 +01:00
Benjamin Admin
7fbcae954b fix: auto-trigger orientation for page-split sessions without result
Page-split sessions (start_step=1) have no orientation_result stored.
StepOrientation now auto-runs orientation detection when loading an
existing session that lacks a result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 17:19:56 +01:00
Benjamin Admin
f931091b57 refactor: independent sessions for page-split + URL-based pipeline navigation
Page-split now creates independent sessions (no parent_session_id),
parent marked as status='split' and hidden from list. Navigation uses
useSearchParams for URL-based step tracking (browser back/forward works).
page.tsx reduced from 684 to 443 lines via usePipelineNavigation hook.

Box sub-sessions (column detection) remain unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 17:05:33 +01:00
Benjamin Admin
f34340de9c Fix sub-session completion flow: navigate to next incomplete sub-session
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
Instead of returning to parent (which creates a redirect loop), the
handleNext function now finds the next incomplete sub-session and opens
it directly. When all sub-sessions are done, returns to session list.

Also fixes openSession auto-redirect to prefer the first incomplete
sub-session over the most advanced one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 16:33:56 +01:00
Benjamin Admin
55de6c21d2 Fix session resume: auto-open most advanced sub-session on parent click
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 37s
CI / test-nodejs-website (push) Successful in 15s
When reopening a parent session that has page-split sub-sessions,
the UI was showing the parent's pipeline step (always step 1/Orientation)
instead of navigating to the sub-sessions. Now automatically opens the
most advanced sub-session, matching the behavior of handleOrientationComplete.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 16:04:53 +01:00
Benjamin Admin
52b66ebe07 Fix NameError: _text_has_garbled_ipa not imported in grid_editor_helpers
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
After refactoring grid_editor_api.py into helpers, the function
_text_has_garbled_ipa was used in _detect_heading_rows_by_single_cell
but never imported from cv_ocr_engines. This caused HTTP 500 on
build-grid for sessions that trigger single-cell heading detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 15:11:29 +01:00
Benjamin Admin
424e5c51d4 fix: remove nested scrollbar in grid editor
Removed overflow-y-auto and maxHeight from the grid container div.
The page itself handles scrolling — nested scroll containers caused
the bottom rows to be cut off after editing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 15:06:28 +01:00
Benjamin Admin
12b4c61bac refactor: extract grid helpers + generic CV-gated syllable insertion
1. Extracted 1367 lines of helper functions from grid_editor_api.py
   (3051→1620 lines) into grid_editor_helpers.py (filters, detectors,
   zone grid building).

2. Created cv_syllable_detect.py with generic CV+pyphen logic:
   - Checks EVERY word_box for vertical pipe lines (not just first word)
   - No article-column dependency — works with any dictionary layout
   - CV morphological detection gates pyphen insertion

3. Grid editor scroll: calc(100vh-200px) for reliable scrolling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 14:39:33 +01:00
Benjamin Admin
d9b2aa82e9 fix: CV-gated syllable insertion + grid editor scroll
1. Syllable dividers now require CV validation: morphological vertical
   line detection checks if word_box image actually shows thin isolated
   pipe lines before applying pyphen. Only first word per cell gets
   pipes (matching dictionary print layout).

2. Grid editor scroll: changed maxHeight from 80vh to calc(100vh-200px)
   so editor remains scrollable after edits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 14:31:16 +01:00
Benjamin Admin
364086b86e feat: auto-insert syllable dividers via pyphen on dictionary pages
OCR engines don't detect | pipe chars used as syllable dividers in
dictionaries. After dictionary detection (is_dict=True), use pyphen
(MIT) to insert syllable breaks into headword cells. Tries DE first,
then EN. Skips IPA content, short words, and cells already containing |.

Also adds pyphen>=0.16.0 to requirements.txt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 14:17:26 +01:00
Benjamin Admin
fe754398c0 fix: Step 4f sidebar detection uses avg text length instead of fill ratio
Column_1 data showed avg_len=1.0 with 13 single-char cells (alphabet
letters from sidebar). Old fill_ratio check (76% > 35%) missed it.
New criteria: avg_len ≤ 1.5 AND ≥ 70% single chars → removes column.
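
The new criteria translate into a short check. The thresholds are the ones stated above; the function name and the choice to ignore empty cells are assumptions.

```python
def is_alphabet_sidebar(cell_texts, max_avg_len=1.5, min_single_ratio=0.70):
    """True when a column looks like an alphabet sidebar (mostly single letters)."""
    texts = [t.strip() for t in cell_texts if t and t.strip()]
    if not texts:
        return False
    avg_len = sum(len(t) for t in texts) / len(texts)
    single_ratio = sum(len(t) == 1 for t in texts) / len(texts)
    return avg_len <= max_avg_len and single_ratio >= min_single_ratio
```

A column of 13 single-character cells has avg_len 1.0 and a single-character share of 100%, so it is flagged regardless of how densely the column is filled.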

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 14:10:43 +01:00
Benjamin Admin
be86a7d14d fix: preserve pipe syllable dividers + detect alphabet sidebar columns
1. Pipe divider fix: Changed OCR char-confusion regex so | between
   letters (Ka|me|rad) is NOT converted to I. Only standalone/
   word-boundary pipes are converted (|ch → Ich, | want → I want).

2. Alphabet sidebar detection improvements:
   - _filter_decorative_margin() now considers 2-char words (OCR reads
     "Aa", "Bb" from sidebars), lowered min strip from 8→6
   - _filter_border_strip_words() lowered decorative threshold from 50%→45%
   - New step 4f: grid-level thin-edge-column filter as safety net —
     removes edge columns with <35% fill rate and >60% short text
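
For the pipe divider fix in item 1, one way to express "only convert pipes that are not between letters" is a negative lookbehind. This is a sketch of the idea, not the project's actual regex: it treats any pipe that is not preceded by a letter as a misread capital I.

```python
import re

# A pipe that is NOT directly preceded by a letter is treated as a misread "I".
# [^\W\d_] matches any Unicode letter, so umlauts count as letters too.
_PIPE_AS_I = re.compile(r"(?<![^\W\d_])\|")


def fix_pipe_confusion(text):
    """Convert standalone or word-initial pipes to 'I', keep syllable dividers."""
    return _PIPE_AS_I.sub("I", text)
```

Under this rule "Ka|me|rad" is left alone, while "|ch" and "| want" become "Ich" and "I want".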

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 13:52:11 +01:00
Benjamin Admin
19a5f69272 fix: make Grid Editor vertically scrollable so all rows are visible
The right panel (grid area) had no vertical overflow handling, causing
the last ~5 rows to be clipped and invisible. Added overflow-y-auto
with max-height 80vh, and removed overflow-hidden from the GridTable
wrapper that was cutting off content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 13:33:52 +01:00
Benjamin Admin
ea09fc75df fix: resolve circular import with lazy import for _build_reference_snapshot
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 13:18:21 +01:00
Benjamin Admin
410d36f3de feat: save automatic grid snapshot before manual edits for GT comparison
- build-grid now saves the automatic OCR result as ground_truth.auto_grid_snapshot
- mark-ground-truth includes a correction_diff comparing auto vs corrected
- New endpoint GET /correction-diff returns detailed diff with per-col_type
  accuracy breakdown (english, german, ipa, etc.)
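
A per-col_type breakdown of this kind can be sketched as below. The data shape (a mapping from cell id to a dict with "text" and "col_type") and the function name are assumptions made for illustration; the real snapshot format is not shown in the log.

```python
from collections import defaultdict


def correction_diff(auto_cells, corrected_cells):
    """Compare two {cell_id: {"text": ..., "col_type": ...}} mappings.

    Returns the changed cells and, per col_type, the share of cells the
    automatic result already had right.
    """
    changes = []
    totals = defaultdict(int)
    correct = defaultdict(int)
    for cell_id, fixed in corrected_cells.items():
        col_type = fixed.get("col_type", "unknown")
        auto_text = auto_cells.get(cell_id, {}).get("text", "")
        totals[col_type] += 1
        if auto_text == fixed["text"]:
            correct[col_type] += 1
        else:
            changes.append(
                {"cell_id": cell_id, "auto": auto_text, "corrected": fixed["text"]}
            )
    accuracy = {ct: correct[ct] / totals[ct] for ct in totals}
    return {"changes": changes, "accuracy": accuracy}
```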

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 13:16:44 +01:00
Benjamin Admin
72ce4420cb fix: advance uiStep past skipped orientation for page-split sub-sessions
Page-split sub-sessions (current_step=2) had orientation marked as skipped
but uiStep remained at 0 (orientation step), causing StepOrientation to
render for a sub-session that has no orientation data. Now advances to
uiStep=1 (deskew) when orientation is skipped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 12:59:36 +01:00
Benjamin Admin
63dfb4d06f fix: replace reset useEffects with key prop for step component remount
The reset useEffects in StepOrientation/Deskew/Dewarp/Crop were clearing
orientationResult when sessionId changed (e.g. during handleOrientationComplete),
causing the right side of ImageCompareView to show nothing. Using key={sessionId}
on the step components instead forces React to remount with fresh state when
switching sessions, without interfering with the upload/orientation flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 12:20:50 +01:00
Benjamin Admin
08a91ba2be Fix sub-session tab switching: reset step state on sessionId change
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
Step components (Deskew, Dewarp, Crop, Orientation) had local state
guards that prevented reloading when sessionId changed via sub-session
tab clicks. Added useEffect reset hooks that clear all local state
when sessionId changes, allowing the component to properly reload
the new session's data.

Also renamed "Box N" to "Seite N" in BoxSessionTabs per user feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 12:04:23 +01:00
Benjamin Admin
49a36364a8 Add double-page split support to OCR Overlay (Kombi 7 Schritte)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
The page-split detection was only implemented in the regular pipeline
page but not in the OCR Overlay page where the user actually tests
with Kombi mode. Now the overlay page has full sub-session support:

- openSession: handles sub_sessions, parent_session_id, skip logic
  for page-split vs crop-based sub-sessions, preserves current mode
- handleOrientationComplete: async, fetches API to detect sub-sessions
- BoxSessionTabs: shown between stepper and step content
- handleNext: returns to parent after sub-session completion
- handleSessionChange/handleBoxSessionsCreated: session switching

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 11:48:26 +01:00
Benjamin Admin
14fd8e0b1e Fix page-split: fetch sub-sessions from API instead of React state
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
handleOrientationComplete was checking subSessions from React state,
but due to batching the state was still empty when the user clicked
"Seiten verarbeiten". Now fetches session data directly from the API
to reliably detect sub-sessions and auto-open the first one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 11:22:15 +01:00
Benjamin Admin
247b79674d Add double-page spread detection to frontend pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 34s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
After orientation detection, the frontend now automatically calls the
page-split endpoint. When a double-page book spread is detected, two
sub-sessions are created and each goes through the full pipeline
(deskew/dewarp/crop) independently — essential because each page of a
spread tilts differently due to the spine.

Frontend changes:
- StepOrientation: calls POST /page-split after orientation, shows
  split info ("Doppelseite erkannt"), notifies parent of sub-sessions
- page.tsx: distinguishes page-split sub-sessions (current_step < 5)
  from crop-based sub-sessions (current_step >= 5). Page-split subs
  only skip orientation, not deskew/dewarp/crop.
- page.tsx: handleOrientationComplete opens first sub-session when
  page-split was detected

Backend changes (orientation_crop_api.py):
- page-split endpoint falls back to original image when orientation
  rotated a landscape spread to portrait
- start_step parameter: 1 if split from original, 2 if from oriented

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 11:09:44 +01:00
Benjamin Admin
40815dafd1 feat(ocr-pipeline): add page-split endpoint for double-page book spreads
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 20s
Each page of a double-page scan tilts differently due to the book spine.
The new POST /page-split endpoint detects spreads after orientation and
creates sub-sessions that go through the full pipeline (deskew, dewarp,
crop, etc.) individually, so each page gets its own deskew correction.

Also fixes border-strip filter incorrectly removing German translation
words by adding a decorative-strip validation check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 10:53:06 +01:00
Benjamin Admin
2a21127f01 fix(ocr-pipeline): improve page crop spine detection and cell assignment
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
1. page_crop: Score all dark runs by center-proximity × darkness ×
   narrowness instead of picking the widest. Fixes ad810209 where a
   wide dark area at 35% was chosen over the actual spine at 50%.

2. cv_words_first: Replace x-center-only word→column assignment with
   overlap-based three-pass strategy (overlap → midpoint-range → nearest).
   Fixes truncated German translations like "Schal" instead of
   "Schal - die Schals" in session 079cd0d9.
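
The three-pass strategy in item 2 can be sketched as follows. The exact meaning of "midpoint-range" is not spelled out in the commit; this sketch assumes each column owns the span between the midpoints of the gaps to its neighbours, with the outer columns bounded by their own edges. The function name and tuple layout are hypothetical.

```python
def assign_column(word, columns):
    """Return the index of the column a word belongs to.

    word: (x_left, x_right); columns: list of (x_left, x_right), left to right.
    """
    if not columns:
        raise ValueError("columns must not be empty")
    w_left, w_right = word
    center = (w_left + w_right) / 2

    # Pass 1: column with the largest positive horizontal overlap
    overlaps = [min(w_right, c_r) - max(w_left, c_l) for c_l, c_r in columns]
    best = max(range(len(columns)), key=lambda i: overlaps[i])
    if overlaps[best] > 0:
        return best

    # Pass 2: word center lies in the range bounded by the gap midpoints
    for i, (c_l, c_r) in enumerate(columns):
        lo = (columns[i - 1][1] + c_l) / 2 if i > 0 else c_l
        hi = (c_r + columns[i + 1][0]) / 2 if i < len(columns) - 1 else c_r
        if lo <= center < hi:
            return i

    # Pass 3: fall back to the column whose center is nearest
    return min(
        range(len(columns)),
        key=lambda i: abs((columns[i][0] + columns[i][1]) / 2 - center),
    )
```

Compared with an x-center-only rule, a word that straddles a column boundary goes to the column it overlaps most, which is the kind of case that truncated entries such as "Schal - die Schals".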

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 09:23:30 +01:00
Benjamin Admin
9d34c5201e feat(grid-editor): add manual cell color control via right-click menu
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 23s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 13s
CI / test-nodejs-website (push) Successful in 15s
Users can now right-click any cell to set text color (red, green, blue,
orange, purple, black) or remove the color bar without changing text.
A "reset" option restores the OCR-detected color. This enables accurate
Ground Truth marking when OCR assigns colors to wrong cells.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:51:18 +01:00
Benjamin Admin
d54814fa70 feat: color bar respects edits + column pattern auto-correction
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 23s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
- Color bar (red/colored indicator) now only shows when word_boxes
  text still matches the cell text — editing the cell hides stale colors
- New "Auto-Korrektur" button: detects dominant prefix+number patterns
  per column (e.g. p.70, p.71) and completes partial entries (.65 → p.65)
  — requires 3+ matching entries before correcting
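
The Auto-Korrektur rule can be illustrated with a small sketch. It is written in Python for consistency with the other examples here, although the feature itself may live in the frontend. The regular expressions and the function name are assumptions; the "3+ matching entries" limit and the ".65 → p.65" example come from the commit message.

```python
import re
from collections import Counter

_FULL = re.compile(r"^([A-Za-z]+\.)(\d+)$")  # complete entry, e.g. "p.70"
_PARTIAL = re.compile(r"^\.(\d+)$")          # entry missing its prefix, e.g. ".65"


def autocorrect_column(values, min_matches=3):
    """Complete partial entries using the column's dominant prefix.

    Nothing is changed unless at least min_matches complete entries share
    the same prefix.
    """
    matches = (_FULL.match(v.strip()) for v in values)
    prefixes = Counter(m.group(1) for m in matches if m)
    if not prefixes:
        return list(values)
    prefix, count = prefixes.most_common(1)[0]
    if count < min_matches:
        return list(values)
    fixed = []
    for v in values:
        partial = _PARTIAL.match(v.strip())
        fixed.append(f"{prefix}{partial.group(1)}" if partial else v)
    return fixed
```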

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:38:11 +01:00
Benjamin Admin
d6f4944bcc fix: remove maxHeight limit on grid editor — shows all rows
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 19s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:24:50 +01:00
Benjamin Admin
ee0d9c881e fix: column resize handle now accessible above add/delete buttons
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
Resize handle: wider (9px), z-40 (above z-30 buttons).
Add-column button moved to bottom-right corner to avoid overlap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:20:04 +01:00
Benjamin Admin
65f4ce1947 feat: ImageLayoutEditor, arrow-key nav, multi-select bold, wider columns
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 32s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
- New ImageLayoutEditor: SVG overlay on original scan with draggable
  column dividers, horizontal guidelines (margins/header/footer),
  double-click to add columns, x-button to delete
- GridTable: MIN_COL_WIDTH 40→80px for better readability
- Arrow up/down keys navigate between rows in the grid editor
- Ctrl+Click for multi-cell selection, Ctrl+B to toggle bold on selection
- getAdjacentCell works for cells that don't exist yet (new rows/cols)
- deleteColumn now merges x-boundaries correctly
- Session restore fix: grid_editor_result/structure_result in session GET
- Footer row 3-state cycle, auto-create cells for empty footer rows
- Grid save/build/GT-mark now advance current_step=11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 07:45:39 +01:00
Benjamin Admin
4e668660a7 feat: add Woerterbuch category + column add/delete in grid editor
- New document category "Woerterbuch" (frontend type + backend validation)
- Column delete: hover column header → red "x" button (with confirmation)
- Column add: hover column header → "+" button inserts after that column
- Both operations support undo/redo, update cell IDs and summary
- Available in both GridEditor and StepGridReview (Kombi last step)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 16:27:12 +01:00
Benjamin Admin
7a6eadde8b feat: integrate Ground Truth review into Kombi Pipeline last step
- New StepGridReview component: split-view (scan image left, grid right),
  confidence stats, row-accept buttons, zoom controls
- Kombi Pipeline case 6 now uses StepGridReview instead of plain GridEditor
- Kombi step label changed to "Review & GT"
- Ground Truth queue page simplified to overview/navigation only
  (links to Kombi pipeline for actual review work)
- Deep-link support: /ai/ocr-overlay?session=xxx&mode=kombi

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 15:04:23 +01:00
Benjamin Admin
4e809c3860 fix: ground-truth crash on col_type + remove AIToolsSidebarResponsive from model-management
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
- Ground-truth: zone.columns use 'label' not 'col_type' — calling
  .replace() on undefined crashed the page after grid data loaded
- Model-management: same AIToolsSidebarResponsive wrapper bug as the
  other pages — does not render children

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 10:14:02 +01:00
Benjamin Admin
dccbb909bc fix: remove AIToolsSidebarResponsive wrapper from ground-truth and regression pages
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
AIToolsSidebarResponsive does not accept children — it renders only a
sidebar nav. Using it as a wrapper caused page content to never render.
Replaced with plain div, matching the pattern used by ocr-pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 09:57:52 +01:00
Benjamin Admin
be7f5f1872 feat: Sprint 2 — TrOCR ONNX, PP-DocLayout, Model Management
D2: TrOCR ONNX export script (printed + handwritten, int8 quantization)
D3: PP-DocLayout ONNX export script (download or Docker-based conversion)
B3: Model Management admin page (PyTorch vs ONNX status, benchmarks, config)
A4: TrOCR ONNX service with runtime routing (auto/pytorch/onnx via TROCR_BACKEND)
A5: PP-DocLayout ONNX detection with OpenCV fallback (via GRAPHIC_DETECT_BACKEND)
B4: Structure Detection UI toggle (OpenCV vs PP-DocLayout) with class color coding
C3: TrOCR-ONNX.md documentation
C4: OCR-Pipeline.md ONNX section added
C5: mkdocs.yml nav updated, optimum added to requirements.txt
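
The runtime routing mentioned in A4 might be resolved along these lines. Only the variable name TROCR_BACKEND and the values auto/pytorch/onnx come from the commit; the default of "auto", the preference for ONNX when it is available, and the function name are assumptions.

```python
import os


def resolve_trocr_backend(onnx_available, env=None):
    """Pick the TrOCR backend from TROCR_BACKEND (auto/pytorch/onnx)."""
    env = os.environ if env is None else env
    choice = env.get("TROCR_BACKEND", "auto").strip().lower()
    if choice not in ("auto", "pytorch", "onnx"):
        raise ValueError(f"unsupported TROCR_BACKEND: {choice!r}")
    if choice == "auto":
        # Assumption: prefer ONNX when an exported model is present
        return "onnx" if onnx_available else "pytorch"
    return choice
```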

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 09:53:02 +01:00
Benjamin Admin
c695b659fb fix: PagePurpose props on ground-truth and regression pages
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
Both pages passed `moduleId` which is not a valid prop for PagePurpose.
The component expects explicit title/purpose/audience — calling
audience.join() on undefined caused the client-side crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 09:43:36 +01:00
Benjamin Admin
a1e079b911 feat: Sprint 1 — IPA hardening, regression framework, ground-truth review
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 19s
Track A (Backend):
- Compound word IPA decomposition (schoolbag→school+bag)
- Trailing garbled IPA fragment removal after brackets (R21 fix)
- Regression runner with DB persistence, history endpoints
- Page crop determinism verified with tests
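
As an illustration of the compound decomposition in Track A, the sketch below splits a word into two parts that both have dictionary entries and joins their transcriptions. The function name, the two-part limit, the minimum part length, and the plain concatenation (which ignores stress placement) are assumptions; only the schoolbag example comes from the commit.

```python
def decompose_compound(word, ipa_dict, min_part=3):
    """Split a word into two dictionary parts and join their IPA.

    Returns (head, tail, ipa) for the first split where both parts are in
    ipa_dict, or None when no such split exists.
    """
    lower = word.lower()
    for cut in range(min_part, len(lower) - min_part + 1):
        head, tail = lower[:cut], lower[cut:]
        if head in ipa_dict and tail in ipa_dict:
            return head, tail, ipa_dict[head] + ipa_dict[tail]
    return None
```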

Track B (Frontend):
- OCR Regression dashboard (/ai/ocr-regression)
- Ground Truth Review workflow (/ai/ocr-ground-truth)
  with split-view, confidence highlighting, inline edit,
  batch mark, progress tracking

Track C (Docs):
- OCR-Pipeline.md v5.0 (Steps 5e-5h)
- Regression testing guide
- mkdocs.yml nav update

Track D (Infra):
- TrOCR baseline benchmark script
- run-regression.sh shell script
- Migration 008: regression_runs table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 09:21:27 +01:00
Benjamin Admin
f5d5d6c59c docs: add Vision, Roadmap, and Hardware strategy to MkDocs
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Add three new Projekt documentation pages covering product vision
(offline-first desktop app for teachers), 6-phase development roadmap,
and 3-tier hardware strategy with distribution plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-23 08:54:22 +01:00
Benjamin Admin
4a44ad7986 fix: hard-filter OCR words inside detected graphic regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
Run detect_graphic_elements() in the grid pipeline after image loading
and remove ALL words whose centroids fall inside detected graphic regions,
regardless of confidence. Previously only low-confidence words (conf < 50)
were removed, letting artifacts like "Tr", "Su" survive.
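
The hard filter amounts to a centroid-in-rectangle test that ignores confidence. A minimal sketch, assuming words and regions are dicts with x, y, w, h in the same pixel coordinates (the field names are hypothetical):

```python
def filter_words_in_graphics(words, regions):
    """Drop every word whose centroid lies inside any graphic region."""

    def inside(word, region):
        cx = word["x"] + word["w"] / 2
        cy = word["y"] + word["h"] / 2
        return (
            region["x"] <= cx <= region["x"] + region["w"]
            and region["y"] <= cy <= region["y"] + region["h"]
        )

    # Confidence is deliberately not consulted: high-confidence artifacts go too
    return [w for w in words if not any(inside(w, r) for r in regions)]
```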

Changes:
- grid_editor_api.py: Import and call detect_graphic_elements() at Step 3a,
  passing only significant words (len >= 3) to avoid short artifacts fooling
  the text-vs-graphic heuristic. Hard-filter all words in graphic regions.
- cv_graphic_detect.py: Lower density threshold from 20% to 5% for large
  regions (>100x80px) — photos/illustrations have low color saturation.
  Raise page-spanning limit from 50% to 60% width/height.

Tested: 5 ground-truth sessions pass regression (079cd0d9, d8533a2c,
2838c7a7, 4233d7e3, 5997b635). Session 5997 now detects 2 graphic regions
and removes 29 artifact words including "Tr" and "Su".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 10:18:23 +01:00
Benjamin Admin
7b3319be2e fix: merge syllable-split word_boxes + keep dictionary guide words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
OCR splits words at syllable marks into overlapping word_boxes (e.g.
"zu" + "tiefst" with 52% x-overlap). Step 5i previously removed the
lower-confidence box, losing the prefix. Now: when both boxes are
alphabetic text with 20-75% overlap, MERGE them into one word_box
("zutiefst") instead of removing.
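
A minimal sketch of the merge rule, under stated assumptions: the overlap ratio is measured against the narrower box, the merged confidence is the lower of the two, and the box keys and function names are hypothetical. The real Step 5i may define these differently.

```python
from typing import Any, Dict, Optional

Box = Dict[str, Any]  # assumed keys: text, conf, left, top, width, height


def x_overlap_ratio(a: Box, b: Box) -> float:
    """Horizontal overlap as a fraction of the narrower box (assumed definition)."""
    overlap = min(a["left"] + a["width"], b["left"] + b["width"]) - max(a["left"], b["left"])
    if overlap <= 0:
        return 0.0
    return overlap / min(a["width"], b["width"])


def merge_if_syllable_split(a: Box, b: Box, lo: float = 0.20, hi: float = 0.75) -> Optional[Box]:
    """Merge two alphabetic boxes with 20-75% x-overlap; return None otherwise."""
    if not (a["text"].isalpha() and b["text"].isalpha()):
        return None
    if not (lo <= x_overlap_ratio(a, b) <= hi):
        return None
    first, second = sorted((a, b), key=lambda w: w["left"])
    left = first["left"]
    top = min(a["top"], b["top"])
    right = max(a["left"] + a["width"], b["left"] + b["width"])
    bottom = max(a["top"] + a["height"], b["top"] + b["height"])
    return {
        "text": first["text"] + second["text"],  # left box supplies the prefix
        "conf": min(a["conf"], b["conf"]),       # assumption: keep the lower confidence
        "left": left,
        "top": top,
        "width": right - left,
        "height": bottom - top,
    }
```

With a 25px-wide "zu" at x=100 and a 60px-wide "tiefst" at x=112, the overlap is 13px, or 52% of the narrower box, so the two are merged rather than one being dropped.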

Also relaxed artifact cell filter: 2-char alphabetic text like "Zw"
(dictionary guide word) is no longer removed. Only non-alphabetic
short text like "a=" is filtered.
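
The relaxed rule for short cells might look like the sketch below. It covers only the short-text case described here; the function name and the max_len cutoff are assumptions, and the actual filter in Step 5j-pre may apply further conditions.

```python
def is_short_artifact(text: str, max_len: int = 2) -> bool:
    """Flag short cell text as an artifact only when it is not purely alphabetic."""
    stripped = text.strip()
    if not stripped or len(stripped) > max_len:
        return False  # empty or longer cells are outside this rule
    # str.isalpha() is Unicode-aware, so umlauts count as letters.
    return not stripped.isalpha()
```

Under this rule "Zw" is kept as a possible guide word, while "a=" is still flagged.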

Results for session 5997: "tiefst"→"zutiefst", "zu"→"zuständig",
"Zu die Zuschüsse"→"Zuschuss, die Zuschüsse", "Zw" restored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 08:21:00 +01:00
Benjamin Admin
882b177fc3 fix: remove image-area artifacts + fix heading false positive for dictionary entries
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Three fixes for dictionary page session 5997:

1. Heading detection: column_1 cells with article words (die/der/das)
   now count as content cells, preventing "die Zuschrift, die Zuschriften"
   from being falsely merged into a spanning heading cell.

2. Step 5j-pre: new artifact cell filter removes short garbled text from
   OCR on image areas (e.g. "7 EN", "Tr", "\\", "PEE", "a="). Cells
   survive earlier filters because their rows have real content in other
   columns. Also cleans up empty rows after removal.

3. Footer "PEE" auto-fixed: artifact filter removes the noise cell,
   empty row gets cleaned up, footer detection no longer sees it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 07:59:24 +01:00
Benjamin Admin
1fae39dbb8 fix: lower secondary column threshold + strip pipe chars from word_boxes
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 21s
CI / test-nodejs-website (push) Successful in 18s
Dictionary pages have 2 dictionary columns, each with article + headword
sub-columns. The right article column (die/der at x≈626) had only 14.3%
row coverage — below the 20% secondary threshold. Lowered to 12% so
dictionary article columns qualify. Also strip pipe characters from
individual word_box text (not just cell text) to remove OCR syllable
separation marks (e.g. "zu|trau|en" → "zutrauen").
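
Both changes are small enough to sketch. The helper names and the cell layout (a dict with text and word_boxes) are assumptions for illustration, not the project's actual functions.

```python
from typing import Any, Dict


def meets_secondary_threshold(rows_with_words: int, total_rows: int,
                              threshold: float = 0.12) -> bool:
    """Check whether a column's row coverage qualifies it as a secondary column."""
    if total_rows == 0:
        return False
    return rows_with_words / total_rows >= threshold


def strip_syllable_pipes(cell: Dict[str, Any]) -> Dict[str, Any]:
    """Remove '|' syllable marks from the cell text and from each word_box text."""
    cell["text"] = cell.get("text", "").replace("|", "")
    for word_box in cell.get("word_boxes", []):
        word_box["text"] = word_box["text"].replace("|", "")
    return cell
```

A column that appears in 3 of 21 rows has about 14.3% coverage: it fails the old 20% threshold but passes at 12%.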

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 07:44:03 +01:00
Benjamin Admin
46c8c28d34 fix: border strip pre-filter + 3-column detection for vocabulary tables
The border strip filter (Step 4e) used the LARGEST x-gap, which incorrectly
removed base words along with edge artifacts. It now uses a two-stage approach:
1. _filter_border_strip_words() pre-filters raw words BEFORE column detection,
   scanning from the page edge inward to find the FIRST significant gap (>30px)
2. Step 4e runs as a fallback only when the pre-filter did not apply

Session 4233 now correctly detects 3 columns (base word | oder | synonyms)
instead of 2. The border-strip cluster threshold is raised from 15% to 20% of
word_boxes to handle pages with many edge artifacts. All 4 ground-truth
sessions pass regression.
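
The first-gap scan can be sketched as below for the left page edge only. This is not the actual _filter_border_strip_words(): the function name, the box keys, and the reading of the 20% threshold as a fraction of all words are assumptions.

```python
from typing import Any, Dict, List

Box = Dict[str, Any]  # assumed keys: text, left, width


def filter_border_strip_sketch(words: List[Box], min_gap: float = 30,
                               max_fraction: float = 0.20) -> List[Box]:
    """Drop a small left-edge cluster separated from the content by the FIRST gap > min_gap."""
    if len(words) < 2:
        return words
    ordered = sorted(words, key=lambda w: w["left"])
    max_right = ordered[0]["left"] + ordered[0]["width"]
    for i in range(1, len(ordered)):
        gap = ordered[i]["left"] - max_right
        if gap > min_gap:
            # Only the first significant gap counts; later (internal) gaps are ignored.
            if i / len(ordered) < max_fraction:
                cutoff = ordered[i]["left"]
                return [w for w in words if w["left"] >= cutoff]
            return words
        max_right = max(max_right, ordered[i]["left"] + ordered[i]["width"])
    return words
```

Stopping at the first gap is the point of the change: a wider gap between content columns further inward can no longer be mistaken for the border-to-content gap.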

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 21:01:43 +01:00
Benjamin Admin
4000110501 fix: extend tiny symbol filter to all non-black colors, raise area to 200
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
Step 5i rule (a) only caught blue tiny symbols. Graphic fragments from
page illustrations (e.g. orange quote mark from man illustration) were
missed. Now filters any non-black colored word_box with area < 200 and
confidence < 85.
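
The extended rule reduces to a three-part predicate. In this sketch the field name color_name and the function name are hypothetical; only the thresholds (area < 200, confidence < 85) come from the description above.

```python
from typing import Any, Dict


def is_tiny_colored_symbol(word_box: Dict[str, Any],
                           max_area: int = 200, max_conf: int = 85) -> bool:
    """True for small, low-confidence word_boxes in any non-black color."""
    area = word_box["width"] * word_box["height"]
    is_colored = word_box.get("color_name", "black") != "black"  # assumed field
    return is_colored and area < max_area and word_box["conf"] < max_conf
```

Black text is never touched by this rule, so small punctuation in the body text survives regardless of size or confidence.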

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 18:05:31 +01:00
Benjamin Admin
2acf8696bf fix: correct border strip test data to avoid false internal gaps
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 36s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
Content word_boxes in the test used x-spacing (i%3)*100, which created
internal gaps larger than the border-to-content gap. Changed to
(i%2)*51 so content words overlap and the border gap remains dominant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 17:24:33 +01:00
Benjamin Admin
c0e1118870 feat: detect and remove page-border decoration strip artifacts (Step 4e)
Textbooks with decorative alphabet strips along page edges produce
OCR artifacts (scattered colored letters at x<150 while real content
starts at x>=179). Step 4e detects a significant x-gap (>30px) between
a small cluster (<15% of total word_boxes) near the page edge and the
main content, then removes the border-strip word_boxes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 17:20:45 +01:00
Benjamin Admin
f31a7175a2 fix: normalize word_box order to reading order for frontend display (Step 5j)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
The frontend renders colored cells from the word_boxes array order,
not from cell.text. After post-processing steps (5i bullet removal, etc.),
word_boxes could remain in their original insertion order instead of
left-to-right reading order. Step 5j now explicitly sorts word_boxes
using _group_words_into_lines before the result is built.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 19:21:37 +01:00
Benjamin Admin
bacbfd88f1 Fix word ordering in cell text rebuild (Steps 4c, 4d, 5i)
Cell text was rebuilt using naive (top, left) sorting after removing
word_boxes in Steps 4c/4d/5i. This produced wrong word order when
words on the same visual line had slightly different top values (1-6px).

Now uses _words_to_reading_order_text() which groups words into visual
lines by y-tolerance before sorting by x within each line, matching
the initial cell text construction in _build_cells.
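
The difference between the two orderings can be shown with a small sketch. It is not the project's _words_to_reading_order_text(): the function name, the y_tol default, and the choice to compare each word against the first word of the current line are assumptions.

```python
from typing import Any, Dict, List

Box = Dict[str, Any]  # assumed keys: text, top, left


def reading_order_text_sketch(words: List[Box], y_tol: int = 8) -> str:
    """Group words into visual lines by y-tolerance, then sort each line by x."""
    lines: List[List[Box]] = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Same line if the top is within y_tol of the line's first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    return " ".join(
        w["text"]
        for line in lines
        for w in sorted(line, key=lambda w: w["left"])
    )


def naive_text(words: List[Box]) -> str:
    """The previous behavior: a plain (top, left) sort."""
    return " ".join(w["text"] for w in sorted(words, key=lambda w: (w["top"], w["left"])))
```

When "Zuschrift" sits 2px higher than "die" on the same visual line, the naive sort puts it first, while the grouped version keeps the left-to-right order.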

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-20 18:45:33 +01:00