Add gutter repair step to OCR Kombi pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 41s
CI / test-go-edu-search (push) Successful in 36s
CI / test-python-klausur (push) Failing after 2m31s
CI / test-python-agent-core (push) Successful in 28s
CI / test-nodejs-website (push) Successful in 29s

New step "Wortkorrektur" between Grid-Review and Ground Truth that detects
and fixes words truncated or blurred at the book gutter (binding area) of
double-page scans. Uses pyspellchecker (DE+EN) for validation.

Two repair strategies:
- hyphen_join: words split across rows with missing chars (ve + künden → verkünden)
- spell_fix: garbled trailing chars from gutter blur (stammeli → stammeln)

Interactive frontend with per-suggestion accept/reject and batch controls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-04-10 18:50:16 +02:00
parent 21b69e06be
commit 71e1b10ac7
7 changed files with 1376 additions and 3 deletions

View File

@@ -15,6 +15,7 @@ import { StepOcr } from '@/components/ocr-kombi/StepOcr'
import { StepStructure } from '@/components/ocr-kombi/StepStructure'
import { StepGridBuild } from '@/components/ocr-kombi/StepGridBuild'
import { StepGridReview } from '@/components/ocr-kombi/StepGridReview'
import { StepGutterRepair } from '@/components/ocr-kombi/StepGutterRepair'
import { StepGroundTruth } from '@/components/ocr-kombi/StepGroundTruth'
import { useKombiPipeline } from './useKombiPipeline'
@@ -93,6 +94,8 @@ function OcrKombiContent() {
case 9:
return <StepGridReview sessionId={sessionId} onNext={handleNext} saveRef={gridSaveRef} />
case 10:
return <StepGutterRepair sessionId={sessionId} onNext={handleNext} />
case 11:
return (
<StepGroundTruth
sessionId={sessionId}