Add OCR Pipeline Extensions developer docs + update vocab-worksheet docs
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 39s
CI / test-python-klausur (push) Failing after 2m36s
CI / test-python-agent-core (push) Successful in 26s
CI / test-nodejs-website (push) Successful in 40s
New: .claude/rules/ocr-pipeline-extensions.md
- Complete documentation for SmartSpellChecker, Box-Grid-Review (Step 11), Ansicht/Spreadsheet (Step 12), Unified Grid
- All 14 pipeline steps listed
- Backend/frontend file structure with line counts
- 66 tests documented
- API endpoints, data flow, formatting rules

Updated: .claude/rules/vocab-worksheet.md
- Added Frontend Refactoring section (page.tsx → 14 files)
- Updated format extension instructions (constants.ts instead of page.tsx)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
237
.claude/rules/ocr-pipeline-extensions.md
Normal file
@@ -0,0 +1,237 @@
# OCR Pipeline Extensions - Developer Documentation

**Status:** Production
**Last updated:** 2026-04-15
**URL:** https://macmini:3002/ai/ocr-kombi

---

## Overview

Extensions to the OCR Kombi pipeline (14 steps, numbered 0-13):

- **SmartSpellChecker**: LLM-free OCR correction with language detection
- **Box-Grid-Review** (Step 11): processes embedded boxes
- **Ansicht/Spreadsheet** (Step 12): Fortune Sheet Excel-style editor

---

## Pipeline Steps

| Step | ID | Name (UI label) | Component |
|------|----|-----------------|-----------|
| 0 | upload | Upload | StepUpload |
| 1 | orientation | Orientierung | StepOrientation |
| 2 | page-split | Seitentrennung | StepPageSplit |
| 3 | deskew | Begradigung | StepDeskew |
| 4 | dewarp | Entzerrung | StepDewarp |
| 5 | content-crop | Zuschneiden | StepContentCrop |
| 6 | ocr | OCR | StepOcr |
| 7 | structure | Strukturerkennung | StepStructure |
| 8 | grid-build | Grid-Aufbau | StepGridBuild |
| 9 | grid-review | Grid-Review | StepGridReview |
| 10 | gutter-repair | Wortkorrektur | StepGutterRepair |
| **11** | **box-review** | **Box-Review** | **StepBoxGridReview** |
| **12** | **ansicht** | **Ansicht** | **StepAnsicht** |
| 13 | ground-truth | Ground Truth | StepGroundTruth |

Step definitions: `admin-lehrer/app/(admin)/ai/ocr-kombi/types.ts`

---

## SmartSpellChecker

**File:** `klausur-service/backend/smart_spell.py`
**Tests:** `tests/test_smart_spell.py` (43 tests)
**License:** pyspellchecker only (MIT); no LLM, no Hunspell

### Features

| Feature | Method |
|---------|--------|
| Language detection | Dual-dictionary EN/DE heuristic |
| a/I disambiguation | Bigram context (lookup of the following word) |
| Boundary repair | Frequency-based: `Pound sand`→`Pounds and` |
| Context split | `anew`→`a new` (allow/deny list) |
| Multi-digit | BFS: `sch00l`→`school` |
| Cross-language guard | Do not miscorrect DE words in an EN column |
| Umlaut correction | `Schuler`→`Schueler` |
| IPA protection | Never change content in [brackets] |
| Slash→l | `p/`→`pl` (italic l recognized as /) |
| Abbreviations | 120+ from `_KNOWN_ABBREVIATIONS` |
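
The multi-digit repair listed in the table above can be illustrated with a small breadth-first search over digit substitutions. This is a minimal sketch, not the code in `smart_spell.py`: the `DIGIT_TO_LETTERS` confusion table, the function name `repair_digits`, and the plain-set vocabulary are assumptions made for illustration.

```python
from collections import deque
from typing import Optional, Set

# Hypothetical digit-to-letter confusions; the real table may differ.
DIGIT_TO_LETTERS = {"0": "o", "1": "li", "5": "s", "8": "b"}


def repair_digits(word: str, vocabulary: Set[str]) -> Optional[str]:
    """Breadth-first search: replace confusable digits, leftmost first."""
    queue = deque([word])
    seen = {word}
    while queue:
        candidate = queue.popleft()
        # Position of the first digit that has a known letter confusion.
        pos = next(
            (i for i, ch in enumerate(candidate) if ch in DIGIT_TO_LETTERS),
            None,
        )
        if pos is None:
            # Nothing left to substitute: accept only dictionary words.
            if candidate.lower() in vocabulary:
                return candidate
            continue
        for letter in DIGIT_TO_LETTERS[candidate[pos]]:
            nxt = candidate[:pos] + letter + candidate[pos + 1:]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

A word with several digits, such as `sch00l`, is resolved one digit at a time, and a candidate is returned only if it appears in the vocabulary.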

### Integration

```python
# In cv_review.py (LLM review step):
from smart_spell import SmartSpellChecker

_smart = SmartSpellChecker()
# `text` is the OCR string to correct
result = _smart.correct_text(text, lang="en")  # or "de" or "auto"

# In grid_editor_api.py (grid build + box build):
# Applied automatically after the grid build and the box-grid build
```

### Frequency Scoring

Boundary repair compares products of word frequencies:

- `old_freq = word_freq(w1) * word_freq(w2)`
- `new_freq = word_freq(repaired_w1) * word_freq(repaired_w2)`
- The repair is accepted if `new_freq > old_freq * 5`
- The abbreviation bonus applies only if the original words are rare (freq < 1e-6)
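
The acceptance rule above can be sketched as follows. The frequency table `WORD_FREQ` holds made-up values for illustration only, `accept_boundary_repair` is a hypothetical name, and the abbreviation bonus is left out.

```python
# Hypothetical relative word frequencies, for illustration only.
WORD_FREQ = {"pound": 2e-5, "sand": 1e-5, "pounds": 3e-5, "and": 2e-2}


def word_freq(word: str) -> float:
    """Look up a word frequency; unknown words count as 0."""
    return WORD_FREQ.get(word.lower(), 0.0)


def accept_boundary_repair(w1: str, w2: str,
                           repaired_w1: str, repaired_w2: str,
                           factor: float = 5.0) -> bool:
    """Accept the repair only if the new pair is clearly more frequent."""
    old_freq = word_freq(w1) * word_freq(w2)
    new_freq = word_freq(repaired_w1) * word_freq(repaired_w2)
    return new_freq > old_freq * factor
```

With these sample frequencies, moving the `s` in `Pound sand` to form `Pounds and` is accepted, while the reverse direction is rejected.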

---

## Box-Grid-Review (Step 11)

**Frontend:** `admin-lehrer/components/ocr-kombi/StepBoxGridReview.tsx`
**Backend:** `klausur-service/backend/cv_box_layout.py`, `grid_editor_api.py`
**Tests:** `tests/test_box_layout.py` (13 tests)

### Backend Endpoints

```
POST /api/v1/ocr-pipeline/sessions/{id}/build-box-grids
```

Processes all detected boxes from `structure_result`:

1. Filters out header/footer boxes (top/bottom 7% of the image height)
2. Extracts the OCR words for each box from `raw_paddle_words`
3. Classifies the layout: `flowing` | `columnar` | `bullet_list` | `header_only`
4. Builds the grid with layout-specific logic
5. Applies SmartSpellChecker
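
Step 1 of the list above, the header/footer filter, could look roughly like this. It is a sketch under assumptions: boxes are taken to be dicts with `y` (top edge) and `h` (height) in pixels, and a box counts as header or footer only if it lies entirely inside the 7% margin. The function name is hypothetical.

```python
from typing import Dict, List


def filter_header_footer_boxes(boxes: List[Dict[str, float]],
                               image_height: float,
                               margin: float = 0.07) -> List[Dict[str, float]]:
    """Drop boxes that sit fully in the top or bottom margin of the page."""
    top_limit = image_height * margin
    bottom_limit = image_height * (1.0 - margin)
    kept = []
    for box in boxes:
        top = box["y"]
        bottom = box["y"] + box["h"]
        in_header = bottom <= top_limit   # entirely above the top margin line
        in_footer = top >= bottom_limit   # entirely below the bottom margin line
        if not (in_header or in_footer):
            kept.append(box)
    return kept
```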

### Box Layout Classification (`cv_box_layout.py`)

| Layout | Detection | Grid construction |
|--------|-----------|-------------------|
| `header_only` | ≤5 words or 1 line | 1 cell, everything merged |
| `flowing` | Uniform line width | 1 column, bullets grouped by indentation |
| `bullet_list` | ≥40% of lines have a bullet marker | 1 column, bullet items |
| `columnar` | Multiple X clusters | Standard column detection |
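
The thresholds in the table above can be combined into a simple decision function. This is a sketch, not the logic in `cv_box_layout.py`: the order of the checks, the set of bullet markers, the treatment of `flowing` as the fallback, and the precomputed `x_cluster_count` input are all assumptions.

```python
from typing import List

# Hypothetical set of bullet markers.
BULLET_MARKERS = ("•", "-", "*")


def classify_box_layout(lines: List[str], x_cluster_count: int) -> str:
    """Classify a box from its text lines and number of left-edge clusters."""
    word_count = sum(len(line.split()) for line in lines)
    if word_count <= 5 or len(lines) <= 1:
        return "header_only"
    bullet_lines = sum(
        1 for line in lines if line.lstrip().startswith(BULLET_MARKERS)
    )
    if bullet_lines / len(lines) >= 0.4:
        return "bullet_list"
    if x_cluster_count > 1:
        return "columnar"
    # Assumed fallback: everything else is treated as flowing text.
    return "flowing"
```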

### Bullet Indentation

Detection via left-edge analysis:

- Minimum indentation = bullet level
- Lines indented by >15px more = continuation lines
- Continuation lines are merged into the bullet cell with `\n`
- Missing `•` markers are added automatically
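
The left-edge rules above might be implemented along these lines. The input shape (a list of `(left_x, text)` tuples in reading order) and the function name are assumptions for illustration.

```python
from typing import List, Tuple


def group_bullet_lines(lines: List[Tuple[float, str]],
                       indent_threshold: float = 15.0,
                       marker: str = "•") -> List[str]:
    """Merge indented continuation lines into the preceding bullet cell."""
    if not lines:
        return []
    bullet_x = min(x for x, _ in lines)  # minimum indentation = bullet level
    cells: List[str] = []
    for x, text in lines:
        if x - bullet_x > indent_threshold and cells:
            # Continuation line: append to the current bullet cell.
            cells[-1] += "\n" + text
        else:
            # Bullet-level line: add the marker if OCR missed it.
            if not text.startswith(marker):
                text = f"{marker} {text}"
            cells.append(text)
    return cells
```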

### Colspan Detection (`grid_editor_helpers.py`)

Generic function `_detect_colspan_cells()`:

- Runs after `_build_cells()` for ALL zones
- Uses the original word blocks (before `_split_cross_column_words`)
- A word block that extends across a column boundary becomes a `spanning_header` with `colspan=N`
- Example: "In Britain you pay with pounds and pence." across 2 columns
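
The core test, whether a word block reaches across a column boundary, can be sketched as an interval-overlap check. This is not the body of `_detect_colspan_cells()`: the function name `find_colspan`, the column representation as `(left, right)` pixel ranges, and the `min_overlap` tolerance are assumptions.

```python
from typing import List, Optional, Tuple


def find_colspan(block_left: float, block_right: float,
                 columns: List[Tuple[float, float]],
                 min_overlap: float = 10.0) -> Optional[Tuple[int, int]]:
    """Return (start_column, colspan) if the block covers two or more columns."""
    covered = []
    for index, (col_left, col_right) in enumerate(columns):
        # Horizontal overlap between the word block and this column.
        overlap = min(block_right, col_right) - max(block_left, col_left)
        if overlap > min_overlap:
            covered.append(index)
    if len(covered) >= 2:
        return covered[0], len(covered)
    return None
```

A block that stays inside one column yields `None`; a sentence running over two columns yields the start column and `colspan=2`.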

### Column Detection in Boxes

For small zones (≤60 words):

- `gap_threshold = max(median_h * 1.0, 25)` instead of `3x median`
- PaddleOCR returns multi-word blocks, so all gaps are column gaps
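
A sketch of the threshold rule and a gap-based column split follows. It assumes that `3x median` refers to three times the same `median_h`, and that blocks are given as `(left, right)` pixel extents. Both function names are hypothetical.

```python
from typing import List, Tuple


def column_gap_threshold(word_count: int, median_h: float,
                         small_zone_max_words: int = 60) -> float:
    """Smaller gap threshold for small zones, 3x median otherwise."""
    if word_count <= small_zone_max_words:
        return max(median_h * 1.0, 25)
    return median_h * 3.0


def split_columns(blocks: List[Tuple[float, float]],
                  threshold: float) -> List[List[Tuple[float, float]]]:
    """Start a new column whenever the horizontal gap exceeds the threshold."""
    columns: List[List[Tuple[float, float]]] = []
    current_right = 0.0
    for left, right in sorted(blocks):
        if not columns or left - current_right > threshold:
            columns.append([(left, right)])
            current_right = right
        else:
            columns[-1].append((left, right))
            current_right = max(current_right, right)
    return columns
```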

---

## Ansicht / Spreadsheet (Step 12)

**Frontend:** `admin-lehrer/components/ocr-kombi/StepAnsicht.tsx`, `SpreadsheetView.tsx`
**Library:** `@fortune-sheet/react` (MIT, v1.0.4)

### Architecture

Split view:

- **Left:** Original scan with OCR overlay (`/image/words-overlay`)
- **Right:** Fortune Sheet spreadsheet with multi-sheet tabs

### Multi-Sheet Approach

Each zone becomes its own sheet tab:

- Sheet "Vokabeln": main grid with EN/DE columns
- Sheet "Pounds and euros": Box 1 with its own 4 columns
- Sheet "German leihen": Box 2 as flowing text

Reason: column widths are optimized separately for each zone. Spreadsheet limitation (as in Excel): a column width applies to the entire column.

### Cell Formatting

| Format | Source | Fortune Sheet property |
|--------|--------|------------------------|
| Bold | `is_header`, `is_bold`, larger font | `bl: 1` |
| Font color | OCR word_boxes color | `fc: '#hex'` |
| Background | Box bg_hex, header | `bg: '#hex08'` |
| Text wrap | Multi-line cells (\n) | `tb: '2'` |
| Vertical top | Multi-line cells | `vt: 0` |
| Larger font | word_box height >1.3x median | `fs: 12` |
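
The mapping in the table above can be expressed as a function that builds the style properties for one cell. The actual mapping lives in the TypeScript frontend; this Python sketch only mirrors the table. The input field names (`text`, `color`, `bg_hex`, `word_height`) are assumptions, `'#hex08'` is read as the box color with an `08` alpha suffix, and the header background is not modeled.

```python
from typing import Any, Dict


def cell_style(cell: Dict[str, Any], median_height: float) -> Dict[str, Any]:
    """Derive Fortune Sheet style properties from OCR cell metadata."""
    style: Dict[str, Any] = {}
    larger_font = cell.get("word_height", 0) > 1.3 * median_height
    if cell.get("is_header") or cell.get("is_bold") or larger_font:
        style["bl"] = 1                      # bold
    if larger_font:
        style["fs"] = 12                     # larger font size
    if cell.get("color"):
        style["fc"] = cell["color"]          # font color
    if cell.get("bg_hex"):
        style["bg"] = cell["bg_hex"] + "08"  # assumed: hex color + alpha suffix
    if "\n" in cell.get("text", ""):
        style["tb"] = "2"                    # wrap text
        style["vt"] = 0                      # align to top
    return style
```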

### Column Widths

Auto-fit: `max(laengster_text * 7.5 + 16, original_px * scaleFactor)` (`laengster_text` = longest text in the column)
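
As a worked example of the auto-fit formula, the following sketch assumes that the longest text is measured in characters, with 7.5 px per character and 16 px of padding taken from the formula. The function name and parameters are hypothetical.

```python
from typing import List


def auto_fit_width(texts: List[str], original_px: float, scale_factor: float,
                   char_px: float = 7.5, padding_px: float = 16.0) -> float:
    """Column width: the larger of text-based width and scaled scan width."""
    longest = max((len(text) for text in texts), default=0)
    return max(longest * char_px + padding_px, original_px * scale_factor)
```

For a column whose longest entry is `das Haus` (8 characters), the text-based width is 8 * 7.5 + 16 = 76 px, which wins over a 40 px original width at scale 1.0.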

### Toolbar

`undo, redo, font-bold, font-italic, font-strikethrough, font-color, background, font-size, horizontal-align, vertical-align, text-wrap, merge-cell, border`

---

## Unified Grid (Backend)

**File:** `klausur-service/backend/unified_grid.py`
**Tests:** `tests/test_unified_grid.py` (10 tests)

Merges all zones into a single grid (for export/analysis):

```
POST /api/v1/ocr-pipeline/sessions/{id}/build-unified-grid
GET  /api/v1/ocr-pipeline/sessions/{id}/unified-grid
```

- Dominant row height = median of the content-row spacings
- Full-width boxes: rows are integrated directly
- Partial-width boxes: extra rows are inserted if the box has more rows
- Box cells carry `source_zone_type: "box"` and `box_region` metadata
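
Two of the rules above can be sketched in a few lines. The input (a list of row top positions in pixels) and both function names are assumptions. The extra-row rule is read here as "insert the difference when a partial-width box has more rows than the content rows it sits beside".

```python
from statistics import median
from typing import List, Optional


def dominant_row_height(row_tops: List[float]) -> Optional[float]:
    """Median spacing between consecutive content rows."""
    tops = sorted(row_tops)
    gaps = [b - a for a, b in zip(tops, tops[1:])]
    return median(gaps) if gaps else None


def extra_rows_needed(box_row_count: int, overlapping_content_rows: int) -> int:
    """Rows to insert so a partial-width box fits next to the main grid."""
    return max(0, box_row_count - overlapping_content_rows)
```

Using the median keeps a single large gap, for example around an embedded box, from distorting the row height.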

---

## File Structure

### Backend (klausur-service)

| File | Lines | Description |
|------|-------|-------------|
| `grid_build_core.py` | 1943 | `_build_grid_core()`: main grid build |
| `grid_editor_api.py` | 474 | REST endpoints (build, save, get, gutter, box, unified) |
| `grid_editor_helpers.py` | 1737 | Helpers: columns, rows, cells, colspan, header |
| `smart_spell.py` | 587 | SmartSpellChecker |
| `cv_box_layout.py` | 339 | Box layout classification + grid build |
| `unified_grid.py` | 425 | Unified grid builder |

### Frontend (admin-lehrer)

| File | Lines | Description |
|------|-------|-------------|
| `StepBoxGridReview.tsx` | 283 | Box review, Step 11 |
| `StepAnsicht.tsx` | 112 | Ansicht, Step 12 (split view) |
| `SpreadsheetView.tsx` | ~160 | Fortune Sheet integration |
| `GridTable.tsx` | 652 | Grid editor table (Steps 9-11) |
| `useGridEditor.ts` | 985 | Grid editor hook |

### Tests

| File | Tests | Description |
|------|-------|-------------|
| `test_smart_spell.py` | 43 | Language detection, boundary repair, IPA protection |
| `test_box_layout.py` | 13 | Layout classification, bullet grouping |
| `test_unified_grid.py` | 10 | Unified grid, box classification |
| **Total** | **66** | |

---

## Change History

| Date | Change |
|------|--------|
| 2026-04-15 | Fortune Sheet multi-sheet tabs, bullet points, auto-fit, refactoring |
| 2026-04-14 | Unified Grid, Ansicht step, colspan detection |
| 2026-04-13 | Box-Grid-Review step, columns in boxes, header/footer filter |
| 2026-04-12 | SmartSpellChecker, frequency scoring, IPA protection, vocab-worksheet refactoring |

@@ -188,11 +188,35 @@ ssh macmini "docker compose up -d klausur-service studio-v2"

---

## Frontend Refactoring (2026-04-12)

`page.tsx` was split from 2337 lines into 14 files:

```
studio-v2/app/vocab-worksheet/
├── page.tsx                    # 198 lines, orchestrator
├── types.ts                    # Interfaces, VocabWorksheetHook
├── constants.ts                # API base, formats, defaults
├── useVocabWorksheet.ts        # 843 lines, custom hook (all state + logic)
└── components/
    ├── UploadScreen.tsx        # Session list + document selection
    ├── PageSelection.tsx       # PDF page selection
    ├── VocabularyTab.tsx       # Vocabulary table + IPA/syllables
    ├── WorksheetTab.tsx        # Format selection + configuration
    ├── ExportTab.tsx           # PDF download
    ├── OcrSettingsPanel.tsx    # OCR filter settings
    ├── FullscreenPreview.tsx   # Fullscreen preview modal
    ├── QRCodeModal.tsx         # QR upload modal
    └── OcrComparisonModal.tsx  # OCR comparison modal
```

---

## Extension: Adding New Formats

1. **Backend**: Create a new generator in `klausur-service/backend/`
2. **API**: Add a new endpoint to `vocab_worksheet_api.py`
3. **Frontend**: Add the format to the `worksheetFormats` array in `constants.ts`
4. **Docs**: Update this file

---