54 Commits

Author SHA1 Message Date
Benjamin Admin
d5f2ce4659 fix: Fabric.js v6 API compatibility + CLAUDE.md SSH commands
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
- Replace setBackgroundImage() with backgroundImage property (v6 breaking change)
- Replace setWidth/setHeight with Canvas constructor options
- Fix opacity handler to use direct property access
- Update CLAUDE.md: use git -C and docker compose -f instead of cd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 23:01:19 +01:00
Benjamin Admin
ab3ecc7c08 feat: OCR pipeline v2.1 – narrow column OCR, dewarp automation, Fabric.js editor
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 15s
Proposal B: Adaptive padding, crop upscaling, PSM selection, row-strip re-OCR
for narrow columns (<15% width) – expected accuracy boost 60-70% → 85-90%.

Proposal A: New text-line straightness detector (Method D), quality gate
(rejects counterproductive corrections), 2-pass projection refinement,
higher confidence thresholds – expected manual dewarp reduction to <10%.

Proposal C: Fabric.js canvas editor with drag/drop, inline editing, undo/redo,
opacity slider, zoom, PDF/DOCX export endpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 22:44:14 +01:00
Benjamin Admin
a610bc75ba fix: rename LLM-Korrektur to Korrektur in wizard stepper and types
2026-03-03 17:56:46 +01:00
Benjamin Admin
153f41358b fix: remove stale allCells dependency in emptyCellIds memo
2026-03-03 17:39:14 +01:00
Benjamin Admin
d1c8075da2 fix: three OCR pipeline UX improvements
1. Rename Step 6 label to "Korrektur" (was "OCR-Zeichenkorrektur")
2. Move _fix_character_confusion from pipeline Step 1 into
   llm_review_entries_streaming so corrections are visible in the UI:
   char changes (| → I, 1 → I, 8 → B) are now emitted as a batch event
   right after the meta event, appearing in the corrections list
3. StepReconstruction: all cells (including empty) are now rendered as
   editable inputs — removed filter that hid empty cells from the editor

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 17:31:55 +01:00
Benjamin Admin
123b7ada0b fix(columns): filter phantom narrow columns + rename step to OCR-Zeichenkorrektur
Phantom column fix:
Adjacent tiny gaps (e.g. 11px + 35px) can create very narrow columns
(< 3% of content width) with 0 words. These are scan artefacts, not
real columns. New Step 9 in detect_column_geometry():
- Filter columns where width < max(20px, 3% content_w) AND words < 3
- After filtering, extend each remaining column to close the gap with
  its right neighbor, and re-assign words to correct column

Example from logs: 5 columns → 4 columns (phantom at x=710, width=36px
eliminated; neighbors expanded to cover the gap)

UI rename:
- 'Schritt 6: LLM-Korrektur' → 'Schritt 6: OCR-Zeichenkorrektur'
- 'LLM-Korrektur starten' → 'Zeichenkorrektur starten'
- Error message updated accordingly
(No LLM involved anymore — spell-checker is the active engine)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:06:59 +01:00
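
Note on 123b7ada0b: the phantom-column rule can be sketched as below. This is a minimal illustration under stated assumptions, not the code in detect_column_geometry(). The `Column` dataclass and the name `filter_phantom_columns` are hypothetical; only the thresholds (20px, 3% of content width, fewer than 3 words) come from the commit message. Re-assigning words to the widened columns is left out.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Column:
    x: int      # left edge in pixels
    width: int  # width in pixels
    words: int  # number of words assigned to the column


def filter_phantom_columns(columns: List[Column], content_w: int) -> List[Column]:
    """Drop very narrow, nearly empty columns and close the gaps they leave."""
    min_width = max(20, 0.03 * content_w)
    kept = [
        c for c in sorted(columns, key=lambda c: c.x)
        if not (c.width < min_width and c.words < 3)
    ]
    result = []
    for i, col in enumerate(kept):
        if i + 1 < len(kept):
            # Extend the column so it ends where its right neighbor starts.
            col = replace(col, width=kept[i + 1].x - col.x)
        result.append(col)
    return result
```

With a content width of 1300px the minimum width is 39px, so a 36px column with 0 words is removed and its left neighbor grows to cover the gap.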
Benjamin Admin
ccba2bb887 fix(ocr-pipeline): show sub-columns in reconstruction and LLM review steps
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 21s
- Add marker/bbox_marker fields to WordEntry type
- Add page_ref/column_marker colors to StepReconstruction
- Make StepLlmReview table dynamic based on columns_used metadata,
  showing all detected columns (EN, DE, Example, page_ref, marker)
  instead of hardcoded EN/DE/Beispiel only

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 10:36:27 +01:00
Benjamin Admin
4d428980c1 refactor(word-step): make table fully generic and fix marker-only row filter
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m43s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
Frontend: Replace hardcoded EN/DE/Example vocab table with unified dynamic
table driven by columns_used from backend. Labeling, confirmation, counts,
and summary badges are now all cell-based instead of branching on isVocab.

Backend: Change _cells_to_vocab_entries() entry filter from checking only
english/german/example to checking ANY mapped field. This preserves rows
with only marker or source_page content, fixing the issue where marker
sub-columns disappeared at the end of OCR processing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:45:24 +01:00
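
Note on 4d428980c1: the changed entry filter ("ANY mapped field") might look roughly like this. The field names are taken from the commit messages in this log; the dict-based entry layout and the function names are assumptions made for illustration, not the actual _cells_to_vocab_entries() code.

```python
from typing import Dict, List

# Field names mentioned in the commit messages; the real mapping may differ.
MAPPED_FIELDS = ("english", "german", "example", "marker", "source_page")


def keep_entry(entry: Dict[str, str]) -> bool:
    """Keep a row if at least one mapped field contains non-blank text."""
    return any((entry.get(field) or "").strip() for field in MAPPED_FIELDS)


def filter_entries(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop only rows where every mapped field is empty."""
    return [e for e in entries if keep_entry(e)]
```

A row that carries only marker text passes this filter, whereas a check limited to english/german/example would have dropped it.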
Benjamin Admin
dea3349b23 fix(ocr-pipeline): preserve sub-column data in vocab table display
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Three fixes for sub-columns disappearing at end of streaming:

1. Backend: add column_marker mapping in _cells_to_vocab_entries()
   so marker text is included in vocab entries (not silently dropped)

2. Frontend types: add source_page and bbox_ref to WordEntry interface

3. Frontend table: show page_ref column (Seite) in vocab table when
   entries have source_page data, instead of only EN/DE/Example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:06:15 +01:00
Benjamin Admin
e718353d9f feat(ocr-pipeline): 6 systematic improvements for robustness, performance & UX
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
   cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
   (lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
   column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
   highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
   any step, with reprocess button on completed pipeline steps

Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:46:38 +01:00
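
Note on e718353d9f, item 2: the continuation-row heuristic (lowercase EN + empty DE) can be sketched as follows. The dict keys and the function name are hypothetical, and joining with a single space is an assumption; the commit message only states the detection rule and that the row is merged into the previous entry.

```python
from typing import Dict, List


def merge_continuation_rows(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge rows that look like wrapped text into the preceding entry."""
    merged: List[Dict[str, str]] = []
    for entry in entries:
        en = entry.get("english", "").strip()
        de = entry.get("german", "").strip()
        # Continuation: EN starts with a lowercase letter and DE is empty.
        is_continuation = bool(merged) and en[:1].islower() and not de
        if is_continuation:
            prev = merged[-1]
            prev["english"] = f"{prev['english']} {en}".strip()
        else:
            merged.append(dict(entry, english=en, german=de))
    return merged
```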
Benjamin Admin
dbf0db0c13 feat(ocr-pipeline): improve LLM review UI + add reconstruction step
StepLlmReview: Show full vocab table with image overlay, row-level
status tracking (pending/active/reviewed/corrected/skipped), and
auto-scroll during SSE streaming. Load previous results on mount.

StepReconstruction: New step 7 with editable text fields at original
bbox positions over dewarped image. Zoom controls, tab navigation,
color-coded columns, save to backend.

Backend: Add POST /sessions/{id}/reconstruction endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:19:21 +01:00
Benjamin Admin
2a493890b6 feat(ocr-pipeline): add SSE streaming and phonetic filter to LLM review
- Stream LLM review results batch-by-batch (8 entries per batch) via SSE
- Frontend shows live progress bar, batch log, and corrections appearing
- Skip entries with IPA phonetic transcriptions (already dictionary-corrected)
- Refactor llm_review_entries into reusable helpers for both streaming and non-streaming paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:46:06 +01:00
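
Note on 2a493890b6: batch-by-batch streaming over SSE can be sketched with a plain generator. The batch size of 8 comes from the commit message; the payload fields and the `review_batch` callback are hypothetical stand-ins for the real LLM call and event schema.

```python
import json
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 8  # entries per batch, as stated in the commit message


def batched(entries: List[Dict], size: int = BATCH_SIZE) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def sse_events(entries: List[Dict],
               review_batch: Callable[[List[Dict]], List[Dict]]) -> Iterator[str]:
    """Yield one SSE 'data:' frame per reviewed batch."""
    total = len(entries)
    done = 0
    for batch in batched(entries):
        corrections = review_batch(batch)  # hypothetical reviewer callback
        done += len(batch)
        payload = {"type": "batch", "processed": done,
                   "total": total, "corrections": corrections}
        # An SSE event is terminated by a blank line.
        yield f"data: {json.dumps(payload)}\n\n"
```

A web framework's streaming response can iterate over such a generator with the `text/event-stream` content type, and the frontend can update its progress bar from `processed` and `total`.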
Benjamin Admin
938d1d69cf feat(ocr-pipeline): add LLM-based OCR correction step (Step 6)
Replace the placeholder "Koordinaten" step with an LLM review step that
sends vocab entries to qwen3:30b-a3b via Ollama for OCR error correction
(e.g. "8en" → "Ben"). Teachers can review, accept/reject individual
corrections in a diff table before applying them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:13:17 +01:00
Benjamin Admin
6db3c02db4 fix(admin-lehrer): force unique build ID to bust browser caches
Next.js was producing the same chunk hash across builds, causing
browsers to serve stale cached JS even after redeployment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:54:05 +01:00
Benjamin Admin
50ad06f43a fix(ocr-pipeline): always run fresh word detection, skip stale cache
Word-lookup is now ~0.03s (vs seconds with per-cell Tesseract), so
always re-run detection when entering Step 5 instead of showing
potentially stale cached word_result from the session DB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:05:13 +01:00
Benjamin Admin
7f27783008 feat(ocr-pipeline): add SSE streaming for word recognition (Step 5)
Cells now appear one-by-one in the UI as they are OCR'd, with a live
progress bar, instead of waiting for the full result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:54:20 +01:00
Benjamin Admin
27b895a848 feat(ocr-pipeline): generic cell-grid with optional vocab mapping
Extract build_cell_grid() as layout-agnostic foundation from
build_word_grid(). Step 5 now produces a generic cell grid (columns x
rows) and auto-detects whether vocab layout is present. Frontend
dynamically switches between vocab table (EN/DE/Example) and generic
cell table based on layout type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:22:56 +01:00
Benjamin Admin
854d8b431b feat(rag-qa): add 14 missing PDF mappings for EDPB, ENISA, EDPS, TMG, UrhG
Adds entries for all regulation codes in REGULATIONS_IN_RAG that were
missing from RAG_PDF_MAPPING, fixing "Kein PDF-Mapping" messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:10:09 +01:00
Benjamin Admin
f2521d2b9e feat(ocr-pipeline): British/American IPA pronunciation choice
- Integrate Britfone dictionary (MIT, 15k British English IPA entries)
- Add pronunciation parameter: 'british' (default) or 'american'
- British uses Britfone (Received Pronunciation), falls back to CMU
- American uses eng_to_ipa/CMU, falls back to Britfone
- Frontend: dropdown to switch pronunciation, default = British
- API: ?pronunciation=british|american query parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:08:52 +01:00
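
Note on f2521d2b9e: the lookup order with fallback can be sketched as below. The two dictionaries are tiny stand-ins with placeholder values, not real Britfone or CMU entries, and `lookup_ipa` is a hypothetical name.

```python
from typing import Dict, Optional

# Placeholder data for illustration only; values are not real IPA entries.
BRITFONE: Dict[str, str] = {"water": "GB:water"}
CMU: Dict[str, str] = {"water": "US:water", "sidewalk": "US:sidewalk"}


def lookup_ipa(word: str, pronunciation: str = "british") -> Optional[str]:
    """Look up a word in the preferred dictionary, then in the fallback."""
    if pronunciation not in ("british", "american"):
        raise ValueError(f"unknown pronunciation: {pronunciation!r}")
    order = (BRITFONE, CMU) if pronunciation == "british" else (CMU, BRITFONE)
    key = word.lower()
    for source in order:
        if key in source:
            return source[key]
    return None  # word is in neither dictionary
```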
Benjamin Admin
954d21e469 fix: use local Inter font to avoid Google Fonts timeout in Docker build
The Docker container cannot reach Google Fonts, causing build failures.
Switch to bundled local font file using next/font/local.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:26:34 +01:00
Benjamin Admin
e3aa8e899e feat(rag-qa): add fullscreen mode for split-view chunk browser
Allows viewing chunks side-by-side with original PDF in fullscreen mode
for large screen QA review. Toggle via button or close with Escape key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:23:32 +01:00
Benjamin Admin
266b9dfad3 Fix PDF 404: default to bp_compliance_ce collection, add PDF existence check
Default collection changed from bp_compliance_gesetze (DE/AT/CH laws where
PDFs need manual download) to bp_compliance_ce (EU regulations where PDFs
are auto-downloaded). Added HEAD request check so missing PDFs show a clear
"PDF nicht vorhanden" message instead of a 404 in the iframe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:13:26 +01:00
Benjamin Admin
b48cd8bb46 Fix ChunkBrowserQA layout: proper height constraints, remove bottom nav duplication
- Root container uses calc(100vh - 220px) for fixed viewport height
- All flex children use min-h-0 to enable proper overflow scrolling
- Removed duplicate bottom nav buttons (Zurueck/Weiter) that appeared
  in the middle of the chunk text — navigation is only in the header now
- Chunk text panel scrolls internally with fixed header
- Added prominent article/section badges in header and panel header
- Added chunk length quality indicator (warns on very short/long chunks)
- Structural metadata keys (article, section, pages) sorted first
- Sidebar shows regulation name instead of code for better readability
- PDF viewer uses pages metadata from payload when available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 20:24:50 +01:00
Benjamin Admin
f7e0f2bb4f feat(ocr-pipeline): line breaks, hyphen rejoin & oversized row splitting
- Preserve \n between visual lines within cells (instead of joining with space)
- Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden)
- Split oversized rows (>1.5× median height) into sub-entries when EN/DE
  line counts match — deterministic fix for missed Step 4 row boundaries
- Frontend: render \n as <br/>, use textarea for multiline editing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 18:49:28 +01:00
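
Note on f7e0f2bb4f: rejoining words hyphenated across a line break can be sketched with one regular expression. Joining only when the next line starts with a lowercase letter is an assumption made here so that compounds such as "E-\nMail" keep their hyphen; the commit message does not say which rule the pipeline uses.

```python
import re

# A word character, a hyphen at the end of a visual line, then a lowercase letter.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(?=[a-zäöüß])")


def rejoin_hyphenated(text: str) -> str:
    """Remove the hyphen and line break inside a word split across lines."""
    return _HYPHEN_BREAK.sub(r"\1", text)
```

Ordinary line breaks are left untouched, so "\n" is still preserved between visual lines.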
Benjamin Admin
e7fb9d59f1 Fix ChunkBrowserQA: use regulation_id from Qdrant payload instead of regulation_code
The Qdrant collections use regulation_id (e.g. eu_2016_679) as the filter key,
not regulation_code (e.g. GDPR). Updated rag-constants.ts with correct qdrant_id
mappings from actual Qdrant data, fixed API to filter on regulation_id, and updated
ChunkBrowserQA to pass qdrant_id values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 18:22:12 +01:00
Benjamin Admin
8c42fefa77 feat(rag): add QA Split-View Chunk-Browser for ingestion verification
New ChunkBrowserQA component replaces inline chunk browser with:
- Document sidebar with live chunk counts per regulation (batched Qdrant count API)
- Sequential chunk navigation with arrow keys (1/N through all chunks of a document)
- Overlap display showing previous/next chunk boundaries (amber-highlighted)
- Split-view with original PDF via iframe (estimated page from chunk index)
- Adjustable chunks-per-page ratio for PDF page estimation

Extracts REGULATIONS_IN_RAG and REGULATION_INFO to shared rag-constants.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 17:46:11 +01:00
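
Note on 8c42fefa77: estimating the PDF page from a chunk index and an adjustable chunks-per-page ratio could be as simple as the following. The exact formula used by ChunkBrowserQA is not given in the commit message, so this is only one plausible reading; the function name is hypothetical.

```python
def estimate_pdf_page(chunk_index: int, chunks_per_page: float) -> int:
    """Return a 1-based page estimate for a 0-based chunk index."""
    if chunks_per_page <= 0:
        raise ValueError("chunks_per_page must be positive")
    return int(chunk_index // chunks_per_page) + 1
```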
Benjamin Admin
45435f226f feat(ocr-pipeline): line grouping fix + RapidOCR integration
Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to
correctly order words in multi-line cells (fixes word reordering bug).

RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX
Runtime, native ARM64). Engine selectable via dropdown in UI or
?engine= query param. Auto mode prefers RapidOCR when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 17:13:58 +01:00
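
Note on 45435f226f, Fix A: grouping word boxes into lines with an adaptive Y-tolerance can be sketched as below. This is not the code of _group_words_into_lines(); the box keys (`left`, `top`, `height`) and the tolerance of half the median word height are assumptions chosen for the example.

```python
from statistics import median
from typing import Dict, List


def _center_y(word: Dict) -> float:
    return word["top"] + word["height"] / 2


def group_words_into_lines(words: List[Dict]) -> List[List[Dict]]:
    """Group word boxes into visual lines: top to bottom, then left to right."""
    if not words:
        return []
    # Adaptive tolerance: scales with the typical word height on this page.
    tolerance = 0.5 * median(w["height"] for w in words)
    lines: List[List[Dict]] = []
    for word in sorted(words, key=_center_y):
        if lines:
            last = lines[-1]
            last_center = sum(_center_y(w) for w in last) / len(last)
            if abs(_center_y(word) - last_center) <= tolerance:
                last.append(word)
                continue
        lines.append([word])
    return [sorted(line, key=lambda w: w["left"]) for line in lines]
```

Sorting by X only within a line is what keeps words of a multi-line cell in reading order.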
Benjamin Admin
17604b8eb2 test: add tests for API proxy scroll/collection-count and Chunk-Browser logic
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m41s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 19s
42 tests covering:
- Qdrant scroll endpoint proxy (offset, limit, filters, text search)
- Collection-count endpoint
- REGULATION_SOURCES URL validation (IFRS, EFRAG, ENISA, NIST, OECD)
- Chunk-Browser collections, text search filtering, pagination state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 16:46:42 +01:00
Benjamin Admin
491df4e1b0 feat: add Chunk-Browser tab to RAG page
- New 'Chunk-Browser' tab for sequential chunk browsing
- Qdrant scroll API proxy (scroll + collection-count actions)
- Pagination with prev/next through all chunks in a collection
- Text search filter with highlighting
- Click to expand chunk and see all metadata
- 'In Chunks suchen' button now navigates to Chunk-Browser with correct collection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 09:35:52 +01:00
Benjamin Admin
954103cdf2 feat(ocr-pipeline): add Step 5 word recognition (grid from columns × rows)
Backend: build_word_grid() intersects column regions with content rows,
OCRs each cell with language-specific Tesseract, and returns vocabulary
entries with percent-based bounding boxes. New endpoints: POST /words,
GET /image/words-overlay, ground-truth save/retrieve for words.
Frontend: StepWordRecognition with overview + step-through labeling modes,
goToStep callback for row correction feedback loop.
MkDocs: OCR Pipeline documentation added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 02:18:29 +01:00
Benjamin Admin
47dc2e6f7a feat(rag): source URLs, low-chunk warnings & IFRS/EFRAG entries
- Add REGULATION_SOURCES map with 88 original document URLs for all
  regulations (EUR-Lex, gesetze-im-internet.de, RIS, Fedlex, etc.)
- Render "Originalquelle →" link in regulation detail panel
- Add amber warning indicator for suspiciously low chunk counts (<10)
- Add EU_IFRS_DE, EU_IFRS_EN, EFRAG_ENDORSEMENT to RAG tracking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 01:56:09 +01:00
Benjamin Admin
b58aecd081 feat(ocr-pipeline): add Step 4 row detection UI in admin frontend
Insert rows step between columns and words in the pipeline wizard.
Shows overlay image, row list with type badges, and ground truth controls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 01:28:05 +01:00
Benjamin Admin
c7ae44ff17 feat(rag): add 42 new regulations to RAG overview + update collection totals
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 33s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 23s
New regulations across bp_compliance_ce (11), bp_compliance_gesetze (31),
and bp_compliance_datenschutz (1). Collection totals updated:
gesetze 58304, ce 18183, datenschutz 2448, total 103912.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 01:04:27 +01:00
Benjamin Admin
b03cb0a1e6 Fix Landkarte tab crash: variable name shadowed isInRag function
Local variables named 'isInRag' shadowed the outer function, causing
"isInRag is not a function" error. Renamed to regInRag/codeInRag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 00:01:01 +01:00
Benjamin Admin
5a45cbf605 Update RAG page: Chunks/Status columns use hardcoded data, Key Intersections show RAG status
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 24s
CI / test-python-klausur (push) Failing after 1m36s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 15s
- Chunks column now uses getKnownChunks() instead of API-based getRegulationChunks()
- Status column uses isInRag() check (green/red) instead of ratio-based calculation
- Key Intersections chips show green/red with checkmark/cross based on RAG status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:53:21 +01:00
Benjamin Admin
2297f66edb feat(rag): Add RAG status indicators and 4 new EU regulations
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m39s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 23s
- Add REGULATIONS_IN_RAG Set tracking all 42 regulations currently in Qdrant
- Add 4 new regulation entries: E-Commerce-RL, Verbraucherrechte-RL,
  Digitale-Inhalte-RL, DMA (all ingested Feb 2026)
- Add RAG column to regulations table with green check/red x indicators
- Update Landkarte tab: green/x on industry cards, thematic clusters,
  and regulation matrix
- Replace old "Integrated Regulations" section with full RAG coverage overview
- Update hardcoded chunk counts (Templates: 7689, NiBiS: 7996)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:23:52 +01:00
Benjamin Admin
587b066a40 feat(ocr-pipeline): ground-truth comparison tool for column detection
Side-by-side view: auto result (readonly) vs GT editor where teacher
draws correct columns. Diff table shows Auto vs GT with IoU matching.
GT data persisted per session for algorithm tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 22:48:37 +01:00
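
Note on 587b066a40: IoU matching between auto-detected and ground-truth columns can be sketched on 1-D intervals, since columns differ mainly in their horizontal extent. The interval representation, the 0.5 threshold, and the function names are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (x_start, x_end) of a column


def iou_1d(a: Interval, b: Interval) -> float:
    """Intersection over union of two horizontal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_columns(auto: List[Interval], gt: List[Interval],
                  threshold: float = 0.5) -> List[Tuple[Optional[int], float]]:
    """For each GT column, return (index of best auto column or None, IoU)."""
    matches: List[Tuple[Optional[int], float]] = []
    for g in gt:
        scores = [iou_1d(a, g) for a in auto]
        best = max(range(len(auto)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            matches.append((best, scores[best]))
        else:
            matches.append((None, 0.0))  # no auto column overlaps enough
    return matches
```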
Benjamin Admin
bb879a03a8 feat(ocr-pipeline): add column_ignore type for margins/empty areas
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 08:51:56 +01:00
Benjamin Admin
f535d3c967 fix(ocr-pipeline): manual editor layout + no re-detection on cached result
- ManualColumnEditor now uses grid-cols-2 layout (image left, controls right)
  matching the normal view size so the image doesn't zoom in
- StepColumnDetection only runs auto-detection when no cached result exists;
  revisiting step 3 loads cached columns without re-running detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 08:45:49 +01:00
Benjamin Admin
7a3570fe46 feat(ocr-pipeline): manual column editor for Step 3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 08:27:54 +01:00
Benjamin Admin
1393a994f9 Flexible content-based column detection (2 phases)
Replaces hardcoded position rules with a two-stage system:
Phase A detects column geometry (clustering), Phase B classifies
types by content (language/role) with a 3-level fallback chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 23:33:35 +01:00
Benjamin Admin
cf27a95308 feat(ocr-pipeline): word-based 5-column detection for vocabulary pages
Replace projection-profile layout analysis with Tesseract word bounding
box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE,
markers, examples). Falls back to projection profiles when < 3 clusters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 23:08:14 +01:00
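
Note on cf27a95308: clustering word bounding boxes into columns, with a fallback signal when fewer than 3 clusters are found, can be sketched as below. Clustering on left edges with a fixed 30px gap is an assumption; the commit message only names word-box clustering and the "< 3 clusters" fallback.

```python
from typing import List, Optional


def cluster_left_edges(lefts: List[int], gap: int = 30) -> List[List[int]]:
    """Group sorted left edges; a jump larger than `gap` starts a new cluster."""
    clusters: List[List[int]] = []
    for x in sorted(lefts):
        if clusters and x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters


def detect_column_starts(lefts: List[int], gap: int = 30,
                         min_clusters: int = 3) -> Optional[List[int]]:
    """Return one start x per column, or None to signal the fallback path."""
    clusters = cluster_left_edges(lefts, gap)
    if len(clusters) < min_clusters:
        return None  # caller would fall back to projection profiles
    return [min(c) for c in clusters]
```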
Benjamin Admin
aa06ae0f61 feat: Persistent sessions (PostgreSQL) + column detection (Step 3)
Sessions are now stored in PostgreSQL instead of in memory.
New session list with name, date, and step. Sessions survive
browser refresh and container restart. Step 3 uses analyze_layout()
for automatic column detection with a colored overlay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:16:37 +01:00
Benjamin Admin
09b820efbe refactor(dewarp): replace displacement map with affine shear correction
The old displacement-map approach shifted entire rows by a parabolic
profile, creating a circle/barrel distortion. The actual problem is
a linear vertical shear: after deskew aligns horizontal lines, the
vertical column edges are still tilted by ~0.5°.

New approach:
- Detect shear angle from strongest vertical edge slope (not curvature)
- Apply cv2.warpAffine shear to straighten vertical features
- Manual slider: -2.0° to +2.0° in 0.05° steps
- Slider initializes to auto-detected shear angle
- Ground truth question: "Spalten vertikal ausgerichtet?"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:23:04 +01:00
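
Note on 09b820efbe: the affine shear can be written as a 2x3 matrix. The sketch below builds that matrix in plain Python and snaps a manual angle to the slider range from the commit message (-2.0° to +2.0° in 0.05° steps). Shearing about the vertical image center and the sign convention are assumptions; in the pipeline a matrix of this shape would be passed to cv2.warpAffine as a float array.

```python
import math
from typing import List, Tuple


def shear_matrix(angle_deg: float, height: int) -> List[List[float]]:
    """2x3 affine matrix for a horizontal shear about the vertical image center."""
    k = math.tan(math.radians(angle_deg))
    # x' = x + k * (y - height / 2),  y' = y
    return [[1.0, k, -k * height / 2.0],
            [0.0, 1.0, 0.0]]


def apply_affine(m: List[List[float]], point: Tuple[float, float]) -> Tuple[float, float]:
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def snap_slider(angle_deg: float, lo: float = -2.0, hi: float = 2.0,
                step: float = 0.05) -> float:
    """Clamp an angle to the slider range and round it to the nearest step."""
    clamped = max(lo, min(hi, angle_deg))
    return round(round(clamped / step) * step, 2)
```

Rows at the vertical center stay in place while rows above and below shift in opposite directions, which is what straightens a tilted vertical edge.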
Benjamin Admin
ff2bb79a91 fix(dewarp): change manual slider to percentage (0-200%) instead of raw multiplier
The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px)
directly, causing extreme distortion at values >1. New slider:
- 0% = no correction
- 100% = auto-detected correction (default)
- 200% = double correction
- Step size: 5%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:10:34 +01:00
Benjamin Admin
9df745574b fix(ocr-pipeline): dewarp visibility, grid on both sides, session persistence
- Fix dewarp method selection: prefer methods with >5px curvature over
  higher confidence (vertical_edge 79px was being ignored for text_baseline 2px)
- Add grid overlay on left image in Dewarp step for side-by-side comparison
- Add GET /sessions/{id} endpoint to reload session data
- StepDeskew accepts sessionId prop to restore state when navigating back
- SessionInfo type extended with optional deskew_result and dewarp_result

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 17:29:53 +01:00
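
Note on 9df745574b: the changed method selection (prefer methods with more than 5px curvature over higher confidence) can be sketched as below. The candidate dict layout and the confidence values in the usage example are hypothetical; the 5px threshold and the two curvature figures come from the commit message.

```python
from typing import Dict, List

MIN_CURVATURE_PX = 5.0  # threshold named in the commit message


def pick_dewarp_method(candidates: List[Dict]) -> Dict:
    """Prefer candidates with significant curvature, then highest confidence."""
    significant = [c for c in candidates if c["curvature_px"] > MIN_CURVATURE_PX]
    # If no candidate shows significant curvature, fall back to all of them.
    pool = significant or candidates
    return max(pool, key=lambda c: c["confidence"])


candidates = [
    {"method": "text_baseline", "curvature_px": 2.0, "confidence": 0.9},
    {"method": "vertical_edge", "curvature_px": 79.0, "confidence": 0.6},
]
chosen = pick_dewarp_method(candidates)
```

Here `chosen` is the vertical_edge candidate even though text_baseline reports the higher confidence. An empty candidate list raises ValueError in this sketch.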
Benjamin Admin
44e8c573af fix: restrict Deskew ground truth question to rotation
"Korrekt ausgerichtet?" → "Rotation korrekt?" with a note that
curvature/distortion is corrected in the next step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 17:16:24 +01:00
Benjamin Admin
589d2f811a feat: Dewarp correction as Step 2 in OCR pipeline (7 steps)
Implements book-curvature dewarping with two methods:
- Method A: vertical edge analysis (Sobel + 2nd-degree polynomial)
- Method B: text-line baseline (Tesseract + baseline curvature)
The best method is chosen automatically; manual slider (-3 to +3).

Backend: 3 new endpoints (auto/manual dewarp, ground truth)
Frontend: StepDewarp + DewarpControls, pipeline extended from 6 to 7 steps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 16:46:41 +01:00
Benjamin Admin
d552fd8b6b feat: OCR pipeline with 6-step wizard for page reconstruction
All checks were successful
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 38s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Successful in 1m46s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 22s
New route /ai/ocr-pipeline with step-by-step straightening (deskew),
grid overlay, and ground truth. Steps 2-6 are placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 15:38:08 +01:00
Benjamin Boenisch
6a53f8d79c refactor: Remove all SDK/compliance pages and API routes from admin-lehrer
SDK/compliance content belongs exclusively in admin-compliance (port 3007).
Removed:
- All (sdk)/ pages (document-crawler, dsb-portal, industry-templates, multi-tenant, sso)
- All api/sdk/ proxy routes
- All developers/sdk/ documentation pages
- Unused lib/sdk/ modules (kept: catalog-manager + its deps for dashboard)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 09:24:36 +01:00