- Document Chunk-Browser tab functionality and API
- Cover scroll endpoint, text search, pagination
- Document Originalquelle links and low-chunk warnings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PSM 7 (single line) missed the second line in cells with two lines.
PSM 6 handles multi-line content. Also fix sort order to Y-then-X
for correct reading order.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New 'Chunk-Browser' tab for sequential chunk browsing
- Qdrant scroll API proxy (scroll + collection-count actions)
- Pagination with prev/next through all chunks in a collection
- Text search filter with highlighting
- Click to expand chunk and see all metadata
- 'In Chunks suchen' button now navigates to Chunk-Browser with correct collection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend: build_word_grid() intersects column regions with content rows,
OCRs each cell with language-specific Tesseract, and returns vocabulary
entries with percent-based bounding boxes. New endpoints: POST /words,
GET /image/words-overlay, ground-truth save/retrieve for words.
Frontend: StepWordRecognition with overview + step-through labeling modes,
goToStep callback for row correction feedback loop.
MkDocs: OCR Pipeline documentation added.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add REGULATION_SOURCES map with 88 original document URLs for all
regulations (EUR-Lex, gesetze-im-internet.de, RIS, Fedlex, etc.)
- Render "Originalquelle →" link in regulation detail panel
- Add amber warning indicator for suspiciously low chunk counts (<10)
- Add EU_IFRS_DE, EU_IFRS_EN, EFRAG_ENDORSEMENT to RAG tracking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build a word-coverage mask so only pixels near Tesseract word bounding
boxes contribute to the horizontal projection. Image regions (high ink
but no words) are treated as white, preventing illustrations from
merging multiple vocabulary rows into one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Insert rows step between columns and words in the pipeline wizard.
Shows overlay image, row list with type badges, and ground truth controls.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Step 4 (row detection) between column detection and word recognition.
Uses horizontal projection profiles + whitespace gaps (same method as columns).
Includes header/footer classification via gap-size heuristics.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New regulations across bp_compliance_ce (11), bp_compliance_gesetze (31),
and bp_compliance_datenschutz (1). Collection totals updated:
gesetze 58304, ce 18183, datenschutz 2448, total 103912.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column detection now uses vertical projection profiles to find whitespace
gaps between columns, then validates gaps against word bounding boxes to
prevent splitting through words. Old clustering algorithm extracted as
fallback (_detect_columns_by_clustering) for pages with < 2 detected gaps.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Local variables named 'isInRag' shadowed the outer function, causing
"isInRag is not a function" error. Renamed to regInRag/codeInRag.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunks column now uses getKnownChunks() instead of API-based getRegulationChunks()
- Status column uses isInRag() check (green/red) instead of ratio-based calculation
- Key Intersections chips show green/red with checkmark/cross based on RAG status
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reduce left-side threshold from 35% to 20% of content width
- Strong language signal (eng/deu > 0.3) now prevents page_ref assignment
- Increase column_ignore word threshold from 3 to 8 for edge columns
- Apply language guard to Level 1 and Level 2 classification
Fixes: column with deu=0.921 was misclassified as page_ref because
reference score check ran before language analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add REGULATIONS_IN_RAG Set tracking all 42 regulations currently in Qdrant
- Add 4 new regulation entries: E-Commerce-RL, Verbraucherrechte-RL,
Digitale-Inhalte-RL, DMA (all ingested Feb 2026)
- Add RAG column to regulations table with green check/red x indicators
- Update Landkarte tab: green/x on industry cards, thematic clusters,
and regulation matrix
- Replace old "Integrated Regulations" section with full RAG coverage overview
- Update hardcoded chunk counts (Templates: 7689, NiBiS: 7996)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address 5 weaknesses found via ground-truth comparison on session df3548d1:
- Add column_ignore for edge columns with < 3 words (margin detection)
- Absorb tiny clusters (< 5% width) into neighbors post-merge
- Restrict page_ref to left 35% of content area across all 3 levels
- Loosen marker thresholds (width < 6%, words <= 15) and add strong
marker score for very narrow non-edge columns (< 4%)
- Add EN/DE position tiebreaker when language signals are both weak
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Side-by-side view: auto result (readonly) vs GT editor where teacher
draws correct columns. Diff table shows Auto vs GT with IoU matching.
GT data persisted per session for algorithm tuning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sub-alignments within a column (indented words, etc.) were 60-90px apart
and not getting merged at 3%. On a typical 5-col page (~1500px), 6% = ~90px
merges sub-alignments while keeping real column boundaries (~300px) separate.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clusters now track Y-positions of their words and filter by vertical
coverage (>=30% primary, >=15%+5words secondary) to reject noise from
indentations or page numbers. Merge distance widened to 3% content width.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ManualColumnEditor now uses grid-cols-2 layout (image left, controls right)
matching the normal view size so the image doesn't zoom in
- StepColumnDetection only runs auto-detection when no cached result exists;
revisiting step 3 loads cached columns without re-running detection
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ersetzt hardcodierte Positionsregeln durch ein zweistufiges System:
Phase A erkennt Spaltengeometrie (Clustering), Phase B klassifiziert
Typen per Inhalt (Sprache/Rolle) mit 3-stufiger Fallback-Kette.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace projection-profile layout analysis with Tesseract word bounding
box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE,
markers, examples). Falls back to projection profiles when < 3 clusters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sessions werden jetzt in PostgreSQL gespeichert statt in-memory.
Neue Session-Liste mit Name, Datum, Schritt. Sessions ueberleben
Browser-Refresh und Container-Neustart. Step 3 nutzt analyze_layout()
fuer automatische Spaltenerkennung mit farbigem Overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old displacement-map approach shifted entire rows by a parabolic
profile, creating a circle/barrel distortion. The actual problem is
a linear vertical shear: after deskew aligns horizontal lines, the
vertical column edges are still tilted by ~0.5°.
New approach:
- Detect shear angle from strongest vertical edge slope (not curvature)
- Apply cv2.warpAffine shear to straighten vertical features
- Manual slider: -2.0° to +2.0° in 0.05° steps
- Slider initializes to auto-detected shear angle
- Ground truth question: "Spalten vertikal ausgerichtet?"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px)
directly, causing extreme distortion at values >1. New slider:
- 0% = no correction
- 100% = auto-detected correction (default)
- 200% = double correction
- Step size: 5%
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract OCR + 70 Debian packages + pip dependencies are now in a
separate base image (klausur-base:latest) that is built once and reused.
A --no-cache build now only rebuilds the code layer (~seconds) instead
of re-downloading 33 MB of system packages (~9 minutes).
Rebuild base when requirements.txt or system deps change:
docker build -f klausur-service/Dockerfile.base -t klausur-base:latest klausur-service/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix dewarp method selection: prefer methods with >5px curvature over
higher confidence (vertical_edge 79px was being ignored for text_baseline 2px)
- Add grid overlay on left image in Dewarp step for side-by-side comparison
- Add GET /sessions/{id} endpoint to reload session data
- StepDeskew accepts sessionId prop to restore state when navigating back
- SessionInfo type extended with optional deskew_result and dewarp_result
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Korrekt ausgerichtet?" → "Rotation korrekt?" mit Hinweis,
dass Woelbung/Verzerrung im naechsten Schritt korrigiert wird.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Neue Route /ai/ocr-pipeline mit schrittweiser Begradigung (Deskew),
Raster-Overlay und Ground Truth. Schritte 2-6 als Platzhalter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- edu-search-service von breakpilot-pwa nach breakpilot-lehrer kopiert (ohne vendor)
- opensearch + edu-search-service in docker-compose.yml hinzugefuegt
- voice-service aus docker-compose.yml entfernt (jetzt in breakpilot-core)
- geo-service aus docker-compose.yml entfernt (nicht mehr benoetigt)
- CI/CD: edu-search-service zu Gitea Actions und Woodpecker hinzugefuegt
(Go lint, test mit go mod download, build, SBOM)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The formula int(17 - grade*3) was off by one for grades 1.0 and 1.3
due to integer truncation. Added +1 adjustment for grades < 2.0 to
match the intended mapping (1.0=15, 1.3=14).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The act_runner cannot create /home/act_runner cache dir inside
container images. Replace actions/checkout@v4 with manual
git clone using GITHUB_SERVER_URL and GITHUB_REPOSITORY env vars.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds .gitea/workflows/ci.yaml with lint and test jobs.
Runs on gitea.meghsakha.com with Gitea Actions runner.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Der Geo-Service (DSGVO-konforme Geografie-Lernplattform) war ein
nie genutzter Prototyp und wird nicht mehr benoetigt. Entfernt aus:
- Quellcode (geo-service/)
- CI/CD Pipeline (test, build, sbom, deploy steps)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- session_manager: add session to _local_cache in create_session()
so get_session() and get_active_sessions() work without Redis/Postgres
- message_bus: use asyncio.create_task() in request() for publish
so handler runs concurrently and timeout actually fires
All 76 tests pass now.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add complete service table with containers, ports, and tech stack
- Add Core dependency table
- Add URLs section with Lehrer-Tools
- Add deployment and git instructions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>