Commit Graph

54 Commits

Author SHA1 Message Date
Benjamin Admin
3028f421b4 feat(ocr-pipeline): add cell text noise filter for OCR artifacts
Add _clean_cell_text() with three sub-filters to remove OCR noise:
- _is_garbage_text(): vowel/consonant ratio check for phantom row garbage
- _is_noise_tail_token(): dictionary-based trailing noise detection
- _RE_REAL_WORD check for cells with no real words (just fragments)

Handles balanced parentheses "(auf)" and trailing hyphens "under-"
as legitimate tokens while stripping noise like "Es)", "3", "ee", "B".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:19:31 +01:00
Benjamin Admin
2b1c499d54 fix(ocr-pipeline): filter OCR noise from image areas and artifacts
Two generic noise filters added to _ocr_single_cell():

1. Word confidence filter (conf < 30): removes low-confidence words
   before text assembly.  Catches trailing artifacts like "Es)" after
   real text, and standalone noise from image edges.

2. Cell noise filter: clears cells whose entire text has no real
   alphabetic word (>= 2 letters).  Catches fragments like "E:", "3",
   "u", "D", "2.77", "and )" from image areas, while keeping real
   short words like "Ei", "go", "an".

Both filters apply to word-lookup AND cell-OCR fallback results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:56:54 +01:00
Benjamin Admin
72cc77dcf4 fix(ocr-pipeline): cells = result, no post-processing content shuffling
The cell grid IS the result. Each cell stays at its detected position.

Removed _split_comma_entries and _attach_example_sentences from the
pipeline — they were shuffling content between rows/columns, causing
"Mäuse" to appear in a separate row, "stand..." to move to Example,
and "Ei" to disappear.

Now: cells → _cells_to_vocab_entries (1:1 row mapping) →
_fix_character_confusion → _fix_phonetic_brackets → done.

Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR
fallback so small text like "Ei" is not filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:41:30 +01:00
Benjamin Admin
e3f939a628 refactor(ocr-pipeline): make post-processing fully generic
Three non-generic solutions replaced with universal heuristics:

1. Cell-OCR fallback: instead of restricting to column_en/column_de,
   now checks pixel density (>2% dark pixels) for ANY column type.
   Truly empty cells are skipped without running Tesseract.

2. Example-sentence detection: instead of checking for example-column
   text (worksheet-specific), now uses sentence heuristics (>=4 words
   or ends with sentence punctuation). Short EN text without DE is
   kept as a vocab entry (OCR may have missed the translation).

3. Comma-split: re-enabled with singular/plural detection. Pairs like
   "mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like
   "break, broke, broken" are still split into individual entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:27:30 +01:00
Benjamin Admin
6bca3370e0 fix(ocr-pipeline): fix vocab post-processing destroying correct cell results
Three bugs in the post-processing pipeline were overwriting correct
streaming results with wrong ones:

1. _split_comma_entries was splitting "Maus, Mäuse" into two separate
   entries. Disabled — word forms belong together.

2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise due
   to `len(de) > 2` threshold. Lowered to `len(de) > 1`.

3. _attach_example_sentences wrongly classified rows with EN text but
   no DE (like "stand ...") as example sentences, merging them into
   the previous entry. Now only treats rows as examples if they also
   have no text in the example column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:16:50 +01:00
Benjamin Admin
befc44d2dd perf(ocr-pipeline): limit cell-OCR fallback to EN/DE columns only
Skip Tesseract fallback for column_example cells which are often
legitimately empty.  This reduces ~48 Tesseract calls to ~10,
cutting Step 5 fallback time from ~13s to ~3s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:01:08 +01:00
Benjamin Admin
8f2c2e8f68 feat(ocr-pipeline): hybrid word-lookup with cell-OCR fallback
Word-lookup from full-page Tesseract is fast but can miss small or
isolated words (e.g. "Ei"). Now falls back to per-cell Tesseract OCR
for cells that remain empty after word-lookup. The ocr_engine field
reports 'cell_ocr_fallback' for cells that needed the fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:21:12 +01:00
Benjamin Admin
2c4160e4c4 fix(ocr-pipeline): exclusive word-to-column assignment prevents duplicates
Replace per-cell word filtering (which allowed the same word to appear in
multiple columns due to padded overlap) with exclusive nearest-center
assignment. Each word is assigned to exactly one column per row.

Also use row height as Y-tolerance for text assembly so words within
the same row (e.g. "Maus, Mäuse") are always grouped on one line.

Fixes: words leaking into wrong columns, missing words, duplicate words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:54:45 +01:00
Benjamin Admin
9bbde1c03e fix(ocr-pipeline): re-populate row.words for word-lookup in Step 5
The row_result stored in DB excludes words to keep payload small.
When Step 5 reconstructs RowGeometry from DB, words were empty,
causing word-lookup to find nothing and return blank cells.

Now re-populates row.words from cached _word_dicts (or re-runs
detect_column_geometry if cache is cold) before cell grid building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:38:33 +01:00
Benjamin Admin
77869e32f4 feat(ocr-pipeline): use word-lookup instead of cell-OCR for cell grid
Replace per-cell Tesseract re-runs with lookup of pre-existing full-page
words from row.words. Words are filtered by X-overlap with column bounds.
This fixes phantom rows with garbage text, missing last words, and
incomplete example text by using the more reliable full-page OCR results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:24:46 +01:00
Benjamin Admin
89b5f49918 fix(ocr-pipeline): filter phantom rows with word_count=0 from cell grid
Rows in inter-line whitespace gaps have no Tesseract words assigned but
were still processed by build_cell_grid, producing garbage OCR output.
Filter these phantom rows using the word_count field set during Step 4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 18:40:13 +01:00
Benjamin Admin
7f27783008 feat(ocr-pipeline): add SSE streaming for word recognition (Step 5)
Cells now appear one-by-one in the UI as they are OCR'd, with a live
progress bar, instead of waiting for the full result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:54:20 +01:00
Benjamin Admin
a666e883da fix(ocr-pipeline): exclude header/footer/page_ref from cell grid columns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:33:48 +01:00
Benjamin Admin
27b895a848 feat(ocr-pipeline): generic cell-grid with optional vocab mapping
Extract build_cell_grid() as layout-agnostic foundation from
build_word_grid(). Step 5 now produces a generic cell grid (columns x
rows) and auto-detects whether vocab layout is present. Frontend
dynamically switches between vocab table (EN/DE/Example) and generic
cell table based on layout type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:22:56 +01:00
Benjamin Admin
3bcb7aa638 fix(ocr-pipeline): remove overzealous grid row count validation
The validation that rejected word-center grid when it produced more rows
than gap-based detection was causing fallback to gap-based rows (large
boxes). The word-center grid regularization works correctly after the
center-based grouping and cluster merging fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 13:01:27 +01:00
Benjamin Admin
c4f2e6554e fix(ocr-pipeline): prevent grid from producing more rows than gap-based
Two fixes:
1. Grid validation: reject word-center grid if it produces MORE rows
   than gap-based detection (more rows = lines were split = worse).
   Falls back to gap-based rows in that case.

2. Words overlay: draw clean grid cells (column × row intersections)
   instead of padded entry bboxes. Eliminates confusing double lines.
   OCR text labels are placed inside the grid cells directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:52:41 +01:00
Benjamin Admin
8e861e5a4d fix(ocr-pipeline): use gap-based row height for cluster tolerance
The y_tolerance for word-center clustering was based on median word
height (21px → 12px tolerance), which was too small. Words on the
same line can have centers 15-20px apart due to different heights.

Now uses 40% of the gap-based median row height as tolerance (e.g.
40px row → 16px tolerance), and 30% for merge threshold. This
produces correct cluster counts matching actual text lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:34:15 +01:00
Benjamin Admin
4970ca903e fix(ocr-pipeline): invalidate downstream results when steps are re-run
When columns change (Step 3), invalidate row_result and word_result.
When rows change (Step 4), invalidate word_result.
This ensures Step 5 always uses the latest row boundaries instead of
showing stale cached word_result from a previous run.

Applies to both auto-detection and manual override endpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:24:44 +01:00
Benjamin Admin
97d4355aa9 fix(ocr-pipeline): group words by vertical center, merge close clusters
Fix half-height rows caused by tall special characters (brackets, IPA
symbols) being split into separate line clusters:

- Group words by vertical CENTER instead of TOP position, so tall
  characters on the same line stay in one cluster
- Filter outlier-height words (>2× median) when computing letter_h
  so brackets/IPA don't skew the row height
- Merge clusters closer than 0.4× median word height (definitely
  same text line despite slight center differences)
- Increased y_tolerance from 0.5× to 0.6× median word height
- Enhanced logging with cluster merge count and row height range

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:14:42 +01:00
Benjamin Admin
8ad5823fd8 feat(ocr-pipeline): word-center grid with section-break detection
Replace rigid uniform grid with bottom-up approach that derives row
boundaries from word vertical centers:
- Group words into line clusters, compute center_y per cluster
- Compute pitch (distance between consecutive centers)
- Detect section breaks where gap > 1.8× median pitch
- Place row boundaries at midpoints between consecutive centers
- Per-section local pitch adapts to heading/paragraph spacing
- Validate ≥85% word placement, fallback to gap-based rows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:04:08 +01:00
Benjamin Admin
ec47045c15 feat(ocr-pipeline): uniform grid regularization for row detection (Step 7)
Replace _split_oversized_rows() with _regularize_row_grid(). When ≥60%
of content rows have consistent height (±25% of median), overlay a
uniform grid with the standard row height over the entire content area.
This leverages the fact that books/vocab lists use constant row heights.

Validates grid by checking ≥85% of words land in a grid row. Falls back
to gap-based rows if heights are too irregular or words don't fit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:50:50 +01:00
Benjamin Admin
ba65e47654 feat(ocr-pipeline): move oversized row splitting from Step 5 to Step 4
Implement _split_oversized_rows() in detect_row_geometry() (Step 7) to
split content rows >1.5× median height using local horizontal projection.
This produces correctly-sized rows before word OCR runs, instead of
working around the issue in Step 5 with sub-cell splitting hacks.

Removed Step 5 workarounds: _split_oversized_entries(), sub-cell
splitting in build_word_grid(), and median_row_h calculation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:46:18 +01:00
Benjamin Admin
8507e2e035 fix(ocr-pipeline): split oversized cells before OCR to capture all text
For cells taller than 1.5× median row height, split vertically into
sub-cells and OCR each separately. This fixes RapidOCR losing text
at the bottom of tall cells (e.g. "floor/Fußboden" below "egg/Ei"
in a merged row). Generic fix — works for any oversized cell.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:32:10 +01:00
Benjamin Admin
f2521d2b9e feat(ocr-pipeline): British/American IPA pronunciation choice
- Integrate Britfone dictionary (MIT, 15k British English IPA entries)
- Add pronunciation parameter: 'british' (default) or 'american'
- British uses Britfone (Received Pronunciation), falls back to CMU
- American uses eng_to_ipa/CMU, falls back to Britfone
- Frontend: dropdown to switch pronunciation, default = British
- API: ?pronunciation=british|american query parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:08:52 +01:00
Benjamin Admin
010616be5a fix(ocr-pipeline): generic example attachment + cell padding
1. Semantic example matching: instead of attaching example sentences
   to the immediately preceding entry, find the vocab entry whose
   English word(s) appear in the example. "a broken arm" → matches
   "broken" via word overlap, not "egg/Ei". Uses stem matching for
   word form variants (break/broken share stem "bro").

2. Cell padding: add 8px padding to each cell region so words at
   column/row edges don't get clipped by OCR (fixes "er wollte"
   missing at cell boundaries).

3. Treat very short DE text (≤2 chars) as OCR noise, not real
   translation — prevents false positives in example detection.

All fixes are generic and deterministic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:24:28 +01:00
Benjamin Admin
ab294d5a6f feat(ocr-pipeline): deterministic post-processing pipeline
Add 4 post-processing steps after OCR (no LLM needed):

1. Character confusion fix: I/1/l/| correction using cross-language
   context (if DE has "Ich", EN "1" → "I")
2. IPA dictionary replacement: detect [phonetics] brackets, look up
   correct IPA from eng_to_ipa (MIT, 134k words) — replaces OCR'd
   phonetic symbols with dictionary-correct transcription
3. Comma-split: "break, broke, broken" / "brechen, brach, gebrochen"
   → 3 individual entries when part counts match
4. Example sentence attachment: rows with EN but no DE translation
   get attached as examples to the preceding vocab entry

All fixes are deterministic and generic — no hardcoded word lists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:00:09 +01:00
Benjamin Admin
d481e0087b deps: add eng-to-ipa for IPA dictionary lookup
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 20:23:40 +01:00
Benjamin Admin
f7e0f2bb4f feat(ocr-pipeline): line breaks, hyphen rejoin & oversized row splitting
- Preserve \n between visual lines within cells (instead of joining with space)
- Rejoin hyphenated words split across line breaks (e.g. Fuß-\nboden → Fußboden)
- Split oversized rows (>1.5× median height) into sub-entries when EN/DE
  line counts match — deterministic fix for missed Step 4 row boundaries
- Frontend: render \n as <br/>, use textarea for multiline editing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 18:49:28 +01:00
Benjamin Admin
859342300e fix(ocr-pipeline): configure RapidOCR for German + tighter word detection
- Switch to PP-OCRv5 Latin model (supports ä, ö, ü, ß)
- Use SERVER model for better accuracy
- Lower Det.unclip_ratio 1.6→1.3 to reduce word merging
- Raise Det.box_thresh 0.5→0.6 for stricter detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 18:17:49 +01:00
Benjamin Admin
984dfab975 fix(ocr-pipeline): add libgl1 for RapidOCR OpenCV dependency
RapidOCR pulls in full opencv-python which requires libGL.so.1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 17:30:12 +01:00
Benjamin Admin
45435f226f feat(ocr-pipeline): line grouping fix + RapidOCR integration
Fix A: Use _group_words_into_lines() with adaptive Y-tolerance to
correctly order words in multi-line cells (fixes word reordering bug).

RapidOCR: Add as alternative OCR engine (PaddleOCR models on ONNX
Runtime, native ARM64). Engine selectable via dropdown in UI or
?engine= query param. Auto mode prefers RapidOCR when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 17:13:58 +01:00
Benjamin Admin
4ec7c20490 feat(ocr-pipeline): add rapidocr + onnxruntime to requirements
RapidOCR uses PaddleOCR models on ONNX Runtime, works natively on ARM64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 17:08:21 +01:00
Benjamin Admin
356d39d6ee fix(ocr-pipeline): use PSM 6 (block) for multi-line cell OCR in word grid
PSM 7 (single line) missed the second line in cells with two lines.
PSM 6 handles multi-line content. Also fix sort order to Y-then-X
for correct reading order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 09:40:04 +01:00
Benjamin Admin
954103cdf2 feat(ocr-pipeline): add Step 5 word recognition (grid from columns × rows)
Backend: build_word_grid() intersects column regions with content rows,
OCRs each cell with language-specific Tesseract, and returns vocabulary
entries with percent-based bounding boxes. New endpoints: POST /words,
GET /image/words-overlay, ground-truth save/retrieve for words.
Frontend: StepWordRecognition with overview + step-through labeling modes,
goToStep callback for row correction feedback loop.
MkDocs: OCR Pipeline documentation added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 02:18:29 +01:00
Benjamin Admin
203b3c0e2d fix(ocr-pipeline): mask out images in row detection horizontal projection
Build a word-coverage mask so only pixels near Tesseract word bounding
boxes contribute to the horizontal projection. Image regions (high ink
but no words) are treated as white, preventing illustrations from
merging multiple vocabulary rows into one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 01:39:20 +01:00
Benjamin Admin
04b83d5f46 feat(ocr-pipeline): add row detection step with horizontal gap analysis
Add Step 4 (row detection) between column detection and word recognition.
Uses horizontal projection profiles + whitespace gaps (same method as columns).
Includes header/footer classification via gap-size heuristics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 01:14:31 +01:00
Benjamin Admin
ce0815007e feat(ocr-pipeline): replace clustering column detection with whitespace-gap analysis
Column detection now uses vertical projection profiles to find whitespace
gaps between columns, then validates gaps against word bounding boxes to
prevent splitting through words. Old clustering algorithm extracted as
fallback (_detect_columns_by_clustering) for pages with < 2 detected gaps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 00:36:28 +01:00
Benjamin Admin
164b35c06a fix(ocr-pipeline): tighten page_ref constraints based on live testing
- Reduce left-side threshold from 35% to 20% of content width
- Strong language signal (eng/deu > 0.3) now prevents page_ref assignment
- Increase column_ignore word threshold from 3 to 8 for edge columns
- Apply language guard to Level 1 and Level 2 classification

Fixes: column with deu=0.921 was misclassified as page_ref because
reference score check ran before language analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:33:11 +01:00
Benjamin Admin
db8327f039 fix(ocr-pipeline): tune column detection based on GT comparison
Address 5 weaknesses found via ground-truth comparison on session df3548d1:
- Add column_ignore for edge columns with < 3 words (margin detection)
- Absorb tiny clusters (< 5% width) into neighbors post-merge
- Restrict page_ref to left 35% of content area across all 3 levels
- Loosen marker thresholds (width < 6%, words <= 15) and add strong
  marker score for very narrow non-edge columns (< 4%)
- Add EN/DE position tiebreaker when language signals are both weak

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:16:31 +01:00
Benjamin Admin
587b066a40 feat(ocr-pipeline): ground-truth comparison tool for column detection
Side-by-side view: auto result (readonly) vs GT editor where teacher
draws correct columns. Diff table shows Auto vs GT with IoU matching.
GT data persisted per session for algorithm tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 22:48:37 +01:00
Benjamin Admin
03fa186fec fix(ocr-pipeline): increase merge distance to 6% for better column merging
Sub-alignments within a column (indented words, etc.) were 60-90px apart
and not getting merged at 3%. On a typical 5-col page (~1500px), 6% = ~90px
merges sub-alignments while keeping real column boundaries (~300px) separate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 20:19:09 +01:00
Benjamin Admin
1040729874 fix(ocr-pipeline): avoid backslash in f-string for Python 3.11 compat
Use format() instead of nested f-strings with escaped quotes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 20:06:20 +01:00
Benjamin Admin
4f37afa222 feat(ocr-pipeline): verticality filter for column detection
Clusters now track Y-positions of their words and filter by vertical
coverage (>=30% primary, >=15%+5words secondary) to reject noise from
indentations or page numbers. Merge distance widened to 3% content width.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 19:57:13 +01:00
Benjamin Admin
bb879a03a8 feat(ocr-pipeline): add column_ignore type for margins/empty areas
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 08:51:56 +01:00
Benjamin Admin
1393a994f9 Flexible inhaltsbasierte Spaltenerkennung (2-Phasen)
Ersetzt hardcodierte Positionsregeln durch ein zweistufiges System:
Phase A erkennt Spaltengeometrie (Clustering), Phase B klassifiziert
Typen per Inhalt (Sprache/Rolle) mit 3-stufiger Fallback-Kette.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 23:33:35 +01:00
Benjamin Admin
cf27a95308 feat(ocr-pipeline): word-based 5-column detection for vocabulary pages
Replace projection-profile layout analysis with Tesseract word bounding
box clustering to detect 5-column vocabulary layouts (page_ref, EN, DE,
markers, examples). Falls back to projection profiles when < 3 clusters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 23:08:14 +01:00
Benjamin Admin
aa06ae0f61 feat: Persistente Sessions (PostgreSQL) + Spaltenerkennung (Step 3)
Sessions werden jetzt in PostgreSQL gespeichert statt in-memory.
Neue Session-Liste mit Name, Datum, Schritt. Sessions ueberleben
Browser-Refresh und Container-Neustart. Step 3 nutzt analyze_layout()
fuer automatische Spaltenerkennung mit farbigem Overlay.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 22:16:37 +01:00
Benjamin Admin
09b820efbe refactor(dewarp): replace displacement map with affine shear correction
The old displacement-map approach shifted entire rows by a parabolic
profile, creating a circle/barrel distortion. The actual problem is
a linear vertical shear: after deskew aligns horizontal lines, the
vertical column edges are still tilted by ~0.5°.

New approach:
- Detect shear angle from strongest vertical edge slope (not curvature)
- Apply cv2.warpAffine shear to straighten vertical features
- Manual slider: -2.0° to +2.0° in 0.05° steps
- Slider initializes to auto-detected shear angle
- Ground truth question: "Spalten vertikal ausgerichtet?"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:23:04 +01:00
Benjamin Admin
ff2bb79a91 fix(dewarp): change manual slider to percentage (0-200%) instead of raw multiplier
The old -3.0 to +3.0 scale multiplied the full displacement map (up to ~79px)
directly, causing extreme distortion at values >1. New slider:
- 0% = no correction
- 100% = auto-detected correction (default)
- 200% = double correction
- Step size: 5%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 18:10:34 +01:00
Benjamin Admin
fb496c5e34 perf(klausur-service): split Dockerfile into base + app layer
Tesseract OCR + 70 Debian packages + pip dependencies are now in a
separate base image (klausur-base:latest) that is built once and reused.
A --no-cache build now only rebuilds the code layer (~seconds) instead
of re-downloading 33 MB of system packages (~9 minutes).

Rebuild base when requirements.txt or system deps change:
  docker build -f klausur-service/Dockerfile.base -t klausur-base:latest klausur-service/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 17:43:24 +01:00