Commit Graph

467 Commits

Author SHA1 Message Date
Benjamin Admin
7252f9a956 refactor(ocr-pipeline): use left-edge alignment approach for sub-column detection
Replace gap-based splitting with alignment-bin approach: cluster word
left-edges within 8px tolerance, find the leftmost bin with >= 10% of
words as the true column start, split off any words to its left as a
sub-column. This correctly handles both page references ("p.59") and
misread exclamation marks ("!" → "I") even when the pixel gap is small.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:56:38 +01:00
Benjamin Admin
f13116345b fix(tests): use correct bbox_pct dict format in _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:26:24 +01:00
Benjamin Admin
991984d9c3 fix(tests): pass columns_meta arg to _cells_to_vocab_entries tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:23:55 +01:00
Benjamin Admin
1a246eb059 feat(ocr-pipeline): generic sub-column detection via left-edge clustering
Detects hidden sub-columns (e.g. page references like "p.59") within
already-recognized columns by clustering word left-edge positions and
splitting when a clear minority cluster exists. The sub-column is then
classified as page_ref and mapped to VocabRow.source_page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 18:18:02 +01:00
Benjamin Admin
0532b2a797 fix(ocr-pipeline): skip edge-touching gaps in header/footer detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Gaps that extend to the image boundary (top/bottom edge) are not valid
content separators — they typically represent dewarp padding. Only gaps
with content on both sides qualify as header/footer boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 17:54:49 +01:00
Benjamin Admin
f1fcc67357 fix(ocr-pipeline): clamp gap detection to img_h to avoid dewarp padding
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m46s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
The inverted image can be taller than img_h after dewarp shear
correction, causing footer_y to be detected outside the visible page.
Now clamps the horizontal projection to actual_h = min(inv.shape[0], img_h).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 17:06:58 +01:00
Benjamin Admin
c8981423d4 feat(ocr-pipeline): distinguish header/footer vs margin_top/margin_bottom
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
Check for actual ink content in detected top/bottom regions:
- 'header'/'footer' when text is present (e.g. title, page number)
- 'margin_top'/'margin_bottom' when the region is empty page margin

Also update all skip-type sets and color maps for the new types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:55:41 +01:00
Benjamin Admin
f615c5f66d feat(ocr-pipeline): generic header/footer detection via projection gap analysis
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 16s
Replace the trivial top_y/bottom_y threshold check with horizontal
projection gap analysis that finds large whitespace gaps separating
header/footer content from the main body. This correctly detects
headers (e.g. "VOCABULARY" banners) and footers (page numbers) even
when _find_content_bounds includes them in the content area.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 16:13:48 +01:00
Benjamin Admin
a052f73de3 fix(ocr-pipeline): pass left_x/right_x to classify_column_types in API path
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m45s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
The ocr_pipeline_api.py code path called classify_column_types without
left_x/right_x, so margin regions were never created. Also add logging
to _build_margin_regions for debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 15:42:39 +01:00
Benjamin Admin
34ccdd5fd1 feat(ocr-pipeline): filter scan artifacts in content bounds and add margin regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m50s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Thin black lines (1-5px) at page edges from scanning were incorrectly
detected as content, shifting content bounds and creating spurious
IGNORE columns. This filters narrow projection runs (<1% of image
dimension) and introduces explicit margin_left/margin_right regions
for downstream page reconstruction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 15:29:18 +01:00
Benjamin Admin
e718353d9f feat(ocr-pipeline): 6 systematic improvements for robustness, performance & UX
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 21s
1. Unit tests: 76 new parametrized tests for noise filter, phonetic detection,
   cell text cleaning, and row merging (116 total, all green)
2. Continuation-row merge: detect multi-line vocab entries where text wraps
   (lowercase EN + empty DE) and merge into previous entry
3. Empty DE fallback: secondary PSM=7 OCR pass for cells missed by PSM=6
4. Batch-OCR: collect empty cells per column, run single Tesseract call on
   column strip instead of per-cell (~66% fewer calls for 3+ empty cells)
5. StepReconstruction UI: font scaling via naturalHeight, empty EN/DE field
   highlighting, undo/redo (Ctrl+Z), per-cell reset button
6. Session reprocess: POST /sessions/{id}/reprocess endpoint to re-run from
   any step, with reprocess button on completed pipeline steps

Also fixes pre-existing dewarp_image tuple unpacking bug in run_cv_pipeline
and updates dewarp tests to match current (image, info) return signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:46:38 +01:00
Benjamin Admin
c3a924a620 fix(ocr-pipeline): merge phonetic-only rows and fix bracket noise filter
Two fixes:
1. Tokens ending with ] (e.g. "serva]") were stripped by the noise
   filter because ] was not in the allowed punctuation list.
2. Rows containing only phonetic transcription (e.g. ['mani serva])
   are now merged into the previous vocab entry instead of creating
   a separate (invalid) entry. This prevents the LLM from trying
   to "correct" phonetic fragments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 14:14:20 +01:00
Benjamin Admin
650f15bc1b fix(ocr-pipeline): tolerate dictionary punctuation in noise filter
The noise filter was stripping words containing hyphens, parentheses,
slashes, and dots (e.g. "money-saver", "Schild(chen)", "(Salat-)Gurke",
"Tanz(veranstaltung)"). Now strips all common dictionary punctuation
before checking for internal noise characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 13:12:40 +01:00
Benjamin Admin
40a77a82f6 fix(ocr-pipeline): use midpoint boundaries for column word assignment
Replace containment-with-padding approach with midpoint-based column
ranges. For adjacent columns, the assignment boundary is the midpoint
between them (Voronoi-style). This prevents padding overlap where words
near column borders (e.g. "We" at the start of example sentences) were
assigned to the preceding column. The last column extends generously to
capture all rightmost text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:53:56 +01:00
Benjamin Admin
87931c35e4 fix(ocr-pipeline): stop noise filter from stripping parenthesized words
_is_noise_tail_token() treated words with unbalanced parentheses like
"selbst)" or "(wir" as OCR noise because the parenthesis counted as
"internal noise". Now strips leading/trailing parentheses before the
noise check, so legitimate words in example sentences like
"We baked ... (wir ... selbst)" are preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:51:28 +01:00
Benjamin Admin
29b1d95acc fix(ocr-pipeline): improve word-column assignment and LLM review accuracy
Word assignment: Replace nearest-center-distance with containment-first
strategy. Words whose center falls within a column's bounds (+ 15% pad)
are assigned to that column before falling back to nearest-center. This
fixes long example sentences losing their rightmost words to adjacent
columns.

LLM review: Strengthen prompt to explicitly forbid changing proper nouns,
place names, and correctly-spelled words. Add _is_spurious_change()
post-filter that rejects case-only changes and hallucinated word
replacements (< 50% character overlap).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:40:26 +01:00
Benjamin Admin
dbf0db0c13 feat(ocr-pipeline): improve LLM review UI + add reconstruction step
StepLlmReview: Show full vocab table with image overlay, row-level
status tracking (pending/active/reviewed/corrected/skipped), and
auto-scroll during SSE streaming. Load previous results on mount.

StepReconstruction: New step 7 with editable text fields at original
bbox positions over dewarped image. Zoom controls, tab navigation,
color-coded columns, save to backend.

Backend: Add POST /sessions/{id}/reconstruction endpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 12:19:21 +01:00
Benjamin Admin
2a493890b6 feat(ocr-pipeline): add SSE streaming and phonetic filter to LLM review
- Stream LLM review results batch-by-batch (8 entries per batch) via SSE
- Frontend shows live progress bar, batch log, and corrections appearing
- Skip entries with IPA phonetic transcriptions (already dictionary-corrected)
- Refactor llm_review_entries into reusable helpers for both streaming and non-streaming paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:46:06 +01:00
Benjamin Admin
e171a736e7 fix(ocr-pipeline): increase LLM timeout to 300s and disable qwen3 thinking
- Add /no_think tag to prompt (qwen3 thinking mode causes massive slowdown)
- Increase httpx timeout from 120s to 300s for large vocab tables
- Improve error logging with traceback and exception type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:31:03 +01:00
Benjamin Admin
938d1d69cf feat(ocr-pipeline): add LLM-based OCR correction step (Step 6)
Replace the placeholder "Koordinaten" step with an LLM review step that
sends vocab entries to qwen3:30b-a3b via Ollama for OCR error correction
(e.g. "8en" → "Ben"). Teachers can review, accept/reject individual
corrections in a diff table before applying them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:13:17 +01:00
Benjamin Admin
e9f368d3ec feat(ocr-pipeline): add abbreviation allowlist to noise filter
Add _KNOWN_ABBREVIATIONS set with ~150 common EN/DE abbreviations
(sth, sb, etc, eg, ie, usw, bzw, vgl, adj, adv, prep, sg, pl, ...).
Tokens matching known abbreviations are never stripped as noise.

Also handle dotted abbreviations (e.g., z.B., i.e.) that have no
2+ consecutive alpha chars by checking the abbreviation set before
the _RE_REAL_WORD filter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:46:54 +01:00
Benjamin Admin
3028f421b4 feat(ocr-pipeline): add cell text noise filter for OCR artifacts
Add _clean_cell_text() with three sub-filters to remove OCR noise:
- _is_garbage_text(): vowel/consonant ratio check for phantom row garbage
- _is_noise_tail_token(): dictionary-based trailing noise detection
- _RE_REAL_WORD check for cells with no real words (just fragments)

Handles balanced parentheses "(auf)" and trailing hyphens "under-"
as legitimate tokens while stripping noise like "Es)", "3", "ee", "B".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:19:31 +01:00
Benjamin Admin
2b1c499d54 fix(ocr-pipeline): filter OCR noise from image areas and artifacts
Two generic noise filters added to _ocr_single_cell():

1. Word confidence filter (conf < 30): removes low-confidence words
   before text assembly.  Catches trailing artifacts like "Es)" after
   real text, and standalone noise from image edges.

2. Cell noise filter: clears cells whose entire text has no real
   alphabetic word (>= 2 letters).  Catches fragments like "E:", "3",
   "u", "D", "2.77", "and )" from image areas, while keeping real
   short words like "Ei", "go", "an".

Both filters apply to word-lookup AND cell-OCR fallback results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:56:54 +01:00
Benjamin Admin
72cc77dcf4 fix(ocr-pipeline): cells = result, no post-processing content shuffling
The cell grid IS the result. Each cell stays at its detected position.

Removed _split_comma_entries and _attach_example_sentences from the
pipeline — they were shuffling content between rows/columns, causing
"Mäuse" to appear in a separate row, "stand..." to move to Example,
and "Ei" to disappear.

Now: cells → _cells_to_vocab_entries (1:1 row mapping) →
_fix_character_confusion → _fix_phonetic_brackets → done.

Also lowered pixel-density threshold from 2% to 0.5% for the cell-OCR
fallback so small text like "Ei" is not filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:41:30 +01:00
Benjamin Admin
e3f939a628 refactor(ocr-pipeline): make post-processing fully generic
Three non-generic solutions replaced with universal heuristics:

1. Cell-OCR fallback: instead of restricting to column_en/column_de,
   now checks pixel density (>2% dark pixels) for ANY column type.
   Truly empty cells are skipped without running Tesseract.

2. Example-sentence detection: instead of checking for example-column
   text (worksheet-specific), now uses sentence heuristics (>=4 words
   or ends with sentence punctuation). Short EN text without DE is
   kept as a vocab entry (OCR may have missed the translation).

3. Comma-split: re-enabled with singular/plural detection. Pairs like
   "mouse, mice" / "Maus, Mäuse" are kept together. Verb forms like
   "break, broke, broken" are still split into individual entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:27:30 +01:00
Benjamin Admin
6bca3370e0 fix(ocr-pipeline): fix vocab post-processing destroying correct cell results
Three bugs in the post-processing pipeline were overwriting correct
streaming results with wrong ones:

1. _split_comma_entries was splitting "Maus, Mäuse" into two separate
   entries. Disabled — word forms belong together.

2. _attach_example_sentences treated "Ei" (2 chars) as OCR noise due
   to `len(de) > 2` threshold. Lowered to `len(de) > 1`.

3. _attach_example_sentences wrongly classified rows with EN text but
   no DE (like "stand ...") as example sentences, merging them into
   the previous entry. Now only treats rows as examples if they also
   have no text in the example column.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:16:50 +01:00
Benjamin Admin
befc44d2dd perf(ocr-pipeline): limit cell-OCR fallback to EN/DE columns only
Skip Tesseract fallback for column_example cells which are often
legitimately empty.  This reduces ~48 Tesseract calls to ~10,
cutting Step 5 fallback time from ~13s to ~3s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:01:08 +01:00
Benjamin Admin
6db3c02db4 fix(admin-lehrer): force unique build ID to bust browser caches
Next.js was producing the same chunk hash across builds, causing
browsers to serve stale cached JS even after redeployment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:54:05 +01:00
Benjamin Admin
8f2c2e8f68 feat(ocr-pipeline): hybrid word-lookup with cell-OCR fallback
Word-lookup from full-page Tesseract is fast but can miss small or
isolated words (e.g. "Ei"). Now falls back to per-cell Tesseract OCR
for cells that remain empty after word-lookup. The ocr_engine field
reports 'cell_ocr_fallback' for cells that needed the fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:21:12 +01:00
Benjamin Admin
50ad06f43a fix(ocr-pipeline): always run fresh word detection, skip stale cache
Word-lookup is now ~0.03s (vs seconds with per-cell Tesseract), so
always re-run detection when entering Step 5 instead of showing
potentially stale cached word_result from the session DB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 08:05:13 +01:00
Benjamin Admin
2c4160e4c4 fix(ocr-pipeline): exclusive word-to-column assignment prevents duplicates
Replace per-cell word filtering (which allowed the same word to appear in
multiple columns due to padded overlap) with exclusive nearest-center
assignment. Each word is assigned to exactly one column per row.

Also use row height as Y-tolerance for text assembly so words within
the same row (e.g. "Maus, Mäuse") are always grouped on one line.

Fixes: words leaking into wrong columns, missing words, duplicate words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:54:45 +01:00
Benjamin Admin
9bbde1c03e fix(ocr-pipeline): re-populate row.words for word-lookup in Step 5
The row_result stored in DB excludes words to keep payload small.
When Step 5 reconstructs RowGeometry from DB, words were empty,
causing word-lookup to find nothing and return blank cells.

Now re-populates row.words from cached _word_dicts (or re-runs
detect_column_geometry if cache is cold) before cell grid building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:38:33 +01:00
Benjamin Admin
77869e32f4 feat(ocr-pipeline): use word-lookup instead of cell-OCR for cell grid
Replace per-cell Tesseract re-runs with lookup of pre-existing full-page
words from row.words. Words are filtered by X-overlap with column bounds.
This fixes phantom rows with garbage text, missing last words, and
incomplete example text by using the more reliable full-page OCR results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:24:46 +01:00
Benjamin Admin
89b5f49918 fix(ocr-pipeline): filter phantom rows with word_count=0 from cell grid
Rows in inter-line whitespace gaps have no Tesseract words assigned but
were still processed by build_cell_grid, producing garbage OCR output.
Filter these phantom rows using the word_count field set during Step 4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 18:40:13 +01:00
Benjamin Admin
7f27783008 feat(ocr-pipeline): add SSE streaming for word recognition (Step 5)
Cells now appear one-by-one in the UI as they are OCR'd, with a live
progress bar, instead of waiting for the full result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:54:20 +01:00
Benjamin Admin
a666e883da fix(ocr-pipeline): exclude header/footer/page_ref from cell grid columns
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:33:48 +01:00
Benjamin Admin
27b895a848 feat(ocr-pipeline): generic cell-grid with optional vocab mapping
Extract build_cell_grid() as layout-agnostic foundation from
build_word_grid(). Step 5 now produces a generic cell grid (columns x
rows) and auto-detects whether vocab layout is present. Frontend
dynamically switches between vocab table (EN/DE/Example) and generic
cell table based on layout type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:22:56 +01:00
Benjamin Admin
3bcb7aa638 fix(ocr-pipeline): remove overzealous grid row count validation
The validation that rejected word-center grid when it produced more rows
than gap-based detection was causing fallback to gap-based rows (large
boxes). The word-center grid regularization works correctly after the
center-based grouping and cluster merging fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 13:01:27 +01:00
Benjamin Admin
c4f2e6554e fix(ocr-pipeline): prevent grid from producing more rows than gap-based
Two fixes:
1. Grid validation: reject word-center grid if it produces MORE rows
   than gap-based detection (more rows = lines were split = worse).
   Falls back to gap-based rows in that case.

2. Words overlay: draw clean grid cells (column × row intersections)
   instead of padded entry bboxes. Eliminates confusing double lines.
   OCR text labels are placed inside the grid cells directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:52:41 +01:00
Benjamin Admin
8e861e5a4d fix(ocr-pipeline): use gap-based row height for cluster tolerance
The y_tolerance for word-center clustering was based on median word
height (21px → 12px tolerance), which was too small. Words on the
same line can have centers 15-20px apart due to different heights.

Now uses 40% of the gap-based median row height as tolerance (e.g.
40px row → 16px tolerance), and 30% for merge threshold. This
produces correct cluster counts matching actual text lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:34:15 +01:00
Benjamin Admin
4970ca903e fix(ocr-pipeline): invalidate downstream results when steps are re-run
When columns change (Step 3), invalidate row_result and word_result.
When rows change (Step 4), invalidate word_result.
This ensures Step 5 always uses the latest row boundaries instead of
showing stale cached word_result from a previous run.

Applies to both auto-detection and manual override endpoints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:24:44 +01:00
Benjamin Admin
97d4355aa9 fix(ocr-pipeline): group words by vertical center, merge close clusters
Fix half-height rows caused by tall special characters (brackets, IPA
symbols) being split into separate line clusters:

- Group words by vertical CENTER instead of TOP position, so tall
  characters on the same line stay in one cluster
- Filter outlier-height words (>2× median) when computing letter_h
  so brackets/IPA don't skew the row height
- Merge clusters closer than 0.4× median word height (definitely
  same text line despite slight center differences)
- Increased y_tolerance from 0.5× to 0.6× median word height
- Enhanced logging with cluster merge count and row height range

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:14:42 +01:00
Benjamin Admin
8ad5823fd8 feat(ocr-pipeline): word-center grid with section-break detection
Replace rigid uniform grid with bottom-up approach that derives row
boundaries from word vertical centers:
- Group words into line clusters, compute center_y per cluster
- Compute pitch (distance between consecutive centers)
- Detect section breaks where gap > 1.8× median pitch
- Place row boundaries at midpoints between consecutive centers
- Per-section local pitch adapts to heading/paragraph spacing
- Validate ≥85% word placement, fallback to gap-based rows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 12:04:08 +01:00
Benjamin Admin
ec47045c15 feat(ocr-pipeline): uniform grid regularization for row detection (Step 7)
Replace _split_oversized_rows() with _regularize_row_grid(). When ≥60%
of content rows have consistent height (±25% of median), overlay a
uniform grid with the standard row height over the entire content area.
This leverages the fact that books/vocab lists use constant row heights.

Validates grid by checking ≥85% of words land in a grid row. Falls back
to gap-based rows if heights are too irregular or words don't fit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:50:50 +01:00
Benjamin Admin
ba65e47654 feat(ocr-pipeline): move oversized row splitting from Step 5 to Step 4
Implement _split_oversized_rows() in detect_row_geometry() (Step 7) to
split content rows >1.5× median height using local horizontal projection.
This produces correctly-sized rows before word OCR runs, instead of
working around the issue in Step 5 with sub-cell splitting hacks.

Removed Step 5 workarounds: _split_oversized_entries(), sub-cell
splitting in build_word_grid(), and median_row_h calculation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:46:18 +01:00
Benjamin Admin
8507e2e035 fix(ocr-pipeline): split oversized cells before OCR to capture all text
For cells taller than 1.5× median row height, split vertically into
sub-cells and OCR each separately. This fixes RapidOCR losing text
at the bottom of tall cells (e.g. "floor/Fußboden" below "egg/Ei"
in a merged row). Generic fix — works for any oversized cell.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:32:10 +01:00
Benjamin Admin
854d8b431b feat(rag-qa): add 14 missing PDF mappings for EDPB, ENISA, EDPS, TMG, UrhG
Adds entries for all regulation codes in REGULATIONS_IN_RAG that were
missing from RAG_PDF_MAPPING, fixing "Kein PDF-Mapping" messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:10:09 +01:00
Benjamin Admin
f2521d2b9e feat(ocr-pipeline): British/American IPA pronunciation choice
- Integrate Britfone dictionary (MIT, 15k British English IPA entries)
- Add pronunciation parameter: 'british' (default) or 'american'
- British uses Britfone (Received Pronunciation), falls back to CMU
- American uses eng_to_ipa/CMU, falls back to Britfone
- Frontend: dropdown to switch pronunciation, default = British
- API: ?pronunciation=british|american query parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 11:08:52 +01:00
Benjamin Admin
954d21e469 fix: use local Inter font to avoid Google Fonts timeout in Docker build
The Docker container cannot reach Google Fonts, causing build failures.
Switch to bundled local font file using next/font/local.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:26:34 +01:00
Benjamin Admin
010616be5a fix(ocr-pipeline): generic example attachment + cell padding
1. Semantic example matching: instead of attaching example sentences
   to the immediately preceding entry, find the vocab entry whose
   English word(s) appear in the example. "a broken arm" → matches
   "broken" via word overlap, not "egg/Ei". Uses stem matching for
   word form variants (break/broken share stem "bro").

2. Cell padding: add 8px padding to each cell region so words at
   column/row edges don't get clipped by OCR (fixes "er wollte"
   missing at cell boundaries).

3. Treat very short DE text (≤2 chars) as OCR noise, not real
   translation — prevents false positives in example detection.

All fixes are generic and deterministic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:24:28 +01:00