Commit Graph

400 Commits

Author SHA1 Message Date
Benjamin Admin
21d37b5da1 Fix prefix matching: use alpha-only chars, min 4-char prefix
Prevents false positives where punctuation (apostrophes) in merged
tokens caused wrong dictionary matches (e.g. "'se" from "'sekandarr"
matching as a word, breaking IPA continuation row fix).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:40:37 +01:00
Benjamin Admin
19cbbf310a Improve garbled IPA cleanup: trailing strip, prefix match, broader guard
1. Strip trailing garbled IPA after proper [IPA] brackets
   (e.g. "sea [sˈiː] si:" → "sea [sˈiː]")
2. Add prefix matching for merged tokens where OCR joined headword
   with garbled IPA (e.g. "schoolbagsku:lbæg" → "schoolbag [skˈuːlbæɡ]")
3. Broaden guard to also trigger on trailing non-dictionary words
   (e.g. "scare skea" → "scare [skˈɛə]")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:36:25 +01:00
Benjamin Admin
fc0ab84e40 Fix garbled IPA in continuation rows using headword lookup
IPA continuation rows (phonetic transcription that wraps below the
headword) now get proper IPA by looking up headwords from the row
above. E.g. "ska:f – ska:vz" → "[skˈɑːf] – [skˈɑːvz]".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:28:14 +01:00
Benjamin Admin
050d410ba0 Preserve IPA continuation rows in grid output
Stop removing rows that contain only phonetic transcription below
the headword. These rows are valid content that users need to see.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 10:22:58 +01:00
Benjamin Admin
038eaf783c Only insert IPA when garbled phonetics exist in OCR text
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
_insert_missing_ipa was adding dictionary IPA to cells that had NO
phonetic transcription on the original page (e.g. "scissors" heading,
"scarf - scarves" without IPA). Now guarded by _text_has_garbled_ipa()
which checks for OCR-mangled phonetic markers (stress marks, length
marks, IPA special chars) before allowing insertion.

Rule: if a line has no phonetics, don't add any. Where garbled IPA
exists, replace it with correct IPA notation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:59:21 +01:00
Benjamin Admin
432eee3694 Auto-filter decorative margin strips and header junk
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m45s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 15s
- _filter_decorative_margin: Phase 2 now also removes short words (<=3
  chars) in the same narrow x-range as the detected single-char strip,
  catching multi-char OCR artifacts like "Vv" from alphabet graphics.
- _filter_header_junk: New filter detects the content start (first row
  with 3+ high-confidence words) and removes low-conf short fragments
  above it that are OCR artifacts from header illustrations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:38:24 +01:00
Benjamin Admin
8e4cbd84c2 Invalidate grid_editor_result when exclude regions change
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 17s
When exclude regions are saved or deleted, the cached grid result is
cleared so the grid rebuilds with updated exclusions on the next step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:19:09 +01:00
Benjamin Admin
f9d71d50d1 Add exclude region marking in Structure step
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m47s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
Users can now draw rectangles on the document image in the Structure
Detection step to mark areas (e.g. header graphics, alphabet strips)
that should be excluded from OCR results during grid building.

- Backend: PUT/DELETE endpoints for exclude regions stored in structure_result
- Backend: _build_grid_core() filters all words inside user-defined exclude regions
- Frontend: Interactive rectangle drawing with visual overlay and delete buttons
- Preserve exclude regions when re-running structure detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 09:08:30 +01:00
Benjamin Admin
c09838e91c Fix spine shadow false positives: require dark valley, brightness rise, trim convolution edges
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m54s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
The _detect_spine_shadow function was triggering on normal text content
because shadow_range > 20 was too low and convolution edge artifacts
created artificially low values. Now requires: range > 40, darkest < 180,
narrow valley (not text plateau), and brightness rise toward page content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 08:23:50 +01:00
Benjamin Admin
3fd6523872 Cut at spine center (darkest point) instead of shadow edge
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 2m5s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Refactor left/right shadow detection into shared _detect_spine_shadow()
that finds the darkest column (= book spine center) via argmin of
smoothed brightness. Both sides now cut at the spine center, ensuring
equal page sizes in double-page scans regardless of shadow position.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 07:54:33 +01:00
Benjamin Admin
e56391b0c3 Add right-edge spine shadow detection for book scans
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 37s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 22s
Mirror the left-edge shadow detection for the right side: analyze
brightness gradient in the right 25% to find scanner gray strips
from book spines. Cuts at the last bright column before the shadow
dip. Fixes cropping of book scans where the next page bleeds in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 07:41:13 +01:00
Benjamin Admin
a3e2a7f994 Add GT button to OCR overlay, prominent category picker, track pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 18s
- Ground Truth button on last step of Pipeline/Kombi modes in ocr-overlay
- Prominent category picker in active session info bar (pulses when unset)
- GT badge shown when session has ground truth reference
- Backend: auto-detect pipeline from ocr_engine, store in GT snapshot
- Pipeline info shown in GT session list and regression reports
- Also pass pipeline param from ocr-pipeline StepGroundTruth

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 14:49:02 +01:00
Benjamin Admin
f655db30e4 Add Ground Truth regression test system for OCR pipeline
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 35s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m47s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 22s
Extract _build_grid_core() from build_grid() endpoint for reuse.
New ocr_pipeline_regression.py with endpoints to mark sessions as
ground truth, list them, and run regression comparisons after code
changes. Frontend button in StepGroundTruth.tsx to mark/update GT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 13:46:48 +01:00
Benjamin Admin
c894a0feeb Improve IPA continuation row detection with phonetic heuristics
Strip IPA brackets that fix_cell_phonetics may have added for short
dictionary words (e.g. "si" → "[si]") before checking if the row is
a garbled phonetic continuation. Detect phonetic text by presence of
':' (length marks), leading apostrophe (stress marks), or absence of
any word with ≥3 letters.

Fixes Row 39 ("si: [si] — So: - si:n") not being removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 12:08:21 +01:00
Benjamin Admin
8ef4c089cf Remove IPA continuation rows and support hyphenated word lookup
- grid_editor_api: After IPA correction, detect rows containing only
  garbled phonetics in the English column (no German translation, no
  IPA brackets inserted). These are wrap-around lines where printed
  IPA extends to the line below the headword. Remove them since the
  headword row already has correct IPA.
- cv_ocr_engines: _insert_missing_ipa now tries dehyphenated form
  as fallback (e.g. "second-hand" → "secondhand") for dictionary
  lookup, fixing IPA insertion for compound words.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 12:05:38 +01:00
Benjamin Admin
821e5481c2 Only apply IPA correction on vocabulary tables (≥3 columns)
Single-column German text pages were getting IPA inserted for words
that happen to exist in the English dictionary ("die" → [dˈaɪ],
"Das" → [dɑs]). Now IPA correction only runs when the grid has ≥3
columns, which is the minimum for a vocabulary table layout
(English | article | German).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 11:50:03 +01:00
Benjamin Admin
b98ea33a3a Strip garbled OCR phonetics after IPA insertion
_insert_missing_ipa now removes garbled phonetic text (e.g. "skea",
"sku:l", "'sizaz") that follows the inserted IPA bracket. Keeps
delimiters (–, -), uppercase words (German), and known English words.

Fixes: "scare [skˈɛə] skea" → "scare [skˈɛə]"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 11:15:14 +01:00
Benjamin Admin
f139d0903e Preserve alphabetic marker columns, broaden junk filter, enable IPA in grid
- _merge_inline_marker_columns: skip merge when ≥50% of words are
  alphabetic (preserves "to", "in", "der" columns)
- Rule 2 (oversized stub): widen to ≤3 words / ≤5 chars (catches "SEA &")
- IPA phonetics: map longest-avg-text column to column_en so
  fix_cell_phonetics runs in the grid editor
- ocr_pipeline_overlays: add missing split_page_into_zones import

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 11:08:23 +01:00
Benjamin Admin
962bbbe9f6 Remove scattered debris rows and disable spanning header detection
- Add Rule 3 to junk-row filter: rows where no word is longer than
  2 chars are removed as scattered OCR debris from illustrations
- Fully disable spanning-header detection which falsely flagged IPA
  transcriptions and vocabulary entries as spanning headers
- First-row heuristic remains for genuine header detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 10:47:17 +01:00
Benjamin Admin
9da45c2a59 Fix false header detection and add decorative margin/footer filters
- Remove all_colored spanning header heuristic that falsely flagged
  colored vocabulary entries (Scotland, secondary school) as headers
- Add _filter_decorative_margin: removes vertical A-Z alphabet strips
  along page margins (single-char words in a compact vertical strip)
- Add _filter_footer_words: removes page numbers in bottom 5% of page
- Tighten spanning header rule: require ≥3 columns spanned + ≤3 words

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 10:38:20 +01:00
Benjamin Admin
64447ad352 Raise color sat_threshold from 50 to 55 to avoid scanner blue artifacts
Black text has median_sat ~6-7, green text ~63-65. At threshold 50,
scanner blue tints (median_sat ~50-54) on words like "Wasser" were
falsely classified as blue. Threshold 55 has good margin on both sides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 09:13:09 +01:00
Benjamin Admin
00cbf266cb Add oversized-stub filter for large page numbers/marks in grid rows
Rows with ≤2 words, total text ≤3 chars, and word height >1.8x median
are removed as non-content elements (e.g. red page number "( 9").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 09:05:07 +01:00
Benjamin Admin
f9bad7beaa Filter phantom rows from recovered color artifacts and low-conf OCR noise
- Apply recovered-artifact filter to ALL zones (was box-zones only)
- Filter any recovered word with text ≤ 2 chars (not just !?•·)
- Add post-grid junk-row removal: rows where all word_boxes have
  conf < 50 and text ≤ 3 chars are dropped as OCR noise

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 09:00:43 +01:00
Benjamin Admin
143e41ec76 add: ocr_pipeline_overlays.py for overlay rendering functions
Extracted 4 overlay functions (_get_structure_overlay, _get_columns_overlay,
_get_rows_overlay, _get_words_overlay) that were missing from the initial
split. Provides render_overlay() dispatcher used by sessions module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:46:49 +01:00
Benjamin Admin
ec287fd12e refactor: split ocr_pipeline_api.py (5426 lines) into 8 modules
Each module is under 1050 lines:
- ocr_pipeline_common.py (354) - shared state, cache, models, helpers
- ocr_pipeline_sessions.py (483) - session CRUD, image serving, doc-type
- ocr_pipeline_geometry.py (1025) - deskew, dewarp, structure, columns
- ocr_pipeline_rows.py (348) - row detection, box-overlay helper
- ocr_pipeline_words.py (876) - word detection (SSE), paddle-direct
- ocr_pipeline_ocr_merge.py (615) - merge helpers, kombi endpoints
- ocr_pipeline_postprocess.py (929) - LLM review, reconstruction, export
- ocr_pipeline_auto.py (705) - auto-mode orchestrator, reprocess

ocr_pipeline_api.py is now a 61-line thin wrapper that re-exports
router, _cache, and test-imported symbols for backward compatibility.
No changes needed in main.py or tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:42:00 +01:00
Benjamin Admin
98f7f7d7d5 fix: NameError in paddle_kombi/rapid_kombi cache update
The previous commit added `cached["word_result"]` but `cached` was
not defined in these functions. Changed to safely check `_cache` dict
first. Also includes sat_threshold fix (70→50) for green text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:12:01 +01:00
Benjamin Admin
a19bca6060 fix: lower color sat_threshold from 70 to 50 for green text detection
Green text words like "Insel" and "Internet" had median_sat=65, just
below the threshold of 70, causing them to be classified as black.
Black text has median_sat=6-7, so threshold=50 provides clear
separation (6-7 vs 63-65) without false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 08:00:35 +01:00
Benjamin Admin
7a76697f95 fix: always re-run structure detection instead of using cached result
The frontend was checking for an existing structure_result and reusing
it, which meant the backend fix (passing word_boxes to graphic detection)
never had a chance to run on existing sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 07:43:44 +01:00
Benjamin Admin
5359a4cc2b fix: cache word_result in paddle_kombi/rapid_kombi for detect-structure
Both kombi OCR functions wrote word_result to DB but not to the
in-memory cache. When detect-structure ran next, it found no words
and passed an empty list to graphic detection, making all word-overlap
heuristics ineffective. This caused green text words to be wrongly
classified as graphic regions.

Also adds a fallback in detect-structure to use raw OCR word lists
if cell word_boxes are empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 07:29:02 +01:00
Benjamin Admin
a25214126d fix: merge overlapping OCR words with different text (Stick/Stück)
Two issues in paddle-kombi word merge:

1. Overlap threshold too strict: PaddleOCR "Stick" and Tesseract
   "Stück" overlap at 48.6%, just below the 50% threshold. Both words
   ended up in the result, overlapping on the same position.
   Fix: lower threshold from 50% to 40%.

2. Text selection blind to confidence: always took PaddleOCR text
   even when Tesseract had higher confidence and correct text.
   Fix: when texts differ due to spatial-only match, prefer the
   engine with higher confidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 07:00:57 +01:00
Benjamin Admin
fd79d5e4fa fix: prevent grid table overflow when union columns exceed zone bbox
When union columns from multiple content zones are applied, column
boundaries can span wider than any single zone's bbox. Using
zone.bbox_px.w as the scale reference caused the total scaled width
to exceed the container, pushing the table off-screen.

Now uses the actual total column width sum as the scale reference,
guaranteeing columns always fit within the container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 19:43:00 +01:00
Benjamin Admin
19b93f7762 fix: conservative column detection + smart graphic word filter
Column detection:
- Raise MIN_COVERAGE_PRIMARY 20%→35% (prevents false columns in
  flowing text where random gaps < 35% of rows)
- Raise MIN_COVERAGE_SECONDARY 12%→20%, MIN_DISTINCT_ROWS 2→3
- Vocabulary worksheets unaffected (columns appear in >80% of rows)

Graphic word filter:
- Only remove words with OCR confidence < 50 inside graphic regions
- High-confidence words are real text, not image artifacts
- Prevents legitimate colored text from being discarded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 18:19:25 +01:00
Benjamin Admin
a079ffe8e9 fix: robust colored-text detection in graphic filter
The 25x25 dilation kernel merges nearby green words into large regions,
so pixel-overlap with OCR word boxes drops below 50%. Previous density
checks alone weren't sufficient.

New multi-layered approach:
- Count OCR word CENTROIDS inside each colored region
- ≥2 centroids → definitely text (images don't produce multiple words)
- 1 centroid + 10%+ pixel overlap → likely text
- Lower pixel overlap threshold from 50% to 40%
- Raise density+height thresholds for text-line detection
- Use INFO logging to diagnose remaining false positives

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 18:09:16 +01:00
Benjamin Admin
6e1d715d0d fix: prevent colored text from being falsely detected as graphics
Add color pixel density checks to cv_graphic_detect.py Pass 1:
- density < 20% → skip (text strokes are thin, images are filled)
- density < 30% + height < 4% page → skip (colored text line)

This fixes green headings (Insel, Internet, Inuit) being removed
as graphic regions, which also caused word reordering in lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 17:30:35 +01:00
Benjamin Admin
d66efdecf5 fix: NameError in detect_page_splits — 'gaps' var removed in rewrite
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 22s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 17:01:34 +01:00
Benjamin Admin
d36972b464 fix: detect spine by brightness, not ink density
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 19s
The previous algorithm used binary ink projection and found false
splits at normal text column gaps. The spine of a book on a scanner
has a characteristic DARK gray strip (scanner bed) flanked by bright
white paper on both sides.

New approach: column-mean brightness with heavy smoothing, looking for
a dark valley (< 88% of paper brightness) in the center region that
has bright paper on both sides.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:52:29 +01:00
Benjamin Admin
f30e526917 fix: merge nearby spine gaps + handle multi-page crop in frontend
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Backend: merge gaps within 5% of image width — the spine area may have
thin ink strips splitting one physical gap into multiple detected gaps.
Only use gaps >= 2% width as split points.

Frontend: StepCrop now handles multi_page crop responses without
crashing on missing original_size/cropped_size fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:44:32 +01:00
Benjamin Admin
438a4495c7 fix: swap 90°/270° rotation direction in orientation detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 18s
Tesseract OSD 'rotate' returns the clockwise correction needed,
but the code was applying counterclockwise for 90° and clockwise
for 270° — exactly reversed. This caused pages scanned sideways
to be flipped upside down instead of corrected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:39:15 +01:00
Benjamin Admin
902de027f4 feat: auto-detect multi-page spreads and split into sub-sessions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 19s
When a book scan (double-page spread) is detected during the crop step,
the system automatically:
1. Detects vertical center gaps (spine area) via ink density projection
2. Splits into N page sub-sessions (reusing existing sub-session mechanism)
3. Individually crops each page (removing its own borders)
4. Returns sub-session IDs for downstream pipeline processing

Detection: landscape images (w > h * 1.15), vertical gap < 15% peak
density in center region (25-75%), gap width >= 0.8% of image width.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 16:34:06 +01:00
Benjamin Admin
b1cdb2531c feat: CSS Grid editor with OCR-measured column widths and row heights
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Backend: add layout_metrics (avg_row_height_px, font_size_suggestion_px)
to build-grid response for faithful grid reconstruction.

Frontend: rewrite GridTable from HTML <table> to CSS Grid layout.
Column widths are now proportional to the OCR-measured x_min/x_max
positions. Row heights use the average content row height from the
scan. Column and row resize via drag handles (Excel-like).

Font: add Noto Sans (supports IPA characters) via next/font/google.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 13:48:47 +01:00
Benjamin Admin
ab30e8b17a feat: apply IPA phonetic correction in build-grid combo mode
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 2m1s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 18s
fix_cell_phonetics was only called in the OCR pipeline endpoints
(/words, /cells) but not in the combo mode (build-grid / ocr-overlay).
Garbled IPA like [teist] is now corrected to [teɪst] using the
IPA dictionary, same as in the pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 12:53:58 +01:00
Benjamin Admin
b0e1fbc8d6 feat: box zone artifact filter, spanning headers, parenthesis fix
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m59s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 19s
1. Filter recovered single-char artifacts (!, ?, •) from box zones
   where they are decorative noise, not real text markers

2. Detect spanning header rows (e.g. "Unit4: Bonnie Scotland") that
   stretch across multiple columns with colored text. Merge their
   cells into a single spanning cell in column 0.

3. Fix missing opening parentheses: when cell text has ")" but no
   matching "(", prepend "(" to the text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 11:31:55 +01:00
Benjamin Admin
872b47f691 fix: filter words and color recoveries inside graphic/image regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m8s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
- Load structure_result from session to get detected graphic bounds
- Exclude OCR words whose center falls inside a graphic region
- Exclude recovered colored text inside graphic regions
- Reject color recovery regions wider than 4x median word height

Fixes garbage characters (!, ?, •) in box zones and false OCR
detections (N, ?) in image areas.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 11:20:07 +01:00
Benjamin Admin
bbf0a5720e fix: require both horizontal AND vertical overlap for word dedup
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 30s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 2m11s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 18s
Previous version only checked X overlap, causing false positives for
short words like "=" and "I" that appear at similar X positions in
different rows. Now requires >=50% overlap in both dimensions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:57:44 +01:00
Benjamin Admin
29d3c1caf5 fix: deduplicate overlapping words after Paddle+Tesseract merge
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m56s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
PaddleOCR can return overlapping phrases (e.g. "von jm." and "jm. =")
that produce duplicate words after splitting. Added _deduplicate_words()
post-merge pass that removes words with same text at overlapping positions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:47:42 +01:00
Benjamin Admin
aae8a96aa2 fix: sort word_boxes in reading order (Y-grouped, then X-sorted)
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 29s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 2m0s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
Words on the same visual line can have slightly different top values
(1-6px). Sorting by (top, left) produced wrong word order in the
frontend display. Now uses _group_words_into_lines to group by Y
proximity first, then sort by X within each line.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:41:30 +01:00
Benjamin Admin
2b73d9beec fix: increase color recovery occupancy padding to prevent gap artifacts
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 28s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 18s
Colored-pixel fragments in narrow inter-word gaps were being recovered
as false characters (e.g., "!" between "lend" and "sb."), disrupting
word order. Use adaptive padding based on median word height instead
of fixed 4px.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:28:56 +01:00
Benjamin Admin
324f39a9cc fix: merge inline marker columns + improve ghost edge detection
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 29s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 16s
1. Add _merge_inline_marker_columns(): narrow columns (<80px) with
   avg word length <=2 chars (bullets, numbering) are merged into
   the adjacent text column. Fixes box zones getting 2 columns when
   bullet points are just indentation markers.

2. Improve ghost filter: check word edges (left/right/top/bottom)
   against border bands instead of center-only. Catches = at x=947
   whose left edge touches the box border.

3. Add = and + to _GRID_GHOST_CHARS for border artifact detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 10:10:07 +01:00
Benjamin Admin
febd0a2f84 fix: border ghost filter + row overlap fix for box zones
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 28s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m53s
CI / test-python-agent-core (push) Successful in 16s
CI / test-nodejs-website (push) Successful in 17s
1. Add _filter_border_ghosts() to grid editor - removes OCR artefacts
   like | sitting on box borders before row/column clustering.
   The tall | (h=55) was inflating row 0's y_max, causing row overlap.

2. Fix _assign_word_to_row() to prefer closest y_center when rows
   overlap, instead of always returning the first matching row.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 09:54:50 +01:00
Benjamin Admin
43b1f8be58 diag: increase zone logging threshold to 60 words
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 30s
CI / test-python-klausur (push) Failing after 1m48s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 17s
Box zones have 40-60 words, need to capture their diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 09:49:19 +01:00