7 Commits

Author SHA1 Message Date
Benjamin Admin
21b69e06be Fix cross-column word assignment by splitting OCR merge artifacts
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 47s
CI / test-go-edu-search (push) Successful in 36s
CI / test-python-klausur (push) Failing after 2m21s
CI / test-python-agent-core (push) Successful in 19s
CI / test-nodejs-website (push) Successful in 23s
When OCR merges adjacent words from different columns into one word box
(e.g. "sichzie" spanning Col 1+2, "dasZimmer" crossing boundary), the
grid builder assigned the entire merged word to one column.

New _split_cross_column_words() function splits these at column
boundaries using case transitions and spellchecker validation to
avoid false positives on real words like "oder", "Kabel", "Zeitung".
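
The splitting idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_split_cross_column_words()` implementation: the `WordBox` type, the `is_known_word` callback (standing in for the spellchecker), and the uniform character width estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    # Hypothetical word box: text plus horizontal extent in pixels.
    text: str
    left: float
    width: float


def split_cross_column_word(box, boundary_x, is_known_word):
    """Split a box that straddles boundary_x into two boxes, if plausible."""
    right = box.left + box.width
    # Only boxes that actually cross the column boundary are candidates.
    if not (box.left < boundary_x < right) or len(box.text) < 2:
        return [box]
    # A real word (e.g. "oder", "Kabel") is never split.
    if is_known_word(box.text.lower()):
        return [box]
    # Assumption: characters are roughly equally wide.
    char_w = box.width / len(box.text)
    guess = round((boundary_x - box.left) / char_w)
    # Try lower-to-upper case transitions first, then positions near the boundary.
    candidates = [
        i for i in range(1, len(box.text))
        if box.text[i - 1].islower() and box.text[i].isupper()
    ]
    candidates += sorted(range(1, len(box.text)), key=lambda i: abs(i - guess))
    for i in candidates:
        head, tail = box.text[:i], box.text[i:]
        # Both halves must validate, otherwise the split is rejected.
        if is_known_word(head.lower()) and is_known_word(tail.lower()):
            split_x = box.left + i * char_w
            return [
                WordBox(head, box.left, split_x - box.left),
                WordBox(tail, split_x, right - split_x),
            ]
    return [box]


vocab = {"das", "zimmer", "oder", "kabel", "zeitung"}
parts = split_cross_column_word(WordBox("dasZimmer", 100, 90), 130, vocab.__contains__)
print([p.text for p in parts])
```

Requiring both halves to pass the dictionary check is what keeps ordinary words that happen to sit on a boundary intact.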

Regression: 12/12 ground-truth (GT) sessions pass with diff=+0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 10:54:41 +01:00
Benjamin Admin
a8773d5b00 Fix 4 Grid Editor bugs: syllable modes, heading detection, word gaps
1. Syllable "Original" (auto) mode: only normalize cells that already
   have | from OCR — don't add new syllable marks via pyphen to words
   without printed dividers on the original scan.

2. Syllable "Aus" (none) mode: strip residual | chars from OCR text
   so cells display clean (e.g. "Zel|le" → "Zelle").

3. Heading detection: add text length guard in single-cell heuristic —
   words > 4 alpha chars starting lowercase (like "zentral") are regular
   vocabulary, not section headings.

4. Word-gap merge: new merge_word_gaps_in_zones() step with relaxed
   threshold (6 chars) fixes OCR splits like "zerknit tert" → "zerknittert".
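
Three of the fixes above (items 2 to 4) can be illustrated with small helpers. These are hedged sketches only: the function names, the `is_known_word` callback, and the reading of the "6 chars" threshold as a maximum fragment length are assumptions, not the code from this commit.

```python
def strip_syllable_marks(text):
    """'Aus' (none) mode: drop residual | dividers left over from OCR."""
    return text.replace("|", "")


def looks_like_vocabulary(text):
    """Length guard: more than 4 letters and a lowercase start is not a heading."""
    alpha_count = sum(1 for c in text if c.isalpha())
    return alpha_count > 4 and text[:1].islower()


def merge_word_gaps(tokens, is_known_word, max_fragment_len=6):
    """Join adjacent OCR fragments when only the joined form is a known word."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            joined = a + b
            short_enough = min(len(a), len(b)) <= max_fragment_len
            both_valid = is_known_word(a.lower()) and is_known_word(b.lower())
            if short_enough and is_known_word(joined.lower()) and not both_valid:
                merged.append(joined)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged


vocab = {"zerknittert", "das", "haus"}
print(strip_syllable_marks("Zel|le"))
print(looks_like_vocabulary("zentral"))
print(merge_word_gaps(["zerknit", "tert"], vocab.__contains__))
```

Skipping the merge when both fragments are already valid words is a conservative choice in this sketch, so that neighbouring real words are not fused.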

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-27 15:24:35 +01:00
Benjamin Admin
e019dde01b Extract page number as metadata instead of silently removing it
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 27s
CI / test-go-edu-search (push) Successful in 36s
CI / test-python-klausur (push) Failing after 2m9s
CI / test-python-agent-core (push) Successful in 17s
CI / test-nodejs-website (push) Successful in 21s
_filter_footer_words now returns page number info (text, y_pct, number)
instead of just removing footer words. The page number is included in
the grid result as `page_number` and displayed in the frontend summary
bar as "S. 233".

This preserves page numbers for later page concatenation in the
customer frontend while still removing them from the grid content.
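
A minimal sketch of the changed return shape, assuming words are dicts with `text` and `y_pct` keys and that the footer region starts at a fixed vertical percentage (the `footer_y_pct` default and the digits-only pattern are illustrative assumptions, not the real `_filter_footer_words` logic):

```python
import re


def filter_footer_words(words, footer_y_pct=92.0):
    """Return (kept_words, page_number_info); page_number_info may be None."""
    kept = []
    page_number = None
    for word in words:
        text = word["text"].strip()
        in_footer = word["y_pct"] >= footer_y_pct
        if in_footer and re.fullmatch(r"\d{1,4}", text):
            # Keep the first numeric footer word as metadata instead of discarding it.
            if page_number is None:
                page_number = {
                    "text": text,
                    "y_pct": word["y_pct"],
                    "number": int(text),
                }
            continue
        kept.append(word)
    return kept, page_number


words = [
    {"text": "Zelle", "y_pct": 40.0},
    {"text": "233", "y_pct": 96.5},
]
kept, page_number = filter_footer_words(words)
grid_result = {"words": kept, "page_number": page_number}
label = f"S. {page_number['number']}" if page_number else ""
print(label)
```

Returning the metadata alongside the filtered words lets the caller attach `page_number` to the grid result while the grid content itself stays free of footer text.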

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 08:52:09 +01:00
Benjamin Admin
a73ddce43d Fix missing PageZone import in grid_editor_helpers.py
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 15s
The zone merging function used PageZone but the import was only
in grid_editor_api.py. Caused NameError on sessions that trigger
zone merging (e.g. original_scan_b59a1b1b).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 22:04:21 +01:00
Benjamin Admin
76cd1ac020 Fix false headers on sparse layouts and IPA corruption on German text
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 33s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m55s
CI / test-python-agent-core (push) Successful in 14s
CI / test-nodejs-website (push) Successful in 17s
1. Header detection: Add 25% cap to single-cell heading heuristic.
   On German synonym dicts where most rows naturally have only 1
   content cell, the old logic marked 60%+ of rows as headers.

2. IPA de/all mode: Use "column_text" (light processing) for non-
   English columns instead of "column_en" (full processing). The
   full path runs _insert_missing_ipa() which splits on whitespace,
   matches English prefixes ("bildschön" → "bild"), and truncates
   the rest — destroying German comma-separated synonym lists.
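
Both changes can be sketched as small guards. The helper names are hypothetical, and treating the 25% cap as "discard all single-cell heading candidates when they exceed a quarter of the rows" is one plausible reading of the message, not a confirmed description of the code.

```python
def cap_single_cell_headings(candidate_rows, total_rows, max_ratio=0.25):
    """Reject the heuristic's output when too many rows would become headings."""
    if total_rows == 0:
        return []
    if len(candidate_rows) / total_rows > max_ratio:
        # Sparse layout (e.g. a synonym dictionary): single-cell rows are normal.
        return []
    return list(candidate_rows)


def choose_processing_mode(column_lang):
    """Full IPA processing for English columns only, light processing otherwise."""
    return "column_en" if column_lang == "en" else "column_text"


print(cap_single_cell_headings([0, 5], total_rows=20))
print(cap_single_cell_headings(list(range(12)), total_rows=20))
print(choose_processing_mode("de"))
```

With 12 of 20 rows flagged (60%), the cap drops every candidate, which matches the failure case the message describes for German synonym dictionaries.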

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 21:49:05 +01:00
Benjamin Admin
52b66ebe07 Fix NameError: _text_has_garbled_ipa not imported in grid_editor_helpers
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 26s
CI / test-python-klausur (push) Failing after 1m52s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
After refactoring grid_editor_api.py into helpers, the function
_text_has_garbled_ipa was used in _detect_heading_rows_by_single_cell
but never imported from cv_ocr_engines. This caused HTTP 500 on
build-grid for sessions that trigger single-cell heading detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 15:11:29 +01:00
Benjamin Admin
12b4c61bac refactor: extract grid helpers + generic CV-gated syllable insertion
1. Extracted 1367 lines of helper functions from grid_editor_api.py
   (3051→1620 lines) into grid_editor_helpers.py (filters, detectors,
   zone grid building).

2. Created cv_syllable_detect.py with generic CV+pyphen logic:
   - Checks EVERY word_box for vertical pipe lines (not just first word)
   - No article-column dependency — works with any dictionary layout
   - CV morphological detection gates pyphen insertion

3. Grid editor scroll: calc(100vh - 200px) for reliable scrolling.
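
The gating idea from item 2 can be sketched without any image library. This is an illustration under stated assumptions, not the contents of `cv_syllable_detect.py`: word crops are modelled as lists of 0/1 rows, the column-coverage test stands in for the real morphological detection, the gate is applied per word, and `hyphenate` is a caller-supplied function (for example a thin wrapper around pyphen).

```python
def has_vertical_pipe(binary, min_height_ratio=0.6, max_width=3):
    """True if a narrow run of columns is inked over most of the crop height."""
    if not binary or not binary[0]:
        return False
    height, width = len(binary), len(binary[0])
    tall = [
        sum(row[x] for row in binary) >= min_height_ratio * height
        for x in range(width)
    ]
    run = 0
    for x in range(width + 1):
        if x < width and tall[x]:
            run += 1
        else:
            # A run of tall columns just ended; a pipe is narrow, a block is not.
            if 0 < run <= max_width:
                return True
            run = 0
    return False


def insert_syllables_if_marked(word_boxes, hyphenate):
    """Apply hyphenate() only to words whose crop shows a printed divider."""
    return [
        hyphenate(text) if has_vertical_pipe(image) else text
        for text, image in word_boxes
    ]


pipe_crop = [[0, 0, 0, 1, 0, 0, 0] for _ in range(5)]
plain_crop = [[0] * 7 for _ in range(5)]
table = {"Zelle": "Zel|le", "Haus": "Haus"}
result = insert_syllables_if_marked(
    [("Zelle", pipe_crop), ("Haus", plain_crop)],
    lambda text: table.get(text, text),
)
print(result)
```

A coverage test this simple would also fire on tall thin letters such as "l", so a real detector needs a stricter criterion; the sketch only shows how the visual check gates the hyphenation step.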

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 14:39:33 +01:00