Apply merged-word splitting to grid-editor cells
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 42s
CI / test-go-edu-search (push) Successful in 44s
CI / test-python-klausur (push) Failing after 2m28s
CI / test-python-agent-core (push) Successful in 32s
CI / test-nodejs-website (push) Successful in 32s

The spell review only runs on vocab entries, but the OCR pipeline's
grid-editor cells also contain merged words (e.g. "atmyschool").
Now splits merged words directly in the grid-build finalization step,
right before returning the result. Uses the same _try_split_merged_word()
dictionary-based DP algorithm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-04-11 14:52:00 +02:00
parent 53b0d77853
commit 6e494a43ab

View File

@@ -1678,6 +1678,38 @@ async def _build_grid_core(
if "|" in t:
cell["text"] = t.replace("|", "")
# --- Split merged words (OCR sometimes glues adjacent words) ---
# Uses dictionary lookup to split e.g. "atmyschool" → "at my school"
try:
from cv_review import _try_split_merged_word, _SPELL_AVAILABLE
if _SPELL_AVAILABLE:
split_count = 0
for z in zones_data:
for cell in z.get("cells", []):
text = cell.get("text", "")
if not text:
continue
parts = []
changed = False
for token in text.split():
# Only try splitting pure-alpha tokens > 7 chars
clean = token.rstrip(".,!?;:'\")")
suffix = token[len(clean):]
if len(clean) > 7 and clean.isalpha():
split = _try_split_merged_word(clean)
if split:
parts.append(split + suffix)
changed = True
continue
parts.append(token)
if changed:
cell["text"] = " ".join(parts)
split_count += 1
if split_count:
logger.info("build-grid session %s: split %d merged words", session_id, split_count)
except ImportError:
pass
# Clean up internal flags before returning
for z in zones_data:
for cell in z.get("cells", []):