fix: _clean_cell_text strips currency symbols at end of line

_is_noise_tail_token() classifies purely non-alphabetic tokens such as
€0.50, £1, €2.50 as OCR noise, so they are stripped from the end of the
cell text. In addition, ' '.join(tokens) destroys the proportional spacing.

_clean_cell_text is now skipped for single-column sub-sessions.
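
The failure mode can be reproduced with a minimal sketch. The two helpers below are simplified, hypothetical stand-ins written for illustration only; the real _is_noise_tail_token and _clean_cell_text are not shown in this commit and may apply further rules. Only the guard on is_single_full_column mirrors the actual change.

```python
def is_noise_tail_token(token: str) -> bool:
    """Hypothetical stand-in: a token with no alphabetic character counts as noise."""
    return not any(ch.isalpha() for ch in token)


def clean_cell_text(text: str) -> str:
    """Hypothetical stand-in: drop trailing noise tokens, rejoin with single spaces."""
    tokens = text.split()
    while tokens and is_noise_tail_token(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def cell_text(text: str, is_single_full_column: bool) -> str:
    """Mirror the guard from this commit: skip cleaning for single-column sub-sessions."""
    if is_single_full_column:
        return text
    return clean_cell_text(text)


row = "Coffee      €2.50"  # proportional spacing as produced upstream (assumed input)

print(repr(cell_text(row, is_single_full_column=False)))  # 'Coffee'
print(repr(cell_text(row, is_single_full_column=True)))   # 'Coffee      €2.50'
```

With the filter applied, the price is dropped and the column gap collapses; with the guard, the text passes through unchanged.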

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-10 09:41:25 +01:00
parent 13510b62cc
commit 964c916a81


@@ -393,8 +393,13 @@ def build_cell_grid_v2(
 logger.info(f"R{row_idx:02d}: 0 words (row has "
             f"{row.word_count} total, y={row.y}..{row.y+row.height})")
-# Apply noise filter
-text = _clean_cell_text(text)
+# Apply noise filter, but NOT for single-column sub-sessions:
+# 1. _clean_cell_text strips trailing non-alpha tokens (e.g. €0.50,
+#    £1, €2.50) which are valid content in box layouts.
+# 2. _clean_cell_text joins tokens with single space, destroying
+#    the proportional spacing from _words_to_spaced_text.
+if not is_single_full_column:
+    text = _clean_cell_text(text)
 cell = {
     'cell_id': f"R{row_idx:02d}_C{col_idx}",