fix: _clean_cell_text strips currency symbols at end of line
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 31s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 1m57s
CI / test-python-agent-core (push) Successful in 20s
CI / test-nodejs-website (push) Successful in 24s
_is_noise_tail_token() classifies purely non-alphabetic tokens such as €0.50, £1, and €2.50 as OCR noise and removes them. In addition, ' '.join(tokens) destroys the proportional spacing. _clean_cell_text is therefore skipped for single-column sub-sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -393,8 +393,13 @@ def build_cell_grid_v2(
 logger.info(f"R{row_idx:02d}: 0 words (row has "
             f"{row.word_count} total, y={row.y}..{row.y+row.height})")

-# Apply noise filter
-text = _clean_cell_text(text)
+# Apply noise filter — but NOT for single-column sub-sessions:
+# 1. _clean_cell_text strips trailing non-alpha tokens (e.g. €0.50,
+#    £1, €2.50) which are valid content in box layouts.
+# 2. _clean_cell_text joins tokens with single space, destroying
+#    the proportional spacing from _words_to_spaced_text.
+if not is_single_full_column:
+    text = _clean_cell_text(text)

 cell = {
     'cell_id': f"R{row_idx:02d}_C{col_idx}",
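The two problems named in the commit message, and the guard that avoids them, can be illustrated with a minimal sketch. The real `_is_noise_tail_token` and `_clean_cell_text` are not shown in this diff, so the functions below are simplified, hypothetical stand-ins that assume only what the message states: a trailing token without any alphabetic character counts as noise, and the remaining tokens are rejoined with `' '.join(tokens)`. `finalize_cell_text` is likewise a hypothetical wrapper around the new `is_single_full_column` check.

```python
def is_noise_tail_token(token: str) -> bool:
    """Hypothetical stand-in: a token with no alphabetic character is noise."""
    return not any(ch.isalpha() for ch in token)


def clean_cell_text(text: str) -> str:
    """Hypothetical stand-in for _clean_cell_text.

    Drops trailing noise tokens, then rejoins the rest with single spaces.
    """
    tokens = text.split()
    while tokens and is_noise_tail_token(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def finalize_cell_text(text: str, is_single_full_column: bool) -> str:
    """Apply the noise filter unless the cell spans a single full column."""
    if is_single_full_column:
        # Keep prices and proportional spacing untouched.
        return text
    return clean_cell_text(text)


if __name__ == "__main__":
    raw = "Tea   small   £1"
    # Filter applied: the price is dropped and the spacing collapses.
    print(repr(finalize_cell_text(raw, is_single_full_column=False)))
    # Filter skipped: the text is returned unchanged.
    print(repr(finalize_cell_text(raw, is_single_full_column=True)))
```

In this sketch the filtered call yields `'Tea small'`, losing both the `£1` price and the column-like gaps, while the single-column call returns the input unchanged.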