fix: _clean_cell_text strips currency symbols at end of line

_is_noise_tail_token() classifies purely non-alphabetic tokens such as €0.50, £1, €2.50 as OCR noise and removes them. In addition, ' '.join(tokens) destroys the proportional spacing. _clean_cell_text is now skipped for single-column sub-sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
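
The behaviour described in the message can be reproduced with a minimal sketch. The functions below are hypothetical stand-ins, not the repository's actual _is_noise_tail_token or _clean_cell_text, whose bodies are not part of this diff. They only assume what the message states: trailing tokens without any alphabetic character are dropped, and the remaining tokens are rejoined with a single space.

```python
def is_noise_tail_token(token: str) -> bool:
    """Hypothetical stand-in: a token with no alphabetic character counts as noise."""
    return not any(ch.isalpha() for ch in token)


def clean_cell_text(text: str) -> str:
    """Hypothetical stand-in: drop trailing noise tokens, rejoin with single spaces."""
    tokens = text.split()
    while tokens and is_noise_tail_token(tokens[-1]):
        tokens.pop()
    return ' '.join(tokens)


def cell_text(text: str, is_single_full_column: bool) -> str:
    """Mirror the guard from the diff: skip the filter for single-column sub-sessions."""
    if not is_single_full_column:
        text = clean_cell_text(text)
    return text


spaced = "Espresso      €2.50"
# With the filter, the price is lost and the spacing is gone.
print(repr(cell_text(spaced, is_single_full_column=False)))
# With the guard, the text passes through unchanged.
print(repr(cell_text(spaced, is_single_full_column=True)))
```

In this sketch the filtered call returns only the word before the price, while the guarded call keeps both the price and the wide gap in front of it, which is the effect the fix aims for in box layouts.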
2026-03-10 09:41:25 +01:00
parent 13510b62cc
commit 964c916a81
1 changed file with 7 additions and 2 deletions
@@ -393,8 +393,13 @@ def build_cell_grid_v2(
                        logger.info(f"R{row_idx:02d}: 0 words (row has "
                                    f"{row.word_count} total, y={row.y}..{row.y+row.height})")

-                # Apply noise filter
-                text = _clean_cell_text(text)
+                # Apply noise filter — but NOT for single-column sub-sessions:
+                # 1. _clean_cell_text strips trailing non-alpha tokens (e.g. €0.50,
+                #    £1, €2.50) which are valid content in box layouts.
+                # 2. _clean_cell_text joins tokens with single space, destroying
+                #    the proportional spacing from _words_to_spaced_text.
+                if not is_single_full_column:
+                    text = _clean_cell_text(text)

                cell = {
                    'cell_id': f"R{row_idx:02d}_C{col_idx}",