fix: preserve pipe syllable dividers + detect alphabet sidebar columns

1. Pipe divider fix: Changed OCR char-confusion regex so | between
   letters (Ka|me|rad) is NOT converted to I. Only standalone/word-boundary
   pipes are converted (|ch → Ich, | want → I want).

2. Alphabet sidebar detection improvements:
   - _filter_decorative_margin() now considers 2-char words (OCR reads
     "Aa", "Bb" from sidebars), lowered min strip from 8→6
   - _filter_border_strip_words() lowered decorative threshold from 50%→45%
   - New step 4f: grid-level thin-edge-column filter as safety net;
     removes edge columns with <35% fill rate and >60% short text

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
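
The diff for step 4f is not part of this view, so the following is only a minimal sketch of what a thin-edge-column filter with the stated thresholds might look like. The 35% and 60% thresholds come from the message above; the function names, the grid representation (a rectangular list of rows of cell strings), and the definition of "short" as at most 2 characters are assumptions made for illustration.

```python
from typing import List

FILL_RATE_MAX = 0.35   # from the commit message: fill rate below 35%
SHORT_TEXT_MIN = 0.60  # from the commit message: more than 60% short text
SHORT_LEN = 2          # assumption: "short" means <= 2 chars, e.g. "Aa", "Bb"


def is_decorative_edge_column(cells: List[str]) -> bool:
    """Return True if a column is sparse and mostly short text (hypothetical helper)."""
    filled = [c.strip() for c in cells if c and c.strip()]
    if not cells or not filled:
        return False  # a completely empty column is left to other steps
    fill_rate = len(filled) / len(cells)
    short_share = sum(len(c) <= SHORT_LEN for c in filled) / len(filled)
    return fill_rate < FILL_RATE_MAX and short_share > SHORT_TEXT_MIN


def drop_thin_edge_columns(grid: List[List[str]]) -> List[List[str]]:
    """Drop the first and/or last column if it matches the heuristic."""
    if not grid or len(grid[0]) < 3:
        return grid  # assumption: never shrink grids with fewer than 3 columns
    n_cols = len(grid[0])
    drop = {i for i in (0, n_cols - 1)
            if is_decorative_edge_column([row[i] for row in grid])}
    return [[cell for i, cell in enumerate(row) if i not in drop] for row in grid]


# A 10-row grid whose left edge column holds only three sidebar letters.
grid = [["Aa" if r in (0, 4, 8) else "", f"word{r}", f"Wort{r}"] for r in range(10)]
cleaned = drop_thin_edge_columns(grid)
```

Only the two outermost columns are checked here, which matches the "edge columns" wording; interior columns are never removed by this sketch.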
@@ -481,8 +481,9 @@ _CHAR_CONFUSION_RULES = [
     (re.compile(r'\b1([a-z])'), r'I\1'), # 1ch → Ich, 1want → Iwant
     # Standalone "1" → "I" (English pronoun), but NOT "1." or "1," (list number)
     (re.compile(r'(?<!\d)\b1\b(?![\d.,])'), 'I'), # "1 want" → "I want"
-    # "|" → "I", but NOT "|." or "|," (those are "1." list prefixes → spell-checker handles them)
-    (re.compile(r'(?<!\|)\|(?!\||[.,])'), 'I'), # |ch → Ich, | want → I want
+    # "|" → "I", but NOT when embedded between letters (syllable divider: Ka|me|rad)
+    # and NOT "|." or "|," (those are "1." list prefixes → spell-checker handles them)
+    (re.compile(r'(?<![a-zA-ZäöüÄÖÜß])\|(?!\||[.,])'), 'I'), # |ch → Ich, | want → I want
 ]

 # Cross-language indicators: if DE has these, EN "1" is almost certainly "I"
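
To make the effect of the hunk above concrete, here is a small sketch that applies the old and the new pipe rule to a few sample strings. Both patterns are copied from the diff; the names `OLD_PIPE_RULE`, `NEW_PIPE_RULE`, and `apply_rule` are illustrative and do not appear in the codebase.

```python
import re

# Patterns copied verbatim from the removed and added lines of the hunk.
OLD_PIPE_RULE = re.compile(r'(?<!\|)\|(?!\||[.,])')
NEW_PIPE_RULE = re.compile(r'(?<![a-zA-ZäöüÄÖÜß])\|(?!\||[.,])')


def apply_rule(rule: "re.Pattern[str]", text: str) -> str:
    """Replace every pipe matched by `rule` with a capital I."""
    return rule.sub('I', text)


samples = ["Ka|me|rad", "|ch", "| want", "|. first", "||"]
for text in samples:
    old = apply_rule(OLD_PIPE_RULE, text)
    new = apply_rule(NEW_PIPE_RULE, text)
    print(f"{text!r:14} old={old!r:14} new={new!r}")
```

With the new rule, `Ka|me|rad` is left unchanged (the old rule produced `KaImeIrad`), while `|ch` and `| want` still become `Ich` and `I want`, and `|. first` is still skipped. One side effect worth checking: the new lookbehind no longer excludes a preceding pipe, so in `||` the second pipe now matches and the string becomes `|I`, whereas the old rule left `||` untouched.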