Lower syllable pipe-ratio threshold from 5% to 1%
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Real dictionary pages have only ~3% OCR-detected pipes because the thin
syllable divider lines are hard for OCR to read. The primary
false-positive guard (article_col_index check) already blocks synonym
dictionaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -200,10 +200,9 @@ def insert_syllable_dividers(
     For dictionary pages: process all content column cells, strip existing
     pipes, merge pipe-gap spaces, and re-syllabify using pyphen.
 
-    Pre-check: at least 5% of content cells must already contain ``|`` from
-    OCR. This guards against false-positive dictionary detection on pages
-    like synonym dictionaries or alphabetical word lists that have no actual
-    syllable divider lines.
+    Pre-check: at least 1% of content cells must already contain ``|`` from
+    OCR. This guards against pages with zero pipe characters (the primary
+    guard — article_col_index — is checked at the call site).
 
     Returns the number of cells modified.
     """
@@ -227,10 +226,10 @@ def insert_syllable_dividers(
 
     if total_col_cells > 0:
         pipe_ratio = cells_with_pipes / total_col_cells
-        if pipe_ratio < 0.05:
+        if pipe_ratio < 0.01:
             logger.info(
                 "build-grid session %s: skipping syllable insertion — "
-                "only %.1f%% of cells have existing pipes (need >=5%%)",
+                "only %.1f%% of cells have existing pipes (need >=1%%)",
                 session_id, pipe_ratio * 100,
             )
             return 0
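
The effect of the threshold change can be illustrated with a minimal, standalone sketch. It is not the code from insert_syllable_dividers: the helper names pipe_ratio and passes_precheck and the flat list-of-strings cell representation are assumptions made for illustration only. It shows why a page where OCR found pipes in about 3% of cells was skipped under the 5% threshold but is processed under the 1% threshold.

```python
def pipe_ratio(cells):
    """Fraction of cells whose text contains a '|' character.

    `cells` is a hypothetical flat list of cell strings, not the
    grid structure used by the real function.
    """
    if not cells:
        return 0.0
    cells_with_pipes = sum(1 for text in cells if "|" in text)
    return cells_with_pipes / len(cells)


def passes_precheck(cells, min_ratio=0.01):
    """Return True when syllable insertion would proceed.

    Assumption: as in the diff, the ratio check only applies when there
    is at least one cell; an empty column is not rejected by this check.
    """
    if not cells:
        return True
    return pipe_ratio(cells) >= min_ratio


# A page with 100 cells where OCR caught the divider in only 3 of them.
page = ["Sil|be"] * 3 + ["Silbe"] * 97

old_result = passes_precheck(page, min_ratio=0.05)  # old 5% threshold
new_result = passes_precheck(page, min_ratio=0.01)  # new 1% threshold
no_pipes = passes_precheck(["Wort"] * 100)          # zero pipes: still skipped
```

Under these assumptions, old_result is False and new_result is True, while a page with no pipes at all is still rejected, which matches the updated docstring's description of the check as a guard against pages with zero pipe characters.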