Lower syllable pipe-ratio threshold from 5% to 1%
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 24s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m58s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 16s
Real dictionary pages have only ~3% OCR-detected pipes because the thin syllable divider lines are hard for OCR to read. The primary false-positive guard (article_col_index check) already blocks synonym dictionaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -200,10 +200,9 @@ def insert_syllable_dividers(
     For dictionary pages: process all content column cells, strip existing
     pipes, merge pipe-gap spaces, and re-syllabify using pyphen.
 
-    Pre-check: at least 5% of content cells must already contain ``|`` from
-    OCR. This guards against false-positive dictionary detection on pages
-    like synonym dictionaries or alphabetical word lists that have no actual
-    syllable divider lines.
+    Pre-check: at least 1% of content cells must already contain ``|`` from
+    OCR. This guards against pages with zero pipe characters (the primary
+    guard — article_col_index — is checked at the call site).
 
     Returns the number of cells modified.
     """
@@ -227,10 +226,10 @@ def insert_syllable_dividers(
 
     if total_col_cells > 0:
         pipe_ratio = cells_with_pipes / total_col_cells
-        if pipe_ratio < 0.05:
+        if pipe_ratio < 0.01:
             logger.info(
                 "build-grid session %s: skipping syllable insertion — "
-                "only %.1f%% of cells have existing pipes (need >=5%%)",
+                "only %.1f%% of cells have existing pipes (need >=1%%)",
                 session_id, pipe_ratio * 100,
             )
             return 0
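To illustrate the effect of the threshold change, here is a minimal standalone sketch of the pre-check. It is not the code from insert_syllable_dividers: the helper names pipe_ratio and should_skip, the flat list of cell strings, and the sample page are all assumptions made for illustration. It mirrors the diff only in that the ratio is skipped when there are no cells and that insertion is skipped when the ratio falls below the threshold.

```python
def pipe_ratio(cell_texts):
    """Fraction of cells whose text contains at least one '|' (hypothetical helper)."""
    if not cell_texts:
        return 0.0
    cells_with_pipes = sum(1 for text in cell_texts if "|" in text)
    return cells_with_pipes / len(cell_texts)


def should_skip(cell_texts, threshold):
    """True when syllable insertion would be skipped for this page.

    As in the diff, an empty column never triggers the skip, because the
    ratio is only evaluated when there is at least one cell.
    """
    return bool(cell_texts) and pipe_ratio(cell_texts) < threshold


# Hypothetical page: 100 content cells, 3 of which kept an OCR-detected pipe,
# matching the ~3% figure quoted in the commit message.
dictionary_page = ["Ap|fel"] * 3 + ["Birne"] * 97
# Hypothetical page with no pipes at all.
plain_page = ["Birne"] * 100

print(should_skip(dictionary_page, 0.05))  # old threshold
print(should_skip(dictionary_page, 0.01))  # new threshold
print(should_skip(plain_page, 0.01))       # zero pipes
```

Under these assumptions, a page with 3% pipe cells is skipped at the old 5% threshold but processed at 1%, while a page with no pipes is still skipped.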