Lower syllable pipe-ratio threshold from 5% to 1%

On real dictionary pages, OCR detects pipes in only ~3% of content cells
because the thin syllable divider lines are hard for OCR to read, so the
5% threshold caused those pages to be skipped. The primary false-positive
guard (the article_col_index check) already blocks synonym dictionaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-24 23:17:08 +01:00
parent ed7fc99fc4
commit 4feec7c7b7

@@ -200,10 +200,9 @@ def insert_syllable_dividers(
     For dictionary pages: process all content column cells, strip existing
     pipes, merge pipe-gap spaces, and re-syllabify using pyphen.
-    Pre-check: at least 5% of content cells must already contain ``|`` from
-    OCR. This guards against false-positive dictionary detection on pages
-    like synonym dictionaries or alphabetical word lists that have no actual
-    syllable divider lines.
+    Pre-check: at least 1% of content cells must already contain ``|`` from
+    OCR. This guards against pages with zero pipe characters (the primary
+    guard — article_col_index — is checked at the call site).
     Returns the number of cells modified.
     """
@@ -227,10 +226,10 @@ def insert_syllable_dividers(
     if total_col_cells > 0:
         pipe_ratio = cells_with_pipes / total_col_cells
-        if pipe_ratio < 0.05:
+        if pipe_ratio < 0.01:
             logger.info(
                 "build-grid session %s: skipping syllable insertion — "
-                "only %.1f%% of cells have existing pipes (need >=5%%)",
+                "only %.1f%% of cells have existing pipes (need >=1%%)",
                 session_id, pipe_ratio * 100,
             )
             return 0
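
The effect of the threshold change can be illustrated with a small standalone sketch. It is not the code from this repository: the helper names `pipe_ratio` and `passes_pipe_precheck`, the `threshold` parameter, and the sample page are hypothetical, chosen only to mirror the ratio check in the hunk above on a page where about 3% of cells contain a pipe.

```python
def pipe_ratio(cells):
    """Fraction of cells whose text contains a '|' character."""
    if not cells:
        return 0.0
    return sum(1 for text in cells if "|" in text) / len(cells)


def passes_pipe_precheck(cells, threshold=0.01):
    """Return True if syllable insertion should proceed (hypothetical helper)."""
    # Mirrors the diff: the ratio is only evaluated when there is at least
    # one cell, so an empty page is not rejected by this check.
    if not cells:
        return True
    return pipe_ratio(cells) >= threshold


# Hypothetical page: 100 content cells, 3 of which kept a pipe from OCR.
page = ["Ap|fel", "Ba|na|ne", "Kir|sche"] + ["Wort"] * 97

print(passes_pipe_precheck(page, threshold=0.05))  # False (old threshold)
print(passes_pipe_precheck(page, threshold=0.01))  # True (new threshold)
```

A page with no pipes at all still fails at either threshold, which matches the reduced role the updated docstring gives this pre-check.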