fix: hard-filter OCR words inside detected graphic regions
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 26s
CI / test-go-edu-search (push) Successful in 27s
CI / test-python-klausur (push) Failing after 1m51s
CI / test-python-agent-core (push) Successful in 18s
CI / test-nodejs-website (push) Successful in 16s
Run detect_graphic_elements() in the grid pipeline after image loading and
remove ALL words whose centroids fall inside detected graphic regions,
regardless of confidence. Previously only low-confidence words (conf < 50)
were removed, letting artifacts like "Tr" and "Su" survive.

Changes:
- grid_editor_api.py: Import and call detect_graphic_elements() at Step 3a,
  passing only significant words (len >= 3) to avoid short artifacts fooling
  the text-vs-graphic heuristic. Hard-filter all words in graphic regions.
- cv_graphic_detect.py: Lower density threshold from 20% to 5% for large
  regions (>100x80px), because photos/illustrations have low color
  saturation. Raise page-spanning limit from 50% to 60% width/height.

Tested: 5 ground-truth sessions pass regression (079cd0d9, d8533a2c,
2838c7a7, 4233d7e3, 5997b635). Session 5997 now detects 2 graphic regions
and removes 29 artifact words including "Tr" and "Su".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
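
For readers without grid_editor_api.py at hand, the two filtering steps described in the message can be sketched roughly as below. This is an illustration only, not the code from this commit: the `Box` and `Word` types and the helper names `significant_words` and `hard_filter` are hypothetical, and the real signature of detect_graphic_elements() is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


@dataclass
class Word:
    """One OCR word with its confidence and bounding box (hypothetical type)."""
    text: str
    conf: float
    box: Box


def significant_words(words: List[Word], min_len: int = 3) -> List[Word]:
    """Keep only words long enough to count as real text for the detector."""
    return [w for w in words if len(w.text.strip()) >= min_len]


def hard_filter(words: List[Word], regions: List[Box]) -> List[Word]:
    """Drop every word whose centroid lies in a graphic region, ignoring conf."""
    kept = []
    for w in words:
        cx = w.box.x + w.box.w / 2
        cy = w.box.y + w.box.h / 2
        if any(r.contains(cx, cy) for r in regions):
            continue
        kept.append(w)
    return kept


words = [
    Word("Tr", 91.0, Box(110, 110, 20, 10)),    # high confidence, inside the graphic
    Word("apple", 95.0, Box(300, 50, 60, 12)),  # ordinary text outside the graphic
]
regions = [Box(100, 100, 200, 150)]
print([w.text for w in hard_filter(words, regions)])
```

In this sketch "Tr" is removed even though its confidence is well above the old cutoff of 50, which is the behavioral change the message describes.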
@@ -170,7 +170,7 @@ def detect_graphic_elements(
             continue
 
         # Skip page-spanning regions
-        if bw > w * 0.5 or bh > h * 0.5:
+        if bw > w * 0.6 or bh > h * 0.6:
             logger.debug("GraphicDetect PASS1 skip page-spanning (%d,%d) %dx%d", bx, by, bw, bh)
             continue
 
@@ -232,12 +232,16 @@ def detect_graphic_elements(
         if color_pixel_count < 200:
             continue
 
-        # (d) Very low density → thin strokes, almost certainly text
-        if density < 0.20:
+        # (d) Very low density → thin strokes, almost certainly text.
+        # Large regions (photos/illustrations) can have low color density
+        # because most pixels are grayscale ink. Use a lower threshold
+        # for regions bigger than 100×80 px.
+        _min_density = 0.05 if (bw > 100 and bh > 80) else 0.20
+        if density < _min_density:
             logger.info(
                 "GraphicDetect PASS1 skip low-density (%d,%d) %dx%d "
-                "density=%.0f%% (likely colored text)",
-                bx, by, bw, bh, density * 100,
+                "density=%.0f%% (min=%.0f%%, likely colored text)",
+                bx, by, bw, bh, density * 100, _min_density * 100,
             )
             continue
 
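
The combined effect of the two threshold changes can be checked in isolation with a small sketch like the one below. The helper names (`min_density_for`, `is_page_spanning`, `keep_candidate`) and the page size are assumptions made for illustration; the real detect_graphic_elements() applies further checks (for example the color_pixel_count guard) that are left out here.

```python
def min_density_for(bw: int, bh: int) -> float:
    """Density floor: 5% for regions larger than 100x80 px, otherwise 20%."""
    return 0.05 if (bw > 100 and bh > 80) else 0.20


def is_page_spanning(bw: int, bh: int, page_w: int, page_h: int,
                     limit: float = 0.6) -> bool:
    """True if the region exceeds `limit` of the page width or height."""
    return bw > page_w * limit or bh > page_h * limit


def keep_candidate(bw: int, bh: int, density: float,
                   page_w: int, page_h: int) -> bool:
    """Apply only the two checks touched by this diff (hypothetical helper)."""
    if is_page_spanning(bw, bh, page_w, page_h):
        return False
    return density >= min_density_for(bw, bh)


PAGE_W, PAGE_H = 1000, 1400  # assumed page size in pixels

# Large photo with 8% color density: kept now, skipped under the old 20% floor.
print(keep_candidate(300, 200, 0.08, PAGE_W, PAGE_H))
# Small region with the same density: still treated as colored text.
print(keep_candidate(90, 40, 0.08, PAGE_W, PAGE_H))
# Region 55% of the page width: kept now, skipped under the old 50% limit.
print(keep_candidate(550, 200, 0.50, PAGE_W, PAGE_H))
```

Keeping the 20% floor for small regions preserves the original intent of check (d): thin colored strokes in a small box are still most likely text.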