fix: extend tiny symbol filter to all non-black colors, raise area to 200
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s

Step 5i rule (a) only caught tiny blue symbols, so graphic fragments from
page illustrations (e.g. an orange quote mark from the illustration of a
man) were missed. Rule (a) now removes any non-black word_box with
area < 200 (previously 150) and confidence < 85.
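
The updated rule (a) can be sketched on its own as below. This is only an illustration, not the code in `_build_grid_core`: the function name `tiny_coloured_indices`, the constants, and the sample data are hypothetical, while the dictionary keys (`color_name`, `width`, `height`, `conf`) and their defaults are taken from the diff.

```python
from typing import List, Set

# Thresholds as stated in this commit; the constant names are hypothetical.
AREA_LIMIT = 200
CONF_LIMIT = 85


def tiny_coloured_indices(wbs: List[dict]) -> Set[int]:
    """Return indices of word_boxes matching rule (a).

    A box matches when it is non-black, its width * height is below
    AREA_LIMIT, and its OCR confidence is below CONF_LIMIT. Missing keys
    use the same defaults as the diff: colour "black", size 0, conf 100.
    """
    to_remove: Set[int] = set()
    for i, wb in enumerate(wbs):
        cn = wb.get("color_name", "black")
        area = wb.get("width", 0) * wb.get("height", 0)
        if cn != "black" and area < AREA_LIMIT and wb.get("conf", 100) < CONF_LIMIT:
            to_remove.add(i)
    return to_remove


# Made-up sample boxes, for illustration only.
sample = [
    {"text": "*", "color_name": "orange", "width": 12, "height": 14, "conf": 40},
    {"text": "e", "color_name": "blue", "width": 10, "height": 10, "conf": 60},
    {"text": "a", "color_name": "black", "width": 10, "height": 10, "conf": 60},
    {"text": "tightly", "color_name": "blue", "width": 60, "height": 18, "conf": 70},
    {"text": "x", "color_name": "red", "width": 10, "height": 10, "conf": 95},
]

print(sorted(tiny_coloured_indices(sample)))
```

In this sample only the first two boxes match: the black box is exempt by colour, the wide blue word exceeds the area limit, and the red box has a confidence of 95. Note that a box with no `color_name` key is treated as black and is therefore never removed by this rule.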

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benjamin Admin
2026-03-21 18:05:31 +01:00
parent 2acf8696bf
commit 4000110501


@@ -2292,7 +2292,7 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
     # OCR reads these as text artifacts (©, e, *, or even plausible words
     # like "fighily" overlapping the real word "tightly").
     # Detection rules:
-    #   a) Tiny blue symbols: area < 150 AND conf < 85
+    #   a) Tiny coloured symbols: area < 200 AND conf < 85 (any non-black)
     #   b) Overlapping word_boxes: >40% x-overlap → remove lower confidence
     #   c) Duplicate text: consecutive blue wbs with identical text, gap < 6px
     bullet_removed = 0
@@ -2303,10 +2303,11 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
             continue
         to_remove: set = set()
-        # Rule (a): tiny blue symbols
+        # Rule (a): tiny coloured symbols (bullets, graphic fragments)
         for i, wb in enumerate(wbs):
-            if (wb.get("color_name") == "blue"
-                    and wb.get("width", 0) * wb.get("height", 0) < 150
+            cn = wb.get("color_name", "black")
+            if (cn != "black"
+                    and wb.get("width", 0) * wb.get("height", 0) < 200
                     and wb.get("conf", 100) < 85):
                 to_remove.add(i)