fix: extend tiny symbol filter to all non-black colors, raise area to 200
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 25s
CI / test-go-edu-search (push) Successful in 25s
CI / test-python-klausur (push) Failing after 1m49s
CI / test-python-agent-core (push) Successful in 15s
CI / test-nodejs-website (push) Successful in 17s
Step 5i rule (a) only caught tiny blue symbols, so graphic fragments from page illustrations (e.g. an orange quote mark from the illustration of a man) were missed. The rule now filters any non-black word_box with area < 200 and confidence < 85.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -2292,7 +2292,7 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
     # OCR reads these as text artifacts (©, e, *, or even plausible words
     # like "fighily" overlapping the real word "tightly").
     # Detection rules:
-    # a) Tiny blue symbols: area < 150 AND conf < 85
+    # a) Tiny coloured symbols: area < 200 AND conf < 85 (any non-black)
     # b) Overlapping word_boxes: >40% x-overlap → remove lower confidence
     # c) Duplicate text: consecutive blue wbs with identical text, gap < 6px
     bullet_removed = 0
@@ -2303,10 +2303,11 @@ async def _build_grid_core(session_id: str, session: dict) -> dict:
             continue
         to_remove: set = set()
 
-        # Rule (a): tiny blue symbols
+        # Rule (a): tiny coloured symbols (bullets, graphic fragments)
         for i, wb in enumerate(wbs):
-            if (wb.get("color_name") == "blue"
-                    and wb.get("width", 0) * wb.get("height", 0) < 150
+            cn = wb.get("color_name", "black")
+            if (cn != "black"
+                    and wb.get("width", 0) * wb.get("height", 0) < 200
                     and wb.get("conf", 100) < 85):
                 to_remove.add(i)
 
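The updated rule (a) can be tried in isolation with a small sketch like the one below. It is not the project's code: the helper name `tiny_coloured_indices`, the constants `MAX_AREA` and `MIN_CONF`, and the sample word boxes are all hypothetical and exist only for illustration. The dictionary keys (`color_name`, `width`, `height`, `conf`), their defaults, and the thresholds are taken from the diff.

```python
from typing import Dict, List, Set

# Thresholds as stated in the commit; the constant names are hypothetical.
MAX_AREA = 200
MIN_CONF = 85


def tiny_coloured_indices(wbs: List[Dict]) -> Set[int]:
    """Return indices of word boxes that the new rule (a) would remove."""
    to_remove: Set[int] = set()
    for i, wb in enumerate(wbs):
        # A box without a colour is treated as black, so it is never removed here.
        colour = wb.get("color_name", "black")
        area = wb.get("width", 0) * wb.get("height", 0)
        if colour != "black" and area < MAX_AREA and wb.get("conf", 100) < MIN_CONF:
            to_remove.add(i)
    return to_remove


# Made-up sample data, for illustration only.
sample = [
    {"text": "tightly", "color_name": "black", "width": 60, "height": 14, "conf": 96},
    {"text": "*", "color_name": "blue", "width": 10, "height": 10, "conf": 40},
    {"text": "\u201c", "color_name": "orange", "width": 12, "height": 14, "conf": 55},
    {"text": "Unit", "color_name": "orange", "width": 40, "height": 16, "conf": 70},
    {"text": "e", "width": 8, "height": 9, "conf": 30},
]

print(sorted(tiny_coloured_indices(sample)))  # prints [1, 2]
```

In this sample the blue bullet (area 100) and the orange quote mark (area 168) are flagged, while the larger orange word and the box with no colour are kept. Under the previous blue-only rule with area < 150, only the bullet would have been flagged.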