fix: OCR-Artefakte (|, >) vor Cluster-Matching zusammenfuehren
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m23s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 22s
Some checks failed
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-school (push) Successful in 34s
CI / test-go-edu-search (push) Successful in 31s
CI / test-python-klausur (push) Failing after 2m23s
CI / test-python-agent-core (push) Successful in 22s
CI / test-nodejs-website (push) Successful in 22s
Box-Rahmen werden vom OCR als einzelne Symbole wie "|" oder ">" erkannt und als eigene Text-Gruppen behandelt. Das verfaelscht die Cluster-Zuordnung weil diese Artefakte entweder keinen eigenen Cluster erzeugen oder den falschen Cluster zugewiesen bekommen. Fix: Gruppen mit max 2 Zeichen ohne Buchstaben/Ziffern werden mit der benachbarten Gruppe zusammengefuehrt bevor die Cluster-Zuordnung laeuft. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -59,7 +59,28 @@ export function usePixelWordPositions(
|
||||
for (const cell of cells) {
|
||||
if (!cell.bbox_pct || !cell.text) continue
|
||||
|
||||
const groups = cell.text.split(/\s{3,}/).map(s => s.trim()).filter(Boolean)
|
||||
const rawGroups = cell.text.split(/\s{3,}/).map(s => s.trim()).filter(Boolean)
|
||||
|
||||
// Merge single-char symbol groups (OCR artifacts from box borders like "|", ">")
|
||||
// with their neighbour to avoid polluting the cluster-to-group matching
|
||||
const groups: string[] = []
|
||||
for (let gi = 0; gi < rawGroups.length; gi++) {
|
||||
const g = rawGroups[gi]
|
||||
const isArtifact = g.length <= 2 && !/[a-zA-Z0-9\u00C0-\u024F]/.test(g)
|
||||
if (isArtifact) {
|
||||
if (gi + 1 < rawGroups.length) {
|
||||
// merge with next group
|
||||
rawGroups[gi + 1] = g + ' ' + rawGroups[gi + 1]
|
||||
} else if (groups.length > 0) {
|
||||
// last group — merge with previous
|
||||
groups[groups.length - 1] += ' ' + g
|
||||
} else {
|
||||
groups.push(g)
|
||||
}
|
||||
} else {
|
||||
groups.push(g)
|
||||
}
|
||||
}
|
||||
|
||||
let cx: number, cy: number
|
||||
const cw = Math.round(cell.bbox_pct.w / 100 * imgW)
|
||||
|
||||
Reference in New Issue
Block a user