fix(agents): HTML entity decode before agent + pattern tolerates '('
CI / detect-changes (push) Successful in 6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 28s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
Bug on BMW: dsi-discovery returns HTML entities as
literal strings without decoding them. Example from the BMW imprint:
'wird gesetzlich durch den Vorstand (Milan Nedeljkovic, …)'
My pattern expects ':' / '.' / whitespace after Vorstand →
it does not match the '&' that starts the undecoded entity → false-positive HIGH finding.
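The mismatch described above can be reproduced with a minimal sketch. It rests on two assumptions: `OLD_PATTERN` is a simplified stand-in for the agent's real pattern, and `&nbsp;` is a hypothetical choice of entity, since the commit message does not show which entity dsi-discovery actually returned.

```python
import html
import re

# Simplified stand-in for the agent's pattern before the fix (assumption):
# a keyword followed by ':', '.' or whitespace.
OLD_PATTERN = re.compile(
    r"(?:Vorstand|Gesch(?:ä|ae)ftsf(?:ü|ue)hrer)\s*[:.\s]",
    re.IGNORECASE,
)

# Hypothetical undecoded input; the concrete entity (&nbsp;) is an assumption.
raw = "wird gesetzlich durch den Vorstand&nbsp;(Milan Nedeljkovic, …)"


def has_representative(text: str) -> bool:
    """Return True if the text names a representative according to OLD_PATTERN."""
    return OLD_PATTERN.search(text) is not None


# The '&' right after "Vorstand" is not in the allowed character class,
# so the undecoded text is missed and the check would report a finding.
missed = not has_representative(raw)

# After decoding, the entity becomes a no-break space, which \s matches.
found_after_decode = has_representative(html.unescape(raw))
```

In this sketch the undecoded string is missed, while the decoded string matches, which is the behaviour the commit message attributes to the real agent.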
Fix 1 (main fix): the test harness calls html.unescape() before agent.evaluate(),
so that every agent receives clean text, decoupled from
dsi-discovery quirks.
Fix 2 (belt and suspenders): the pattern now also tolerates '(' directly
after Vorstand/Geschaeftsfuehrer (in case decoding ever fails).
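The two fixes can be sketched together as follows. This is an illustration under stated assumptions, not the project's code: `NEW_PATTERN` is a simplified stand-in for the widened pattern, `evaluate` and `run_with_decode` are hypothetical names for the agent check and the harness step, and the sample strings (including the `&nbsp;` entity and the placeholder name) are made up.

```python
import html
import re
from typing import Callable

# Simplified stand-in for the widened pattern (assumption):
# '(' is added to the trailing character class (Fix 2).
NEW_PATTERN = re.compile(
    r"(?:Vorstand|Gesch(?:ä|ae)ftsf(?:ü|ue)hrer)\s*[:.\s(]",
    re.IGNORECASE,
)


def evaluate(text: str) -> bool:
    """Hypothetical agent check: True if a representative marker is found."""
    return NEW_PATTERN.search(text) is not None


def run_with_decode(text: str, agent: Callable[[str], bool] = evaluate) -> bool:
    """Fix 1: decode HTML entities once in the harness, then call the agent."""
    return agent(html.unescape(text))


# Made-up samples; the entity and the placeholder name are assumptions.
samples = {
    "entity": "Vorstand&nbsp;(Milan Nedeljkovic, …)",
    "paren": "Vorstand(Milan Nedeljkovic, …)",
    "colon": "Geschaeftsfuehrer: Max Mustermann",
}

results = {name: run_with_decode(text) for name, text in samples.items()}

# Without decoding, the widened pattern alone still misses the entity case,
# because '&' is not in the character class.
undecoded_entity_hit = evaluate(samples["entity"])
```

Decoding in the harness keeps each agent's pattern independent of the upstream text source. In this sketch the widened character class only covers the case where '(' follows the keyword directly, so it complements the decode step rather than replacing it.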
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -14,6 +14,7 @@ Endpoints:
 from __future__ import annotations

 import asyncio
+import html as html_lib
 import json
 import logging
 import uuid
@@ -244,6 +245,11 @@ async def _process_slot(
             "error": fetch_err,
         })
     if text:
+        # HTML entity decode: dsi-discovery sometimes delivers entities such as
+        # &amp; / &auml; as literal strings, and the agent's regex patterns
+        # would trip over them. We decode BEFORE the vault dump so that the
+        # raw_text stays readable as well.
+        text = html_lib.unescape(text)
         vault.put_bytes("raw", slot, "source.txt",
                         text.encode("utf-8"),
                         mime="text/plain")
@@ -108,7 +108,7 @@ MCS: tuple[MC, ...] = (
             r"Vertretungsberechtigt|vertreten\s+durch|"
             r"Inhaber(?:in)?|"
             r"Pers(?:ö|oe)nlich\s+haftend)"
-            r"\s*[:.\s]",
+            r"\s*[:.\s(]",
             re.IGNORECASE,
         ),
         re.compile(r"\bManagement\s*[:.\s]\s*[A-ZÄÖÜ]",