fix(agents): HTML entity decode before agent + pattern tolerates '('

Bug at BMW: dsi-discovery returns HTML entities (e.g. &nbsp;) as literal strings without decoding them. Example from the BMW imprint: 'wird gesetzlich durch den Vorstand (Milan Nedeljkovic, …)'. My pattern expects ':' / '.' / whitespace after "Vorstand", so it fails on the '&' that starts the undecoded entity → false-positive HIGH finding.

Fix 1 (main fix): the test harness calls html.unescape() before agent.evaluate(), so every agent receives clean text, decoupled from dsi-discovery quirks.

Fix 2 (belt and suspenders): the pattern now also tolerates '(' directly after Vorstand/Geschaeftsfuehrer (in case decoding ever fails).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
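
The two fixes described above can be reproduced in isolation. The following is a minimal sketch, not the code from this commit: `PATTERN_OLD` and `PATTERN_NEW` are hypothetical, simplified stand-ins for the role pattern in the diff, the sample text uses a placeholder name, and it is an assumption that the undecoded entity in the raw text was `&nbsp;`.

```python
import html
import re

# Hypothetical, simplified stand-ins for the role pattern touched in the diff.
# The only difference between them is the added '(' in the trailing class.
ROLE = r"\b(?:Vorstand|Gesch(?:ä|ae)ftsf(?:ü|ue)hrer)"
PATTERN_OLD = re.compile(ROLE + r"\s*[:.\s]", re.IGNORECASE)
PATTERN_NEW = re.compile(ROLE + r"\s*[:.\s(]", re.IGNORECASE)

# Assumed shape of the raw discovery output: the entity arrives undecoded.
raw = "gesetzlich vertreten durch den Vorstand&nbsp;(Max Mustermann)"

# Fix 1: decode entities before any pattern sees the text.
# html.unescape turns "&nbsp;" into U+00A0, which \s matches for str patterns.
decoded = html.unescape(raw)

# Fix 2 targets input where '(' follows the role word with no space at all.
no_space = "vertreten durch den Vorstand(Max Mustermann)"

old_on_raw = PATTERN_OLD.search(raw) is not None          # '&' blocks the match
old_on_decoded = PATTERN_OLD.search(decoded) is not None  # matches after decoding
old_on_no_space = PATTERN_OLD.search(no_space) is not None
new_on_no_space = PATTERN_NEW.search(no_space) is not None
```

In this sketch the widened character class covers only the no-space case; the still-encoded `Vorstand&nbsp;(` form is not matched by either pattern, which is why the decode step is the main fix.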
2026-06-08 18:45:37 +02:00
parent 361a5e7605
commit 593baace7c
2 changed files with 7 additions and 1 deletions
@@ -14,6 +14,7 @@ Endpoints:
 from __future__ import annotations

 import asyncio
+import html as html_lib
 import json
 import logging
 import uuid
@@ -244,6 +245,11 @@ async def _process_slot(
                "error": fetch_err,
            })
    if text:
+        # HTML entity decode: dsi-discovery sometimes returns &nbsp; / &amp;
+        # / &auml; as literal strings, and the agents' regex patterns would
+        # trip over them. We decode BEFORE the vault dump so that the
+        # raw_text stays readable as well.
+        text = html_lib.unescape(text)
        vault.put_bytes("raw", slot, "source.txt",
                         text.encode("utf-8"),
                         mime="text/plain")
@@ -108,7 +108,7 @@ MCS: tuple[MC, ...] = (
                r"Vertretungsberechtigt|vertreten\s+durch|"
                r"Inhaber(?:in)?|"
                r"Pers(?:ö|oe)nlich\s+haftend)"
-                r"\s*[:.\s]",
+                r"\s*[:.\s(]",
                re.IGNORECASE,
            ),
            re.compile(r"\bManagement\s*[:.\s]\s*[A-ZÄÖÜ]",