fix: 5 regex bugs + text extraction scroll + GT update

Root cause: Spiegel DSI text was truncated (lazy-loading), so the rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from live website: 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 01:20:55 +02:00
parent 8bb90d73e5
commit c702260ec1
6 changed files with 194 additions and 78 deletions
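
Reviewer sketch for fix 1 (scroll to bottom before reading innerText). This is not the code from dsi_discovery.py. It is a minimal illustration that assumes a Playwright-style sync `page` object exposing `evaluate()` and `wait_for_timeout()`; the function name `extract_full_text` and the parameters `pause_ms` and `max_rounds` are hypothetical.

```python
def extract_full_text(page, pause_ms=500, max_rounds=20):
    """Scroll until the page height stops growing, then read innerText.

    Assumption: `page` behaves like a Playwright sync Page
    (evaluate(expression) and wait_for_timeout(ms)).
    """
    last_height = -1
    for _ in range(max_rounds):
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            # No new content appeared after the previous scroll.
            break
        last_height = height
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazy-loaded sections time to render before re-measuring.
        page.wait_for_timeout(pause_ms)
    return page.evaluate("document.body.innerText")
```

The `max_rounds` cap is there so that an infinitely scrolling page cannot stall the extraction; a fixed pause is the simplest choice and may need tuning per site.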
@@ -135,8 +135,9 @@ IMPRESSUM_CHECKLIST = [
        "label": "USt-IdNr.",
        "level": 1, "parent": None,
        "patterns": [
-            r"ust.*id", r"umsatzsteuer.*identifikation",
-            r"vat.*id", r"de\s*\d{9}",
+            r"ust[\s.-]*id", r"umsatzsteuer[\s-]*id",
+            r"umsatzsteuer.*identifikation",
+            r"vat[\s.-]*id", r"de\s*\d{3}\s*\d{3}\s*\d{3}",
        ],
        "severity": "MEDIUM",
        "hint": "§5(1) Nr.6 TMG: Die USt-IdNr. muss angegeben werden, sofern vorhanden. Die Steuernummer ist KEIN Ersatz.",
@@ -146,7 +147,7 @@ IMPRESSUM_CHECKLIST = [
        "label": "USt-IdNr. im Format DE + 9 Ziffern",
        "level": 2, "parent": "vat",
        "patterns": [
-            r"de\s*\d{9}",
+            r"de\s*\d{3}\s*\d{3}\s*\d{3}",
        ],
        "severity": "LOW",
        "hint": "Deutsche USt-IdNr.: 'DE' + exakt 9 Ziffern (z.B. DE123456789). Validierung: https://evatr.bff-online.de/",
@@ -187,7 +188,8 @@ IMPRESSUM_CHECKLIST = [
        "patterns": [
            r"v\.?\s*i\.?\s*s\.?\s*d\.?\s*p",
            r"(?:redaktionell|inhaltlich)\s+verantwortlich",
-            r"§\s*18\s+m(?:edien)?st(?:aat)?v",
+            r"§\s*18\s+(?:abs\.?\s*\d+\s+)?m(?:edien)?st(?:aat)?v",
+            r"verantwortlich\w*\s+i\.?\s*s\.?\s*(?:d\.?\s*)?v\.?",
        ],
        "severity": "INFO",
        "hint": "Nur relevant wenn die Website journalistisch-redaktionelle Inhalte hat (Blog, Ratgeber, News, Fachartikel). Reine Unternehmensseiten ohne redaktionelle Inhalte benoetigen keinen V.i.S.d.P. Pruefen Sie, ob die Website einen Blog oder Ratgeber-Bereich hat.",
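
Reviewer sketch: a quick regression check for the pattern changes in the hunks above. It assumes the checker lowercases the text and applies each pattern with `re.search`, which this diff does not show; the helper `matches_any` and the sample strings are illustrative only.

```python
import re

# Patterns copied from the removed (-) and added (+) lines of the hunks.
OLD_VAT = [r"ust.*id", r"umsatzsteuer.*identifikation",
           r"vat.*id", r"de\s*\d{9}"]
NEW_VAT = [r"ust[\s.-]*id", r"umsatzsteuer[\s-]*id",
           r"umsatzsteuer.*identifikation",
           r"vat[\s.-]*id", r"de\s*\d{3}\s*\d{3}\s*\d{3}"]

OLD_VISDP = [r"v\.?\s*i\.?\s*s\.?\s*d\.?\s*p",
             r"(?:redaktionell|inhaltlich)\s+verantwortlich",
             r"§\s*18\s+m(?:edien)?st(?:aat)?v"]
NEW_VISDP = OLD_VISDP[:2] + [
    r"§\s*18\s+(?:abs\.?\s*\d+\s+)?m(?:edien)?st(?:aat)?v",
    r"verantwortlich\w*\s+i\.?\s*s\.?\s*(?:d\.?\s*)?v\.?",
]


def matches_any(patterns, text):
    """Hypothetical helper: lowercase the text, then try every pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in patterns)


vat_spaced = "Umsatzsteuer-ID: DE 212 442 423"
vat_compact = "USt-IdNr.: DE212442423"
visdp = "Verantwortlicher i.S.v. § 18 Abs. 2 MStV"

for label, old, new, sample in [
    ("vat spaced", OLD_VAT, NEW_VAT, vat_spaced),
    ("vat compact", OLD_VAT, NEW_VAT, vat_compact),
    ("visdp", OLD_VISDP, NEW_VISDP, visdp),
]:
    print(label, matches_any(old, sample), matches_any(new, sample))
```

Because `\s*` also matches zero whitespace, the new digit pattern still accepts the compact `DE212442423` form, so the level-2 format check does not regress for sites that write the number without spaces.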