fix: 5 regex bugs + text extraction scroll + GT update

Root cause: the Spiegel DSI text was truncated by lazy loading, so the rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from the live website. 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
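
For illustration only, the USt-IdNr fix (accepting the number with or without spaces) could look roughly like the sketch below. The names `UST_ID_NUMBER` and `find_ust_id` are hypothetical and are not taken from this commit; the actual pattern in the repository may differ. The only assumption about the data is that a German VAT ID is "DE" followed by nine digits.

```python
import re

# Hypothetical pattern (not the repository's): "DE" plus nine digits,
# optionally grouped in threes by single whitespace characters.
UST_ID_NUMBER = re.compile(r"\bDE\s?(\d{3})\s?(\d{3})\s?(\d{3})\b")


def find_ust_id(text):
    """Return the first VAT ID in normalized form (no spaces), or None."""
    match = UST_ID_NUMBER.search(text)
    if match is None:
        return None
    # Join the three digit groups so both spellings normalize identically.
    return "DE" + "".join(match.groups())
```

Normalizing to the space-free form means "DE 212 442 423" and "DE212442423" compare equal when checked against the ground truth.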
2026-05-13 01:20:55 +02:00
parent 8bb90d73e5
commit c702260ec1
6 changed files with 194 additions and 78 deletions
@@ -174,9 +174,14 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                "word_count": len(text.split()) if text else 0,
            })

-        # Step 1b: If same URL used for multiple doc_types, try section splitting
-        from compliance.services.section_splitter import split_shared_texts
+        # Step 1b: Section splitting — two cases:
+        # 1. Same URL used for multiple doc_types → split by heading
+        # 2. DSI text contains Cookie/Social-Media sections → auto-fill empty rows
+        from compliance.services.section_splitter import (
+            split_shared_texts, auto_fill_from_dsi,
+        )
        split_shared_texts(doc_entries, url_text_cache)
+        auto_fill_from_dsi(doc_entries)
        # Refresh doc_texts after splitting
        for entry in doc_entries:
            if entry.get("text"):