fix: text extraction 50k char limit was root cause of all Spiegel FNs

ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. The Spiegel DSI
has 107,720 chars (13,705 words), so only about 46% was extracted. DSB, Art. 77,
and Betroffenenrechte were all in the truncated portion.

Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions,
   and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
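
To illustrate the root cause, here is a minimal sketch (not the code in main.py) of how a hard character cut silently drops keywords that only occur late in a document. The helper name truncation_report and the synthetic text are hypothetical; only the 50,000 and 200,000 limits and the 107,720-char length come from this commit.

```python
def truncation_report(text: str, limit: int, keywords: list[str]) -> dict:
    """Report how much of `text` survives a hard cut at `limit` chars
    and which keywords exist in the full text but not in the kept part."""
    kept = text[:limit]
    full_lower = text.lower()
    kept_lower = kept.lower()
    lost = [
        k for k in keywords
        if k.lower() in full_lower and k.lower() not in kept_lower
    ]
    coverage = len(kept) / len(text) if text else 1.0
    return {"coverage": coverage, "lost_keywords": lost}


# Synthetic stand-in for a long privacy notice (hypothetical content):
# 107,720 chars with "Art. 77" placed after the 50k mark.
sample = "a" * 60_000 + " Art. 77 " + "b" * 47_711

old = truncation_report(sample, 50_000, ["Art. 77"])
new = truncation_report(sample, 200_000, ["Art. 77"])
```

With the old limit the keyword lands in lost_keywords; with the raised limit the whole text is kept and nothing is lost.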
2026-05-13 15:22:38 +02:00
parent 64e3a47b8c
commit 5e317d2f0f
5 changed files with 694 additions and 3 deletions
@@ -177,8 +177,20 @@ async def detect_business_profile(documents: dict[str, str]) -> BusinessProfile:
    profile.has_editorial_content = editorial_hits >= 2

    # ── Regulated profession ─────────────────────────────────────
+    # Only check impressum text (not full text) — keywords like "rechtsanwalt"
+    # appear as contact persons in DSI texts (e.g. Spiegel's "Rechtsanwalt Kruse")
+    # but that doesn't mean the company IS a law firm.
+    impressum_text = documents.get("impressum", "").lower().replace("\xad", "")
+    if not impressum_text:
+        impressum_text = full_text[:2000]  # Fallback: first 2000 chars
    for keyword, prof_type in _REGULATED_PROFESSIONS.items():
-        if keyword in full_text:
+        if keyword in impressum_text:
+            # Extra guard: "rechtsanwalt" must appear near the company description,
+            # not just as a contact person name
+            if keyword in ("rechtsanwalt", "rechtsanwaeltin", "rechtsanwältin"):
+                # Check if it's in the first 500 chars (company description area)
+                if keyword not in impressum_text[:500]:
+                    continue
            profile.is_regulated_profession = True
            profile.regulated_profession_type = prof_type
            break
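
The guard in this hunk can be sketched as a standalone function for testing. This is an approximation under stated assumptions, not the project's implementation: the REGULATED table and the function name are hypothetical (the real _REGULATED_PROFESSIONS is not shown in this hunk), and the fallback text is lowercased here on the assumption that full_text is already lowercased in the original.

```python
# Hypothetical keyword table standing in for _REGULATED_PROFESSIONS.
REGULATED = {"rechtsanwalt": "lawyer", "steuerberater": "tax_advisor"}

# Keywords that must appear in the first 500 chars to count.
GUARDED = ("rechtsanwalt", "rechtsanwaeltin", "rechtsanwältin")


def detect_regulated_profession(documents: dict, full_text: str):
    """Return the profession type found in the impressum, or None."""
    # Lowercase and strip soft hyphens, as in the diff.
    impressum = documents.get("impressum", "").lower().replace("\xad", "")
    if not impressum:
        # Fallback: first 2000 chars of the full text (lowercased here).
        impressum = full_text[:2000].lower()
    for keyword, prof_type in REGULATED.items():
        if keyword not in impressum:
            continue
        # Guarded keywords only count near the company description.
        if keyword in GUARDED and keyword not in impressum[:500]:
            continue
        return prof_type
    return None


law_firm = {"impressum": "Kanzlei Muster, Rechtsanwalt Dr. Muster"}
publisher = {"impressum": "Verlag " + "x" * 600 + " Rechtsanwalt Kruse"}
```

A law firm that names the profession up front is detected, while a publisher that lists a lawyer only as a late contact person is not.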