fix: text extraction 50k char limit was root cause of all Spiegel FNs

ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. The Spiegel DSI has 107,720 chars (13,705 words), so only about 46% was extracted. DSB, Art. 77, and Betroffenenrechte were all in the part that was cut off.

Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions, and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
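
The coverage numbers in the root-cause note can be re-derived with a few lines of Python. This is only a sanity-check sketch: `extracted_fraction` is a hypothetical helper, not part of main.py, and the character count is the one quoted in the message above.

```python
def extracted_fraction(total_chars: int, limit: int) -> float:
    """Fraction of a document that survives a text[:limit] slice."""
    if total_chars <= 0:
        return 1.0  # nothing to truncate
    return min(total_chars, limit) / total_chars


SPIEGEL_DSI_CHARS = 107_720  # figure quoted in the commit message

old = extracted_fraction(SPIEGEL_DSI_CHARS, 50_000)   # previous limit
new = extracted_fraction(SPIEGEL_DSI_CHARS, 200_000)  # limit after this commit

print(f"{old:.1%} {new:.1%}")  # prints "46.4% 100.0%"
```

Any document longer than 200,000 chars would still be cut, so the same check is worth running against the longest text in the ground-truth set.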
2026-05-13 15:22:38 +02:00
parent 64e3a47b8c
commit 5e317d2f0f
5 changed files with 694 additions and 3 deletions
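
Fixes 2 and 3 are not visible in the hunks below. As a rough sketch of the idea (not the code in this commit), a frame-by-frame lookup could look like the following. It assumes a Playwright-style page whose `frames` list includes the main frame and whose frames expose a synchronous `query_selector`; the async handlers in this repo would need `await`. The name `find_in_any_frame` is hypothetical.

```python
from typing import Any, Iterable, Optional


def find_in_any_frame(frames: Iterable[Any], selectors: Iterable[str]) -> Optional[Any]:
    """Return the first element matching any selector in any frame, else None.

    `frames` is assumed to be something like Playwright's `page.frames`.
    """
    selector_list = list(selectors)
    for frame in frames:
        for selector in selector_list:
            try:
                handle = frame.query_selector(selector)
            except Exception:
                # Assumption: a frame may detach mid-lookup; skip it and keep going.
                continue
            if handle is not None:
                return handle
    return None
```

With Playwright's sync API this would be called as `find_in_any_frame(page.frames, [...])`, so consent buttons rendered inside a Sourcepoint or Quantcast iframe are found even when the main document has no match.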
@@ -335,7 +335,7 @@ async def dsi_discovery(req: DSIDiscoveryRequest):
                doc_type=d.doc_type,
                word_count=d.word_count,
                text_preview=d.text[:500] if d.text else "",
-                full_text=d.text[:50000] if d.text else "",
+                full_text=d.text[:200000] if d.text else "",
            )
            for d in result.documents
        ],
@@ -417,7 +417,7 @@ async def discover_dsi_documents(
                        title=title, url=href, source_url=url,
                        language=lang,
                        doc_type="cross_domain" if not _is_allowed_domain(href, base_domain) else "html_page",
-                        text=text[:50000], word_count=len(text.split()),
+                        text=text[:200000], word_count=len(text.split()),
                    ))

                # Recursive: search THIS page for more DSI links