fix(consent): add Sourcepoint iframe handler + banner_detector fallback

Root cause: Spiegel DSI text was truncated because Sourcepoint consent wall
was not dismissed; dsi_helpers.py had no Sourcepoint handler.

Fixes:
1. Add Sourcepoint iframe click (frame_locator + .sp_choice_type_11)
2. Add banner_detector fallback (reuses 30 CMP selectors from scanner)
3. After banner dismiss, wait and re-navigate if page redirected
4. Add "Zustimmen und weiter" to generic text button list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
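
Illustrative sketch (not the code from dsi_helpers.py): the Sourcepoint iframe click in fix 1 might look roughly like the following, assuming a Playwright async `page`. The function name `try_dismiss_sourcepoint`, the iframe selector, and the timeout are assumptions made for illustration; only `frame_locator` and `.sp_choice_type_11` come from the commit message.

```python
# Assumed selector for the Sourcepoint message iframe (not taken from the commit).
SP_IFRAME_SELECTOR = 'iframe[id^="sp_message_iframe"]'
# Accept-button class named in the commit message.
SP_ACCEPT_SELECTOR = ".sp_choice_type_11"


async def try_dismiss_sourcepoint(page, timeout_ms: int = 3000) -> bool:
    """Click the accept button inside the Sourcepoint iframe, if present.

    Hypothetical helper: returns True when the click succeeded and False
    otherwise, so a caller can fall through to other consent handlers.
    """
    try:
        # The consent wall lives in an iframe, so the button has to be
        # located through the frame rather than on the top-level page.
        frame = page.frame_locator(SP_IFRAME_SELECTOR)
        await frame.locator(SP_ACCEPT_SELECTOR).first.click(timeout=timeout_ms)
        return True
    except Exception:
        # No Sourcepoint iframe, button not found in time, or click failed.
        return False
```

Returning a boolean instead of raising matches how the diff below uses `banner_dismissed` to decide whether to wait and re-navigate.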
2026-05-13 10:12:50 +02:00
parent 733d2bcc7b
commit b2c1f0ae84
2 changed files with 42 additions and 3 deletions
@@ -249,7 +249,17 @@ async def discover_dsi_documents(
        # Step 1b: Try dismissing cookie consent banners before extraction.
        # Many German sites (dm.de, Zalando, etc.) block page content behind
        # a consent wall. Dismissing it reveals the actual DSI text.
-        await try_dismiss_consent_banner(page)
+        banner_dismissed = await try_dismiss_consent_banner(page)
+        if banner_dismissed:
+            # After consent, page may reload or reveal hidden content
+            await page.wait_for_timeout(2000)
+            # Re-navigate if the page redirected after consent
+            try:
+                if page.url != url:
+                    await goto_resilient(page, url, timeout=30000)
+                    await page.wait_for_timeout(2000)
+            except Exception:
+                pass

        # Step 1c: Self-extraction — if the URL itself is a DSI page,
        # extract its full text as the first document. This handles the