fix(consent): add Sourcepoint iframe handler + banner_detector fallback

Root cause: Spiegel DSI text was truncated because the Sourcepoint consent
wall was not dismissed; dsi_helpers.py had no Sourcepoint handler.

Fixes:
1. Add Sourcepoint iframe click (frame_locator + .sp_choice_type_11)
2. Add banner_detector fallback (reuses 30 CMP selectors from scanner)
3. After banner dismiss, wait and re-navigate if page redirected
4. Add "Zustimmen und weiter" to generic text button list
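
A minimal sketch of fix 1, assuming a Playwright async `page` object. The
iframe selector and the function name `dismiss_sourcepoint` are
illustrative assumptions, not taken from dsi_helpers.py; only
`.sp_choice_type_11` and the use of `frame_locator` come from this commit
message.

```python
from typing import Any

# Assumed selector for the Sourcepoint message iframe (not from the commit).
SP_IFRAME_SELECTOR = 'iframe[id^="sp_message_iframe"]'
# Accept-button class named in the commit message.
SP_ACCEPT_SELECTOR = ".sp_choice_type_11"


async def dismiss_sourcepoint(page: Any, timeout_ms: int = 3000) -> bool:
    """Try to click the Sourcepoint accept button inside its iframe.

    Returns True if the click succeeded, False otherwise, so a caller
    can fall through to other strategies (e.g. a generic banner detector).
    """
    try:
        # The button lives in a separate frame, so a plain page.locator()
        # would not find it; frame_locator() scopes the lookup to the iframe.
        button = (
            page.frame_locator(SP_IFRAME_SELECTOR)
            .locator(SP_ACCEPT_SELECTOR)
            .first
        )
        await button.click(timeout=timeout_ms)
        return True
    except Exception:
        # No Sourcepoint iframe present, or the button never became clickable.
        return False
```

Returning a boolean rather than raising keeps the handler composable: the
caller can try it first and move on to the fallback only when it reports
False.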

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-13 10:12:50 +02:00
parent 733d2bcc7b
commit b2c1f0ae84
2 changed files with 42 additions and 3 deletions
+11 -1
@@ -249,7 +249,17 @@ async def discover_dsi_documents(
     # Step 1b: Try dismissing cookie consent banners before extraction.
     # Many German sites (dm.de, Zalando, etc.) block page content behind
     # a consent wall. Dismissing it reveals the actual DSI text.
-    await try_dismiss_consent_banner(page)
+    banner_dismissed = await try_dismiss_consent_banner(page)
+    if banner_dismissed:
+        # After consent, page may reload or reveal hidden content
+        await page.wait_for_timeout(2000)
+        # Re-navigate if the page redirected after consent
+        try:
+            if page.url != url:
+                await goto_resilient(page, url, timeout=30000)
+                await page.wait_for_timeout(2000)
+        except Exception:
+            pass
     # Step 1c: Self-extraction — if the URL itself is a DSI page,
     # extract its full text as the first document. This handles the