fix(cookie-extract): max_documents=1 + faster networkidle bail (Phase 0 fix)

Root cause of the recurring 603-word BMW result:
- DSI discovery for the cookie-policy URL was hitting 4x networkidle timeouts
  (60s each = ~240s total).
- The backend httpx timeout (180s after the previous fix) gave up before the
  consent-tester finished, falling through to the raw HTTP fetch, which
  returned BMW's SSR navigation chrome (603 words) as the 'cookie policy'.
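
The timing mismatch above can be checked with a few lines of arithmetic. This is only a sketch: the helper name is hypothetical, it ignores navigation and render-wait time, and the assumption that max_documents=1 means a single page load is ours, not something the diff confirms.

```python
def worst_case_idle_wait_s(page_loads: int, networkidle_timeout_s: float) -> float:
    """Upper bound on waiting time if every page load hits the networkidle timeout."""
    return page_loads * networkidle_timeout_s


# Before: the page itself plus up to 3 sub-links, 60s networkidle window each.
before_s = worst_case_idle_wait_s(page_loads=4, networkidle_timeout_s=60)
# After (assumed): only the page itself, 15s networkidle window.
after_s = worst_case_idle_wait_s(page_loads=1, networkidle_timeout_s=15)

# Compare each bound with the backend httpx timeout in force at the time.
before_fits = before_s <= 180  # old client timeout
after_fits = after_s <= 120    # new client timeout
```

Under these assumptions the old bound (240s) exceeds the 180s client timeout, while the new bound (15s) leaves ample headroom under 120s.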

Two orthogonal fixes:
1. _fetch_text now passes max_documents=1 for user-specified URLs. We only
   want self-extraction of THAT page; link-following is unnecessary noise.
   The httpx timeout on the same call drops from 180s to 120s.
2. networkidle wait_until window dropped 60s -> 15s. SPAs like BMW/Daimler
   never reach networkidle anyway, so the 60s wait was pure latency. On
   timeout it still falls through to domcontentloaded + 5s render-wait, as
   before.
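
The second fix can be sketched roughly as follows. This is not the consent-tester's actual code: `load_with_fallback`, `FakePage`, and the 30s `dom_timeout_ms` are hypothetical, and `page` is only assumed to expose Playwright-style async `goto` and `wait_for_timeout` methods (timeouts in milliseconds). With real Playwright you would pass `playwright.async_api.TimeoutError` as `timeout_error`.

```python
import asyncio


async def load_with_fallback(page, url, *, idle_timeout_ms=15_000,
                             dom_timeout_ms=30_000, render_wait_ms=5_000,
                             timeout_error=TimeoutError):
    """Try networkidle briefly, then fall back to domcontentloaded + fixed wait."""
    try:
        await page.goto(url, wait_until="networkidle", timeout=idle_timeout_ms)
        return "networkidle"
    except timeout_error:
        # Pages with long-lived connections may never go idle: settle for
        # DOMContentLoaded and give client-side rendering a fixed grace period.
        await page.goto(url, wait_until="domcontentloaded", timeout=dom_timeout_ms)
        await page.wait_for_timeout(render_wait_ms)
        return "domcontentloaded"


class FakePage:
    """Hypothetical stand-in for a page that never reaches networkidle."""

    def __init__(self):
        self.calls = []

    async def goto(self, url, wait_until, timeout):
        self.calls.append((wait_until, timeout))
        if wait_until == "networkidle":
            raise TimeoutError("networkidle not reached")

    async def wait_for_timeout(self, ms):
        self.calls.append(("render-wait", ms))


fake_page = FakePage()
strategy = asyncio.run(load_with_fallback(fake_page, "https://example.com/cookies"))
```

Shrinking `idle_timeout_ms` only changes how long the first attempt may block; the fallback path is untouched, which matches the "same as before" note above.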
Benjamin Admin
2026-05-16 22:53:23 +02:00
parent 69729ef6ac
commit 9814b56f2f
2 changed files with 15 additions and 12 deletions
@@ -409,16 +409,17 @@ async def _fetch_text(url: str) -> str:
2. Fallback: direct HTTP fetch + HTML strip — fast, works for SSR pages
"""
     # 1. Consent-tester (Playwright-based, full JS rendering).
-    # Timeout 180s: a single dsi-discovery does self-extraction + follows up
-    # to 3 sub-links + waits for CMP JSON payloads. 60s was tight enough that
-    # cookie-policy pages on big SPAs (BMW, Daimler) timed out and fell back
-    # to the raw HTTP fetch, which returned site navigation as garbage text.
+    # max_documents=1: for a *specific* user-entered URL (cookie, impressum,
+    # privacy) we only want the self-extracted text of THAT page. Following
+    # sub-links was triggering 4x networkidle timeouts (~240s) and made the
+    # backend httpx call time out, dropping us to the raw HTTP fallback
+    # which returned site navigation as garbage text.
     try:
-        async with httpx.AsyncClient(timeout=180.0) as client:
+        async with httpx.AsyncClient(timeout=120.0) as client:
             resp = await client.post(
                 f"{CONSENT_TESTER_URL}/dsi-discovery",
-                json={"url": url, "max_documents": 3},
-                timeout=180.0,
+                json={"url": url, "max_documents": 1},
+                timeout=120.0,
             )
             if resp.status_code == 200:
                 docs = resp.json().get("documents", [])
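
The two-tier behaviour this hunk relies on (rendered text first, raw HTTP fetch if the renderer does not answer in time) can be sketched with the standard library alone. Everything here is illustrative: `fetch_text_with_fallback`, `slow_render`, `fast_render`, and `raw_fetch` are hypothetical stand-ins, not the real `_fetch_text` or its helpers.

```python
import asyncio


async def fetch_text_with_fallback(url, render, raw_fetch, *, budget_s=120.0):
    """Return (source, text): rendered text if it arrives in time, else raw text."""
    try:
        text = await asyncio.wait_for(render(url), timeout=budget_s)
        if text:
            return "rendered", text
    except Exception:
        # Timeout or transport error: fall through to the raw fetch.
        pass
    return "raw", await raw_fetch(url)


async def slow_render(url):
    # Stand-in for a renderer that overruns the caller's budget.
    await asyncio.sleep(1)
    return "full cookie policy"


async def fast_render(url):
    return "full cookie policy"


async def raw_fetch(url):
    # Stand-in for the direct HTTP fetch; may return navigation chrome only.
    return "site navigation"


timed_out = asyncio.run(
    fetch_text_with_fallback("https://example.com", slow_render, raw_fetch, budget_s=0.01)
)
in_time = asyncio.run(
    fetch_text_with_fallback("https://example.com", fast_render, raw_fetch)
)
```

The sketch shows why the renderer's worst-case duration has to stay below the caller's budget: once the budget is exceeded, the caller silently gets the lower-quality raw text.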