fix(cookie-extract): max_documents=1 + faster networkidle bail (Phase 0 fix)

Root cause of the recurring 603-word BMW result:
- DSI discovery for cookie-policy URL was hitting 4x networkidle timeouts
  (60s each = ~240s total).
- Backend httpx timeout (180s after the previous fix) gave up before the
  consent-tester finished, falling through to the raw HTTP fetch which
  returned BMW's SSR navigation chrome (603 words) as the 'cookie policy'.

Two orthogonal fixes:
1. _fetch_text now passes max_documents=1 for user-specified URLs. We only
   want self-extraction of THAT page; link-following is unnecessary noise.
2. networkidle wait_until window dropped 60s -> 15s. SPAs like BMW/Daimler
   never reach networkidle anyway; the 60s wait was pure latency. Falls
   through to domcontentloaded+5s render-wait, same as before.
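
The timing argument above can be checked with a rough sketch. Assumptions,
not taken from the code: the four networkidle timeouts correspond to four
sequential page navigations, each followed by the 5s render-wait, and
max_documents=1 reduces that to one navigation. The helper name is
hypothetical.

```python
# Rough latency budget for the fallback path (illustrative only).
BACKEND_TIMEOUT_S = 180  # backend httpx timeout named in the commit message
RENDER_WAIT_S = 5        # fixed render-wait after domcontentloaded


def min_fallback_latency_s(networkidle_timeout_s: int, pages: int) -> int:
    """Lower bound in seconds when every page times out on networkidle.

    Ignores the time domcontentloaded itself takes, so real latency is higher.
    """
    return pages * (networkidle_timeout_s + RENDER_WAIT_S)


# Assumption: 4 sequential navigations before the fix.
before = min_fallback_latency_s(60, pages=4)
# Only the networkidle window shortened, still 4 navigations.
after_window_only = min_fallback_latency_s(15, pages=4)
# Both fixes applied, assuming max_documents=1 means a single navigation.
after_both = min_fallback_latency_s(15, pages=1)

for label, seconds in [
    ("before", before),
    ("15s window only", after_window_only),
    ("15s window + max_documents=1", after_both),
]:
    verdict = "exceeds" if seconds > BACKEND_TIMEOUT_S else "fits within"
    print(f"{label}: >= {seconds}s, {verdict} the {BACKEND_TIMEOUT_S}s backend timeout")
```

Under these assumptions either fix alone would likely bring the wait under the
backend timeout; applying both leaves more headroom.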
Benjamin Admin
2026-05-16 22:53:23 +02:00
parent 69729ef6ac
commit 9814b56f2f
2 changed files with 15 additions and 12 deletions
+7 -5
@@ -14,14 +14,16 @@ logger = logging.getLogger(__name__)
 async def goto_resilient(page: Page, url: str, timeout: int = 60000) -> None:
     """Navigate to URL with fallback: try networkidle first, then domcontentloaded.
 
-    SPAs like Zalando never reach networkidle because of continuous background
-    requests. Falling back to domcontentloaded + a short wait gives JS time to
-    render the main content without waiting for every network request to finish.
+    SPAs like Zalando, BMW, Daimler never reach networkidle because of continuous
+    background requests (analytics, lazy-loaded assets, polling). The 60s wait
+    for networkidle is essentially always a 60s waste on those. We try briefly
+    (15s) and fall through to domcontentloaded + a 5s render-wait.
     """
+    networkidle_timeout = min(timeout, 15000)
     try:
-        await page.goto(url, wait_until="networkidle", timeout=timeout)
+        await page.goto(url, wait_until="networkidle", timeout=networkidle_timeout)
     except PlaywrightTimeout:
-        logger.info("networkidle timeout for %s, falling back to domcontentloaded", url)
+        logger.debug("networkidle timeout for %s, falling back to domcontentloaded", url)
         await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
         await page.wait_for_timeout(5000)  # extra wait for JS rendering
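
The fallback order in this hunk can be exercised without a browser. The sketch
below is illustrative and not part of the commit: FakePage and run_demo are
hypothetical test doubles, PlaywrightTimeout is a local stand-in for
Playwright's timeout error, and the 5s render-wait is assumed to sit inside
the except branch (the scraped diff lost its indentation).

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class PlaywrightTimeout(Exception):
    """Local stand-in for Playwright's TimeoutError (assumption for this sketch)."""


class FakePage:
    """Hypothetical page double that never reaches networkidle."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, int]] = []

    async def goto(self, url: str, wait_until: str, timeout: int) -> None:
        self.calls.append((wait_until, timeout))
        if wait_until == "networkidle":
            # Simulate an SPA with continuous background requests.
            raise PlaywrightTimeout(f"Timeout {timeout}ms exceeded")

    async def wait_for_timeout(self, ms: int) -> None:
        self.calls.append(("wait", ms))  # record the wait instead of sleeping


async def goto_resilient(page, url: str, timeout: int = 60000) -> None:
    # Cap the networkidle attempt at 15s; the fallback keeps the full timeout.
    networkidle_timeout = min(timeout, 15000)
    try:
        await page.goto(url, wait_until="networkidle", timeout=networkidle_timeout)
    except PlaywrightTimeout:
        logger.debug("networkidle timeout for %s, falling back to domcontentloaded", url)
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
        await page.wait_for_timeout(5000)  # assumed to belong to the fallback branch


def run_demo(timeout: int = 60000) -> list[tuple[str, int]]:
    """Run goto_resilient against the fake page and return the recorded calls."""
    page = FakePage()
    asyncio.run(goto_resilient(page, "https://example.com/cookie-policy", timeout))
    return page.calls


if __name__ == "__main__":
    print(run_demo())
```

The min() cap means a caller passing a timeout below 15000 ms still gets that
shorter value for the networkidle attempt, while the domcontentloaded fallback
always receives the caller's full timeout.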