fix(cookie-extract): max_documents=1 + faster networkidle bail (Phase 0 fix)
Root cause of the recurring 603-word BMW result:
- DSI discovery for the cookie-policy URL was hitting 4x networkidle timeouts (60s each = ~240s total).
- The backend httpx timeout (180s after the previous fix) gave up before the consent-tester finished, falling through to the raw HTTP fetch, which returned BMW's SSR navigation chrome (603 words) as the 'cookie policy'.

Two orthogonal fixes:
1. _fetch_text now passes max_documents=1 for user-specified URLs. We only want self-extraction of THAT page; link-following is unnecessary noise.
2. networkidle wait_until window dropped 60s -> 15s. SPAs like BMW/Daimler never reach networkidle anyway; the 60s wait was pure latency. Falls through to domcontentloaded+5s render-wait, same as before.

Also: backend httpx timeout lowered 180s -> 120s, and the networkidle fallback log demoted from info to debug.
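The timeout arithmetic behind the root cause can be checked in a few lines. This is a back-of-the-envelope sketch, not code from the repository: the helper name `worst_case_seconds` is hypothetical, and it assumes every navigation burns one full networkidle window plus the 5s render-wait, ignoring the fallback page load itself and any other overhead.

```python
def worst_case_seconds(navigations: int, networkidle_timeout_s: float,
                       render_wait_s: float = 5.0) -> float:
    """Rough lower bound when every navigation misses networkidle (assumption)."""
    return navigations * (networkidle_timeout_s + render_wait_s)


# Before: the page itself + up to 3 sub-links, 60s networkidle window each.
before = worst_case_seconds(navigations=4, networkidle_timeout_s=60.0)

# After: max_documents=1 (assumed to mean the page itself only), 15s window.
after = worst_case_seconds(navigations=1, networkidle_timeout_s=15.0)

print(before, before > 180.0)  # longer than the old 180s backend timeout
print(after, after < 120.0)    # well inside the new 120s backend timeout
```

Under these assumptions the old configuration could not finish inside the backend timeout on a page that never goes idle, while the new one leaves ample headroom.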
@@ -409,16 +409,17 @@ async def _fetch_text(url: str) -> str:
     2. Fallback: direct HTTP fetch + HTML strip — fast, works for SSR pages
     """
     # 1. Consent-tester (Playwright-based, full JS rendering).
-    # Timeout 180s: a single dsi-discovery does self-extraction + follows up
-    # to 3 sub-links + waits for CMP JSON payloads. 60s was tight enough that
-    # cookie-policy pages on big SPAs (BMW, Daimler) timed out and fell back
-    # to the raw HTTP fetch, which returned site navigation as garbage text.
+    # max_documents=1: for a *specific* user-entered URL (cookie, impressum,
+    # privacy) we only want the self-extracted text of THAT page. Following
+    # sub-links was triggering 4x networkidle timeouts (~240s) and made the
+    # backend httpx call time out, dropping us to the raw HTTP fallback
+    # which returned site navigation as garbage text.
     try:
-        async with httpx.AsyncClient(timeout=180.0) as client:
+        async with httpx.AsyncClient(timeout=120.0) as client:
             resp = await client.post(
                 f"{CONSENT_TESTER_URL}/dsi-discovery",
-                json={"url": url, "max_documents": 3},
-                timeout=180.0,
+                json={"url": url, "max_documents": 1},
+                timeout=120.0,
             )
             if resp.status_code == 200:
                 docs = resp.json().get("documents", [])
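The failure mode this hunk addresses, a slow rendered fetch silently degrading into the raw HTTP fallback, can be reproduced in miniature. The sketch below is illustrative only: `slow_rendered_fetch`, `raw_http_fetch`, and `fetch_text` are hypothetical stand-ins, and `asyncio.wait_for` is swapped in for the real httpx timeout so the example runs without a network.

```python
import asyncio


async def slow_rendered_fetch(url: str) -> str:
    # Hypothetical stand-in for the consent-tester call (takes 0.2s here).
    await asyncio.sleep(0.2)
    return "full cookie policy text"


async def raw_http_fetch(url: str) -> str:
    # Hypothetical stand-in for the raw HTTP fallback on an SPA:
    # only navigation chrome comes back.
    return "Home Models Shop Login"


async def fetch_text(url: str, budget_s: float) -> str:
    """Try the rendered fetch within budget_s, else fall back to raw HTTP."""
    try:
        return await asyncio.wait_for(slow_rendered_fetch(url), timeout=budget_s)
    except asyncio.TimeoutError:
        return await raw_http_fetch(url)


too_tight = asyncio.run(fetch_text("https://example.com/cookies", budget_s=0.05))
enough = asyncio.run(fetch_text("https://example.com/cookies", budget_s=1.0))
print(too_tight)
print(enough)
```

When the budget is shorter than the rendered fetch, the caller still gets a string back, just the wrong one, which is why the symptom showed up as a short "policy" rather than as an error.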
@@ -14,14 +14,16 @@ logger = logging.getLogger(__name__)
 async def goto_resilient(page: Page, url: str, timeout: int = 60000) -> None:
     """Navigate to URL with fallback: try networkidle first, then domcontentloaded.
 
-    SPAs like Zalando never reach networkidle because of continuous background
-    requests. Falling back to domcontentloaded + a short wait gives JS time to
-    render the main content without waiting for every network request to finish.
+    SPAs like Zalando, BMW, Daimler never reach networkidle because of continuous
+    background requests (analytics, lazy-loaded assets, polling). The 60s wait
+    for networkidle is essentially always a 60s waste on those. We try briefly
+    (15s) and fall through to domcontentloaded + a 5s render-wait.
     """
+    networkidle_timeout = min(timeout, 15000)
     try:
-        await page.goto(url, wait_until="networkidle", timeout=timeout)
+        await page.goto(url, wait_until="networkidle", timeout=networkidle_timeout)
     except PlaywrightTimeout:
-        logger.info("networkidle timeout for %s, falling back to domcontentloaded", url)
+        logger.debug("networkidle timeout for %s, falling back to domcontentloaded", url)
         await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
         await page.wait_for_timeout(5000)  # extra wait for JS rendering
 
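The capped-networkidle behaviour can be exercised without a browser. The following is a minimal sketch under stated assumptions: `FakePage` and `FakeTimeout` are hypothetical test doubles standing in for Playwright's `Page` and its timeout exception, and the function body mirrors the new side of the hunk with logging omitted.

```python
import asyncio


class FakeTimeout(Exception):
    """Hypothetical stand-in for Playwright's timeout exception."""


class FakePage:
    """Hypothetical page that never reaches networkidle, like an SPA."""

    def __init__(self) -> None:
        self.calls = []

    async def goto(self, url: str, wait_until: str, timeout: int) -> None:
        self.calls.append((wait_until, timeout))
        if wait_until == "networkidle":
            raise FakeTimeout()

    async def wait_for_timeout(self, ms: int) -> None:
        self.calls.append(("wait", ms))


async def goto_resilient(page, url: str, timeout: int = 60000) -> None:
    # Cap the networkidle attempt at 15s, but never exceed the caller's timeout.
    networkidle_timeout = min(timeout, 15000)
    try:
        await page.goto(url, wait_until="networkidle", timeout=networkidle_timeout)
    except FakeTimeout:
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
        await page.wait_for_timeout(5000)  # extra wait for JS rendering


default_page = FakePage()
asyncio.run(goto_resilient(default_page, "https://example.com/cookies"))

short_page = FakePage()
asyncio.run(goto_resilient(short_page, "https://example.com/cookies", timeout=10000))

print(default_page.calls)
print(short_page.calls)
```

Note that the `min(...)` only shortens the first attempt: the domcontentloaded fallback still receives the caller's full timeout, and a caller passing less than 15s keeps its own, smaller limit for both attempts.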