fix(agent): bump _fetch_text timeout 60s->180s
The dsi-discovery endpoint in consent-tester performs self-extraction, follows up to 3 sub-links, and waits for CMP JSON payloads. On big SPAs (BMW, Daimler) this routinely exceeds 60s. When it timed out, the HTTP fallback returned the SSR shell as text. For the BMW cookie page that is 603 words of site navigation, which then registered as 'Cookie-Richtlinie nicht im eingereichten Text' ("cookie policy not in the submitted text") (33%). With 180s the consent-tester finishes cleanly and we get the CMP-captured 1824 words of real policy.
@@ -408,13 +408,17 @@ async def _fetch_text(url: str) -> str:
       1. Try consent-tester (Playwright) — handles JS-heavy SPAs
       2. Fallback: direct HTTP fetch + HTML strip — fast, works for SSR pages
     """
-    # 1. Consent-tester (Playwright-based, full JS rendering)
+    # 1. Consent-tester (Playwright-based, full JS rendering).
+    # Timeout 180s: a single dsi-discovery does self-extraction + follows up
+    # to 3 sub-links + waits for CMP JSON payloads. 60s was tight enough that
+    # cookie-policy pages on big SPAs (BMW, Daimler) timed out and fell back
+    # to the raw HTTP fetch, which returned site navigation as garbage text.
     try:
-        async with httpx.AsyncClient(timeout=60.0) as client:
+        async with httpx.AsyncClient(timeout=180.0) as client:
             resp = await client.post(
                 f"{CONSENT_TESTER_URL}/dsi-discovery",
                 json={"url": url, "max_documents": 3},
-                timeout=60.0,
+                timeout=180.0,
             )
             if resp.status_code == 200:
                 docs = resp.json().get("documents", [])
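The failure mode described in the commit message can be illustrated with a minimal, stdlib-only sketch. It is not the code from this repository: `slow_discovery`, `http_fallback`, and `fetch_text` are hypothetical stand-ins, `asyncio.wait_for` is swapped in for the real httpx timeout, and the durations are scaled down so the example runs quickly. It only shows how a too-tight timeout silently swaps the real result for the fallback text.

```python
import asyncio


# Hypothetical stand-in for the consent-tester request; not from the commit.
async def slow_discovery(delay: float) -> str:
    """Pretend to render the page and collect the policy, taking `delay` seconds."""
    await asyncio.sleep(delay)
    return "real policy text"


# Hypothetical stand-in for the direct HTTP fetch, which only sees the SSR shell.
async def http_fallback() -> str:
    return "site navigation"


async def fetch_text(delay: float, timeout: float) -> str:
    """Try the slow path within `timeout` seconds; fall back if it does not finish."""
    try:
        return await asyncio.wait_for(slow_discovery(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # The caller cannot tell this apart from a successful fetch.
        return await http_fallback()


def demo() -> tuple:
    # 0.2 s of work against a 0.05 s budget, then against a 1.0 s budget.
    tight = asyncio.run(fetch_text(delay=0.2, timeout=0.05))
    generous = asyncio.run(fetch_text(delay=0.2, timeout=1.0))
    return tight, generous


if __name__ == "__main__":
    print(demo())
```

With the tight budget the function returns the fallback text, and with the generous budget it returns the discovery result, which mirrors the 60s versus 180s behaviour the commit describes.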