BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json:
/epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json
The old pattern only matched 2 segments. Switch to a tolerant pattern
that matches any path before .epaas.json (anchored at .epaas.json end).
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.
New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
categories + providers (BMW: 1673 words reconstructed vs. 603 noise)
dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
payload, prefer the CMP-reconstructed text. This bypasses the empty
'.cookie-policy' div problem entirely.