New module cmp_heuristic.py with:
- looks_like_cookie_policy(data): shape-based classifier (top-level keys
cookies/categories/providers/vendors/purposes/cookieList/etc. + at
least 2 name+description objects, or IAB TCF v2 vendors[]+purposes[])
- reconstruct_generic(data): walks JSON, extracts name + description
fields + standalone prologue/dataController/persistence fields,
emits flat German Markdown text (max 5000 words, dedup)
cmp_extractor.py wired so that AFTER named CMP matchers (epaas,
onetrust) fail, every JSON response on the page is tested for the
heuristic. If matched, payload is captured as '_heuristic' kind and
reconstructed via the generic walker.
This is Phase A of the 4-stage cascade (B-D follow). Unknown CMPs that
return JSON now work without hand-coding each one.
Pre-filter: skips response paths /api/config, /beacon, /track,
/analytics, /fonts/, /log/, /heartbeat/, /.well-known/ to avoid
spamming the heuristic on every Playwright load.
BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json:
/epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json
The old pattern only matched 2 segments. Switch to a tolerant pattern
that matches any path before .epaas.json (anchored at .epaas.json end).
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.
New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
categories + providers (BMW: 1673 words reconstructed vs. 603 noise)
dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
payload, prefer the CMP-reconstructed text. This bypasses the empty
'.cookie-policy' div problem entirely.