Two bugs observed in BMW BMW test run:
1. Generic JSON heuristic captured /de-de/login/bmw/api/flyout/data (4KB,
user login fly-out data) and reconstruct_generic produced 56 words of
noise. The CMP-prefer logic then 'replaced' the 185-word imprint DOM
extraction with those 56 words because self_wc(185) < 300 — even
though cmp_wc(56) < self_wc(185).
2. The strict prefilter list was too short. Login/auth/cart endpoints
often have category-shaped JSON without being cookie policies.
Fixes:
- dsi_discovery: replace DOM with CMP only when cmp_wc > self_wc AND
meets one of the existing conditions. Tiny captures can no longer
silently destroy a bigger DOM extraction.
- cmp_extractor: skip non-cookie URLs (/login, /auth, /user, /session,
/cart, /checkout, /search, /flyout, /menu, /nav, /translation, /i18n,
/locale, /feature-flag).
- cmp_extractor: require ≥5KB payload size — real CMP policies are
always larger (BMW ePaaS is ~393KB). Tiny matches drop out before
reconstruction.
cmp_extractor.py refactored to thin coordinator (123 LOC, was 223).
Discovers all CMP modules via cmp_library/_registry.py:load_all() at
import time. Restart consent-tester to pick up new modules.
New cmp_library/ folder:
- _registry.py: auto-discovers all modules with MATCHER + reconstruct()
- epaas.py: BMW Group ePaaS (extracted from cmp_extractor)
- onetrust.py: cdn.cookielaw.org Groups/Cookies schema
- cookiebot.py: consent.cookiebot.com Categories schema
- usercentrics.py: api.usercentrics.eu services schema
- didomi.py: sdk.privacy-center.org notice + vendors + purposes
- trustarc.py: consent.trustarc.com categories + vendors
Each module:
- MATCHER: re.Pattern matching the CMP JSON endpoint URL
- reconstruct(d: dict) -> str: builds German Markdown cookie-policy text
Phase E (self-improving) will write auto_*.py files into the same folder;
_registry already picks those up via pkgutil.iter_modules.
New module cmp_heuristic.py with:
- looks_like_cookie_policy(data): shape-based classifier (top-level keys
cookies/categories/providers/vendors/purposes/cookieList/etc. + at
least 2 name+description objects, or IAB TCF v2 vendors[]+purposes[])
- reconstruct_generic(data): walks JSON, extracts name + description
fields + standalone prologue/dataController/persistence fields,
emits flat German Markdown text (max 5000 words, dedup)
cmp_extractor.py wired so that AFTER named CMP matchers (epaas,
onetrust) fail, every JSON response on the page is tested for the
heuristic. If matched, payload is captured as '_heuristic' kind and
reconstructed via the generic walker.
This is Phase A of the 4-stage cascade (B-D follow). Unknown CMPs that
return JSON now work without hand-coding each one.
Pre-filter: skips response paths /api/config, /beacon, /track,
/analytics, /fonts/, /log/, /heartbeat/, /.well-known/ to avoid
spamming the heuristic on every Playwright load.
BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json:
/epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json
The old pattern only matched 2 segments. Switch to a tolerant pattern
that matches any path before .epaas.json (anchored at .epaas.json end).
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.
New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
categories + providers (BMW: 1673 words reconstructed vs. 603 noise)
dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
payload, prefer the CMP-reconstructed text. This bypasses the empty
'.cookie-policy' div problem entirely.