Commit Graph

3 Commits

Author SHA1 Message Date
Benjamin Admin 8283483909 feat(consent-tester): Phase A — generic JSON cookie-policy heuristic
New module cmp_heuristic.py with:
- looks_like_cookie_policy(data): shape-based classifier (top-level keys
  cookies/categories/providers/vendors/purposes/cookieList/etc. + at
  least 2 name+description objects, or IAB TCF v2 vendors[]+purposes[])
- reconstruct_generic(data): walks JSON, extracts name + description
  fields + standalone prologue/dataController/persistence fields,
  emits flat German Markdown text (max 5000 words, dedup)

cmp_extractor.py wired so that AFTER named CMP matchers (epaas,
onetrust) fail, every JSON response on the page is tested for the
heuristic. If matched, payload is captured as '_heuristic' kind and
reconstructed via the generic walker.

This is Phase A of the 4-stage cascade (B-D follow). Unknown CMPs that
return JSON now work without hand-coding each one.

Pre-filter: skips response paths /api/config, /beacon, /track,
/analytics, /fonts/, /log/, /heartbeat/, /.well-known/ to avoid
spamming the heuristic on every Playwright load.
2026-05-16 22:56:20 +02:00
Benjamin Admin 938f9a6c51 fix(cmp): tolerate variable URL segments in ePaaS policy pattern
BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json:
  /epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json
The old pattern only matched 2 segments. Switch to a tolerant pattern
that matches any path before .epaas.json (anchored at .epaas.json end).
2026-05-16 20:58:48 +02:00
Benjamin Admin 1792c6f896 fix(consent-tester): capture CMP JSON to extract dynamically-loaded cookie policies
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.

New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
  categories + providers (BMW: 1673 words reconstructed vs. 603 noise)

dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
  payload, prefer the CMP-reconstructed text. This bypasses the empty
  '.cookie-policy' div problem entirely.
2026-05-16 20:50:15 +02:00