fix(cmp): stricter heuristic + only replace DOM when CMP is strictly larger
Two bugs observed in the BMW test run:

1. The generic JSON heuristic captured /de-de/login/bmw/api/flyout/data
   (4KB, user login fly-out data) and reconstruct_generic produced 56
   words of noise. The CMP-prefer logic then 'replaced' the 185-word
   imprint DOM extraction with those 56 words because self_wc(185) < 300,
   even though cmp_wc(56) < self_wc(185).
2. The strict prefilter list was too short. Login/auth/cart endpoints
   often return category-shaped JSON without being cookie policies.

Fixes:

- dsi_discovery: replace DOM with CMP only when cmp_wc > self_wc AND at
  least one of the existing conditions holds. Tiny captures can no longer
  silently destroy a bigger DOM extraction.
- cmp_extractor: skip non-cookie URLs (/login, /auth, /user, /session,
  /cart, /checkout, /search, /recommendation, /flyout, /menu, /nav,
  /translation, /i18n, /locale, /feature-flag).
- cmp_extractor: require ≥5KB payload size. Real CMP policies are always
  larger (BMW ePaaS is ~393KB), so tiny matches drop out before
  reconstruction.
@@ -71,18 +71,29 @@ class CMPCapture:
                 "/api/config", "/beacon", "/track", "/analytics",
                 "/fonts/", "/log/", "/heartbeat", "/.well-known/",
                 "/intake/", "/collect", "/ping", "/metrics",
+                "/login", "/auth", "/user", "/session", "/cart", "/checkout",
+                "/search", "/recommendation", "/flyout", "/menu", "/nav",
+                "/translation", "/i18n", "/locale", "/feature-flag",
             )):
                 return
 
             data = await _parse_json_response(response)
             if data is None:
                 return
+            # Skip tiny payloads — real CMP cookie policies are ≥5KB.
+            # A 4KB JSON of cookie-shaped data is almost never the policy.
+            try:
+                size_kb = len(json.dumps(data)) // 1024
+            except Exception:
+                size_kb = 0
+            if size_kb < 5:
+                return
             from services.cmp_heuristic import looks_like_cookie_policy
             if looks_like_cookie_policy(data):
                 self.payloads.append(("_heuristic", data))
                 logger.info(
                     "CMP captured: _heuristic (%s, ~%dKB)",
-                    url[:120], len(json.dumps(data)) // 1024,
+                    url[:120], size_kb,
                 )
         except Exception as e:
             logger.debug("CMP listener error: %s", e)
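The two gates in the hunk above (URL prefilter, then minimum payload size) can be read as one small predicate. The following is a minimal sketch, not the project's code: the names `NON_COOKIE_URL_PARTS`, `MIN_PAYLOAD_KB` and `passes_prefilter` are hypothetical, the tuple holds only the fragments added by this commit, and plain substring matching on the URL is an assumption because the line that opens the check is outside the hunk.

```python
import json

# Path fragments added by this commit (hypothetical constant name).
NON_COOKIE_URL_PARTS = (
    "/login", "/auth", "/user", "/session", "/cart", "/checkout",
    "/search", "/recommendation", "/flyout", "/menu", "/nav",
    "/translation", "/i18n", "/locale", "/feature-flag",
)
MIN_PAYLOAD_KB = 5  # threshold stated in the commit message


def passes_prefilter(url: str, data: object) -> bool:
    """Return True if a parsed JSON response should reach the cookie-policy heuristic."""
    # Gate 1: drop URLs that point at login, cart, navigation and similar endpoints.
    if any(part in url for part in NON_COOKIE_URL_PARTS):
        return False
    # Gate 2: drop payloads whose serialized size is below the threshold.
    try:
        size_kb = len(json.dumps(data)) // 1024
    except (TypeError, ValueError):
        # Not serializable: treat as size 0, which fails the gate.
        size_kb = 0
    return size_kb >= MIN_PAYLOAD_KB
```

Under these assumptions, the fly-out URL from the commit message is rejected by the first gate regardless of size, and a roughly 4KB payload on any other URL is rejected by the second. Note that the integer division rounds down, so anything under 5120 serialized characters counts as less than 5KB.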
@@ -349,7 +349,13 @@ async def discover_dsi_documents(
         if cmp_capture.payloads:
             cmp_text = cmp_capture.reconstruct_cookie_policy()
             cmp_wc = len(cmp_text.split()) if cmp_text else 0
-            if cmp_wc > 0 and (
+            # Replace DOM with CMP only when CMP is *strictly larger*
+            # AND meets at least one of: DOM was very thin, CMP is
+            # substantial, or CMP is significantly longer than DOM.
+            # The strict-larger guard prevents a tiny heuristic match
+            # (e.g. an unrelated /api/data JSON) from clobbering a
+            # bigger DOM extraction.
+            if cmp_wc > self_wc and (
                 self_wc < 300
                 or cmp_wc >= 1000
                 or cmp_wc > self_wc * 1.5
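To see how the changed condition affects the BMW case from the commit message, the old and new rules can be compared side by side. This is a sketch under assumptions: the function names are hypothetical, and since the hunk is cut off after the third clause, the condition is assumed to close right there with nothing further attached.

```python
def old_should_replace(self_wc: int, cmp_wc: int) -> bool:
    """Pre-fix rule: any non-empty CMP text could win if one clause held."""
    return cmp_wc > 0 and (
        self_wc < 300
        or cmp_wc >= 1000
        or cmp_wc > self_wc * 1.5
    )


def new_should_replace(self_wc: int, cmp_wc: int) -> bool:
    """Post-fix rule: CMP text must also be strictly longer than the DOM text."""
    return cmp_wc > self_wc and (
        self_wc < 300
        or cmp_wc >= 1000
        or cmp_wc > self_wc * 1.5
    )


# BMW case: 185-word DOM extraction vs. 56 words of reconstructed noise.
print(old_should_replace(185, 56))  # True: the thin-DOM clause fired on its own
print(new_should_replace(185, 56))  # False: 56 is not greater than 185
```

Because word counts are never negative, `cmp_wc > self_wc` already implies `cmp_wc > 0`, so the new guard is a strict tightening of the old one rather than a different rule.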