fix(cmp): stricter heuristic + only replace DOM when CMP is strictly larger
Two bugs were observed in the BMW test run:

1. The generic JSON heuristic captured /de-de/login/bmw/api/flyout/data
   (4KB, user login fly-out data) and reconstruct_generic produced 56 words
   of noise. The CMP-prefer logic then 'replaced' the 185-word imprint DOM
   extraction with those 56 words because self_wc(185) < 300, even though
   cmp_wc(56) < self_wc(185).

2. The strict prefilter list was too short. Login/auth/cart endpoints often
   return category-shaped JSON without being cookie policies.

Fixes:

- dsi_discovery: replace DOM with CMP only when cmp_wc > self_wc AND at
  least one of the existing conditions is met. Tiny captures can no longer
  silently destroy a bigger DOM extraction.
- cmp_extractor: skip non-cookie URLs (/login, /auth, /user, /session,
  /cart, /checkout, /search, /flyout, /menu, /nav, /translation, /i18n,
  /locale, /feature-flag).
- cmp_extractor: require ≥5KB payload size. Real CMP policies are always
  larger (BMW ePaaS is ~393KB), so tiny matches drop out before
  reconstruction.
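The two cmp_extractor prefilters described above could look roughly like the sketch below. It is an illustration only, not the code from this commit: the names `NON_COOKIE_PATH_FRAGMENTS`, `MIN_PAYLOAD_BYTES`, and `is_candidate_cmp_payload` are hypothetical, substring matching on the URL path is an assumption, and 5KB is taken to mean 5 * 1024 bytes.

```python
from urllib.parse import urlparse

# Path fragments listed in the commit message. Matching them as plain
# substrings of the URL path is an assumption made for this sketch.
NON_COOKIE_PATH_FRAGMENTS = (
    "/login", "/auth", "/user", "/session", "/cart", "/checkout",
    "/search", "/flyout", "/menu", "/nav", "/translation", "/i18n",
    "/locale", "/feature-flag",
)

# "5KB" interpreted as 5 * 1024 bytes (assumption).
MIN_PAYLOAD_BYTES = 5 * 1024


def is_candidate_cmp_payload(url: str, body: bytes) -> bool:
    """Return True if a captured response is worth reconstructing."""
    path = urlparse(url).path.lower()
    # Skip endpoints that are clearly not cookie policies.
    if any(fragment in path for fragment in NON_COOKIE_PATH_FRAGMENTS):
        return False
    # Drop tiny payloads before reconstruction.
    return len(body) >= MIN_PAYLOAD_BYTES
```

With these assumptions, the 4KB fly-out response from the BMW run is rejected twice over: its path contains both /login and /flyout, and it is below the size threshold.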
@@ -349,7 +349,13 @@ async def discover_dsi_documents(
     if cmp_capture.payloads:
         cmp_text = cmp_capture.reconstruct_cookie_policy()
         cmp_wc = len(cmp_text.split()) if cmp_text else 0
-        if cmp_wc > 0 and (
+        # Replace DOM with CMP only when CMP is *strictly larger*
+        # AND meets at least one of: DOM was very thin, CMP is
+        # substantial, or CMP is significantly longer than DOM.
+        # The strict-larger guard prevents a tiny heuristic match
+        # (e.g. an unrelated /api/data JSON) from clobbering a
+        # bigger DOM extraction.
+        if cmp_wc > self_wc and (
             self_wc < 300
             or cmp_wc >= 1000
             or cmp_wc > self_wc * 1.5
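To make the effect of the new guard concrete, the old and new conditions can be written as two small predicates and checked against the word counts from the BMW run. This is a sketch for illustration: the function names `old_should_replace_dom` and `should_replace_dom` are hypothetical and do not appear in dsi_discovery; only the comparisons are taken from the hunk above.

```python
def old_should_replace_dom(self_wc: int, cmp_wc: int) -> bool:
    """Condition before this commit: any non-empty CMP text could win."""
    return cmp_wc > 0 and (
        self_wc < 300
        or cmp_wc >= 1000
        or cmp_wc > self_wc * 1.5
    )


def should_replace_dom(self_wc: int, cmp_wc: int) -> bool:
    """Condition after this commit: CMP must be strictly larger than DOM."""
    return cmp_wc > self_wc and (
        self_wc < 300
        or cmp_wc >= 1000
        or cmp_wc > self_wc * 1.5
    )


# BMW case from the commit message: 185-word DOM vs. 56-word CMP noise.
print(old_should_replace_dom(185, 56))  # True: DOM was clobbered
print(should_replace_dom(185, 56))      # False: DOM is kept
```

Note that for a thin DOM (self_wc < 300) the new rule reduces to cmp_wc > self_wc, so a CMP capture only has to beat the DOM extraction by a single word in that range.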