fix(cookie-inventory): fuzzy prefix-match + BMW ground-truth file

The BMW mail showed 738 declared / 31 browser / **0 OK**: every
browser cookie ended up as UNDOC and every declared cookie as ORPH.
Cause: exact string matching fails for cookies with a runtime suffix.

_norm_for_match() + _matches() helpers:
  - Strip wildcards (`*`, `.*`, `<id>`, `{var}`) and lower-case the name
  - Keep leading underscores (`__cf_bm`, `_ga` are meaningful)
  - Prefix match in BOTH directions, min. 3 chars (no "_" garbage)
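
The helper rules above can be sketched roughly as follows. This is a minimal illustration, not the committed code: `normalise_cookie_name`, `names_match` and `MIN_PREFIX_LEN` are hypothetical names, and the wildcard handling is simplified (all trailing dots are stripped).

```python
import re

MIN_PREFIX_LEN = 3  # assumption: mirrors the "min. 3 chars" rule


def normalise_cookie_name(name: str) -> str:
    """Lower-case a cookie name and drop wildcard placeholders (hypothetical helper)."""
    out = name.strip().lower()
    out = re.sub(r"<[^>]+>", "", out)    # "<id>"-style placeholders
    out = re.sub(r"\{[^}]+\}", "", out)  # "{var}"-style placeholders
    out = out.replace("*", "")           # "*" wildcards
    out = out.rstrip(".")                # dot left over from ".*"
    return out                           # leading underscores stay untouched


def names_match(declared: str, browser: str) -> bool:
    """Exact match, or the shorter name is a prefix (>= 3 chars) of the longer one."""
    if not declared or not browser:
        return False
    if declared == browser:
        return True
    shorter, longer = sorted((declared, browser), key=len)
    return len(shorter) >= MIN_PREFIX_LEN and longer.startswith(shorter)
```

Sorting the pair by length covers both prefix directions with a single `startswith` call; the length guard keeps a declared "_" from matching every cookie.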

build_cookie_inventory():
  - For each browser cookie: pick the longest prefix match among the
    declared names
  - browser-to-decl index + decl-match index, built in one nested
    O(N×M) pass, so the row loop only needs O(1) lookups
  - matched browser keys are removed from all_keys → no double count
    (previously: ORPH + UNDOC side by side)

Realistic BMW match test:
  declared=[_ga, _gid, __cf_bm, AMP_TOKEN, _fbp, intercom-session,
            _pk_id.*, OptanonConsent]
  browser= [_ga_K8YL3M9T, _gid_xyz, __cf_bm_actual_hash,
            AMP_TOKEN_runtime, _fbp_123, intercom-session-2026,
            _pk_id.5.7d8, OptanonConsent]
  → 8 OK (previously 0)

BMW ground-truth file (zeroclaw/docs/ground-truth/bmw_de_2026-06-07.json):
  - OneTrust CMP + 14 expected vendors
  - Cookie count ranges (browser 80-250, declared 300-800)
  - 7 expected findings, including the new COOKIE-INVENTORY-MATCH-001 as
    a benchmark against the fuzzy-match bug

Tests: 14/14 green (4 _norm_for_match + 5 _matches + 5
build_cookie_inventory, including realistic_bmw_pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-06-07 21:29:21 +02:00
parent b16130369a
commit 0b29d1fada
3 changed files with 289 additions and 2 deletions
@@ -40,6 +40,54 @@ def _norm(s: str | None) -> str:
     return (s or "").strip().lower()
 
 
+def _norm_for_match(s: str) -> str:
+    """Normalised name for fuzzy matching.
+
+    Common patterns in DSE-tables: wildcards (`_ga*`, `_ga.*`, `_pk_id.*`,
+    `<random>`), trailing dots, brackets. Browser cookies often have
+    a runtime suffix (`_ga_K8YL3M9T`, `__cf_bm_session_hash`). We strip
+    trailing wildcards / suffix-noise so the prefix-match below works.
+
+    IMPORTANT: leading `_`/`__` are MEANINGFUL (`__cf_bm`, `_ga`) and
+    must NOT be stripped.
+    """
+    out = _norm(s)
+    out = out.replace("*", "").replace("", "")
+    out = re.sub(r"\.\*$", "", out)
+    out = re.sub(r"\.\$?$", "", out)
+    out = re.sub(r"<[^>]+>", "", out)
+    out = re.sub(r"\{[^}]+\}", "", out)
+    return out.strip()
+
+
+def _matches(decl_key: str, browser_key: str) -> bool:
+    """Fuzzy match between a declared cookie name and a browser cookie.
+
+    Rules (in priority order):
+      1. exact match after normalisation
+      2. declared is a PREFIX of browser (declared "_ga" matches
+         browser "_ga_k8yl3m9t")
+      3. browser is a PREFIX of declared (rare: declared has a
+         specific variant, browser only generic, e.g. declared
+         "__cf_bm_session" with browser "__cf_bm")
+    """
+    if not decl_key or not browser_key:
+        return False
+    if decl_key == browser_key:
+        return True
+    # Only allow prefix-match for prefixes ≥ 3 chars to avoid garbage
+    # (e.g. declared "_" matching everything).
+    if len(decl_key) >= 3 and browser_key.startswith(decl_key):
+        return True
+    if len(browser_key) >= 3 and decl_key.startswith(browser_key):
+        return True
+    return False
+
+
+# Need re-import for the regex use above
+import re  # noqa: E402
+
+
 def _missing(value: str | None) -> bool:
     if value is None:
         return True
@@ -190,7 +238,30 @@ def build_cookie_inventory(state: dict) -> tuple[list[dict], dict]:
         for c in (cookie_audit.get("compliant") or [])
     }
-    all_keys = set(declared.keys()) | set(browser.keys())
+
+    # Build fuzzy-match-Index: declared-key (normalised) → list of
+    # browser-keys that match. Browser-key only matches ONE declared
+    # entry (the longest prefix match wins) so we don't double-count.
+    decl_match_index: dict[str, list[str]] = {k: [] for k in declared}
+    browser_to_decl: dict[str, str] = {}
+    for bkey in browser:
+        bnorm = _norm_for_match(bkey)
+        best = ""
+        best_len = -1
+        for dkey in declared:
+            dnorm = _norm_for_match(dkey)
+            if _matches(dnorm, bnorm) and len(dnorm) > best_len:
+                best = dkey
+                best_len = len(dnorm)
+        if best:
+            decl_match_index[best].append(bkey)
+            browser_to_decl[bkey] = best
+
+    # all_keys = declared + browser, but browser-keys that fuzzy-match
+    # an existing declared entry are FOLDED into the declared row
+    # (avoid double-counting them as both ORPH and UNDOC).
+    matched_browser_keys = set(browser_to_decl.keys())
+    all_keys = (set(declared.keys())
+                | (set(browser.keys()) - matched_browser_keys))
 
     rows: list[dict] = []
     for key in sorted(all_keys):
         d = declared.get(key) or {}
@@ -200,7 +271,7 @@ def build_cookie_inventory(state: dict) -> tuple[list[dict], dict]:
                   or b.get("domain") or "").strip() or ""
         country = d.get("country", "")
         country_display, is_third, adq = _country_third(country)
-        in_browser = key in browser
+        in_browser = (key in browser) or bool(decl_match_index.get(key))
         is_declared = key in declared
         status, sev = _build_status(
             is_declared, in_browser, undeclared_set, compliant_set, key,