fix(audit): parse_flat_cookie_text for VW-style flat tables

The VW cookie doc delivers the table as FLAT text without column separators:
'IDE Tracking Cookies (Marketing) Beschreibung 13 Monate Permanent
TAID Tracking Cookies (Marketing) ...'

parse_flat_cookie_text matches rows with a regex of the form:
  NAME [Tracking|Session|Funktional|...] Cookies ... [13 Monate|Session|Permanent]
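
A minimal sketch of such a flat-row pattern, for illustration only. It is not the regex shipped in cookies_table_parser: `ROW_RE` and `parse_flat_rows` are hypothetical names, the category list is limited to the three words named above, and the `Tage`/`Jahre` units are assumptions beyond the `13 Monate` seen in the VW text.

```python
import re

# Hypothetical pattern: NAME CATEGORY "Cookies" <free text> DURATION.
# Only the categories named in the commit message are listed; the real
# list is elided there. "Tage" and "Jahre" are assumed extra units.
ROW_RE = re.compile(
    r"\b(?P<name>[A-Za-z_][\w.\-]*)\s+"
    r"(?P<category>Tracking|Session|Funktional)\s+Cookies\b"
    r"(?P<body>.*?)"
    r"(?P<duration>\d+\s+(?:Tage|Monate|Jahre)|Session|Permanent)\b",
    re.DOTALL,
)


def parse_flat_rows(text: str) -> list[dict[str, str]]:
    """Extract one record per row from separator-less cookie text."""
    rows = []
    for m in ROW_RE.finditer(text):
        rows.append({
            "name": m.group("name"),
            "category": m.group("category"),
            # Collapse runs of whitespace left over from the flattening.
            "description": " ".join(m.group("body").split()),
            "duration": " ".join(m.group("duration").split()),
        })
    return rows
```

The lazy `body` group stops at the first duration token, so a description that itself contains the word `Session` or `Permanent` would be cut short. That trade-off is acceptable for a heuristic fallback but worth keeping in mind.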

The backend falls back to parse_flat when parse_cookie_table returns [].
This lets us extract ~30-50 cookies + vendors deterministically from the
65k VW cookie doc, even when the HTML table DOM extract is empty (which
happens when the table is loaded through several append code paths).
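
The fallback order can be sketched as below. This is an illustration under assumptions, not the backend code: `parse_with_fallback` and `merge_new_vendors` are hypothetical helpers, and the merge step is a guess at how new records are deduplicated. Only the 500-character minimum and the lowercased, stripped name key are taken from the change itself.

```python
from typing import Callable

Vendor = dict[str, str]
Parser = Callable[[str], list[Vendor]]


def parse_with_fallback(text: str, parsers: list[Parser],
                        min_len: int = 500) -> list[Vendor]:
    """Return the rows of the first parser that finds anything."""
    if not text or len(text) < min_len:
        return []  # too short to contain a cookie table
    for parser in parsers:
        rows = parser(text)
        if rows:
            return rows
    return []


def merge_new_vendors(existing: list[Vendor],
                      found: list[Vendor]) -> list[Vendor]:
    """Append records whose normalized name is not present yet (assumed merge rule)."""
    seen = {(v.get("name") or "").strip().lower() for v in existing}
    merged = list(existing)
    for vendor in found:
        key = (vendor.get("name") or "").strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(vendor)
    return merged
```

Passing the parsers as an ordered list keeps the separator-based parse first and makes it cheap to append further heuristics later.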

Bonus: _extract_dom_tables helper in dsi_discovery.py prepared for
hooking into all 7 DiscoveredDSI.append sites later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-21 21:24:14 +02:00
parent dfac940272
commit 1451873194
3 changed files with 104 additions and 4 deletions
@@ -862,16 +862,19 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
     except Exception as e:
         logger.warning("html_table parse failed: %s", e)
-    # B: run cookies_table_parser on crawled cookie text too
-    # (not only on user paste). If the crawler preserved
-    # tab/pipe-separated table rows, we parse them
-    # deterministically and merge the vendor records.
+    # B: run cookies_table_parser on crawled cookie text too.
+    # Standard parse first (tab/pipe-separated). If it finds
+    # nothing (no separator), fall back to the flat-pattern parse
+    # for sites like VW that deliver their table as flat text.
     if cookie_text and len(cookie_text) >= 500:
         try:
             from compliance.services.cookies_table_parser import (
                 parse_cookie_table as _parse_ct,
+                parse_flat_cookie_text as _parse_flat,
             )
             crawled_table_vendors = _parse_ct(cookie_text)
+            if not crawled_table_vendors:
+                crawled_table_vendors = _parse_flat(cookie_text)
             if crawled_table_vendors:
                 existing = {(v.get("name") or "").strip().lower()
                             for v in cmp_vendors}