feat(vvt): recipient-type classification + 3-section VVT table

Per user request: BMW (and others) put their own services AND external
vendors in the same cookie-policy widget. The VVT table (record of
processing activities) now groups them by Art. 30(1)(d) GDPR (DSGVO)
recipient category so the data protection officer (DSB) can act on
the right buckets:

  - INTERNAL      — owner processing for itself ('BMW AG — XYZ')
  - GROUP_COMPANY — same brand family, different legal entity ('BMW Bank')
  - PROCESSOR     — data processor requiring a DPA/AVV (Adobe, Akamai)
  - CONTROLLER    — independent / joint controller (Meta Pixel, Google
                    Ads, LinkedIn — they run their own profiles)
  - AUTHORITY     — government bodies (rare in cookies)
  - OTHER         — fallback

New module vendor_classifier.py:
- owner_from_url(url) — derive site-owner token (bmw.de -> 'BMW',
  mercedes-benz.de -> 'Mercedes-Benz')
- classify(vendor_name, category, owner_name), a strict 5-tier heuristic:
  * INTERNAL: vendor name is '<Owner>' or starts with '<Owner> AG' /
    '<Owner> SE' / '<Owner> GmbH' / '<Owner> AG & Co. KG'
  * GROUP_COMPANY: starts with '<Owner> ' but matches no INTERNAL form
  * CONTROLLER: matches a known joint-controller list (Meta, Google
    Ads, YouTube, LinkedIn Insight, TikTok, Pinterest, Taboola,
    Outbrain, Criteo, Twitter, Reddit, ...)
  * PROCESSOR: legal-form suffix in name (GmbH, AG, Inc., A/S,
    B.V., S.A., Ltd., LLC, ...)
  * OTHER: anything else
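
A minimal sketch of the two helpers described above, for orientation only.
The constant names, the casing rule in owner_from_url (labels of three
characters or fewer are upper-cased) and the prefix matching are
assumptions, not the actual contents of vendor_classifier.py. The
category argument is accepted but unused here because the message does
not say how the real module uses it.

```python
from urllib.parse import urlparse

# Hypothetical constants; the real lists in vendor_classifier.py may differ.
_INTERNAL_SUFFIXES = ("ag & co. kg", "ag", "se", "gmbh")
_KNOWN_CONTROLLERS = (
    "meta", "google ads", "youtube", "linkedin", "tiktok", "pinterest",
    "taboola", "outbrain", "criteo", "twitter", "reddit",
)
_LEGAL_FORMS = {
    "GmbH", "AG", "SE", "Inc.", "Inc", "A/S", "B.V.", "S.A.", "Ltd.", "Ltd", "LLC",
}


def owner_from_url(url: str) -> str:
    """Derive an owner token from the second-level domain label."""
    # urlparse only fills hostname when a scheme or leading '//' is present.
    host = urlparse(url if "://" in url else f"//{url}").hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return ""
    label = labels[-2]  # naive: does not handle suffixes such as .co.uk
    if len(label) <= 3:
        return label.upper()  # assumed rule: short labels are acronyms
    return "-".join(part.capitalize() for part in label.split("-"))


def _starts_with_phrase(text: str, phrase: str) -> bool:
    """True if text equals phrase or continues after it with a space."""
    return text == phrase or text.startswith(phrase + " ")


def classify(vendor_name: str, category: str = "", owner_name: str = "") -> str:
    """Return a recipient type; tiers are checked in the order listed above."""
    name = " ".join(vendor_name.split())  # collapse whitespace
    lowered = name.lower()
    owner = owner_name.strip().lower()
    if owner:
        if lowered == owner or any(
            _starts_with_phrase(lowered, f"{owner} {suffix}")
            for suffix in _INTERNAL_SUFFIXES
        ):
            return "INTERNAL"
        if lowered.startswith(owner + " "):
            return "GROUP_COMPANY"
    if any(_starts_with_phrase(lowered, c) for c in _KNOWN_CONTROLLERS):
        return "CONTROLLER"
    # Legal-form suffix anywhere in the name marks a likely processor.
    if any(tok in _LEGAL_FORMS for tok in name.replace(",", " ").split()):
        return "PROCESSOR"
    return "OTHER"
```

Checking INTERNAL before GROUP_COMPANY is what keeps 'BMW AG' out of the
group bucket, and checking CONTROLLER before PROCESSOR keeps a name like
'Meta Platforms Ireland Ltd.' from being tagged as a processor.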

vendor_extractor.extract_vendors_from_payloads now takes owner_name:
- Passes it through to classify() for every extracted vendor record
- The route derives owner_name via _company_name_from_url(doc_entries)
- LLM-extracted vendors are classified the same way (so V3 fallback
  also produces tagged records)

agent_doc_check_extras.build_vvt_table_html rewritten:
- Buckets vendors by recipient_type
- Renders one section per non-empty bucket, in canonical order
  (RECIPIENT_TYPE_SECTIONS), each with section header + count + bad
  count + nested table
- Within each section: sorted by compliance_score ascending
- Response JSON cmp_vendors includes recipient_type so the frontend
  can later import per-category into the VVT module
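
The bucketing and ordering described above could look roughly like the
sketch below. The order in RECIPIENT_TYPE_SECTIONS, the helper name
bucket_vendors, and the default score of 0 for missing compliance_score
values are assumptions; HTML rendering is left out.

```python
from collections import defaultdict

# Assumed canonical order; the real RECIPIENT_TYPE_SECTIONS may differ.
RECIPIENT_TYPE_SECTIONS = [
    "INTERNAL", "GROUP_COMPANY", "PROCESSOR", "CONTROLLER", "AUTHORITY", "OTHER",
]


def bucket_vendors(vendors: list[dict]) -> list[tuple[str, list[dict]]]:
    """Group vendors by recipient_type and sort each group, worst score first."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for vendor in vendors:
        rtype = vendor.get("recipient_type") or "OTHER"
        if rtype not in RECIPIENT_TYPE_SECTIONS:
            rtype = "OTHER"  # unknown tags fall back to OTHER
        buckets[rtype].append(vendor)
    # Emit only non-empty buckets, in canonical order.
    return [
        (rtype, sorted(buckets[rtype], key=lambda v: v.get("compliance_score", 0)))
        for rtype in RECIPIENT_TYPE_SECTIONS
        if rtype in buckets
    ]
```

Sorting ascending puts the lowest-scoring vendors at the top of each
section, so the rows that need attention are seen first.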

Expected BMW result: ~60 INTERNAL rows (BMW AG own services),
~25 PROCESSOR rows (Adobe, Adform, Akamai, AWS, ...), ~5 CONTROLLER
rows (Meta Pixel, Google, LinkedIn, Pinterest, Outbrain, Taboola).
Benjamin Admin
2026-05-17 12:31:49 +02:00
parent 6c7d4c7552
commit fab1e35847
4 changed files with 272 additions and 56 deletions
@@ -42,11 +42,18 @@ def _clean(s: object) -> str:
     return _WS_RE.sub(" ", no_tags).strip()


-def extract_vendors_from_payloads(payloads: list[dict]) -> list[dict]:
+def extract_vendors_from_payloads(
+    payloads: list[dict],
+    owner_name: str = "",
+) -> list[dict]:
     """Walk every captured CMP payload, dispatch to per-CMP extractor.

     Deduplicates vendors across payloads by name (preserves richer record).
+    Tags each vendor with `recipient_type` (Art. 30(1)(d) DSGVO) using
+    the owner_name to detect INTERNAL processing.
     """
+    from compliance.services.vendor_classifier import classify
+
     all_vendors: dict[str, dict] = {}
     for payload in payloads or []:
         kind = payload.get("kind", "")
@@ -76,9 +83,13 @@ def extract_vendors_from_payloads(payloads: list[dict]) -> list[dict]:
             name = (v.get("name") or "").strip()
             if not name:
                 continue
+            v["recipient_type"] = classify(
+                vendor_name=name,
+                category=v.get("category", ""),
+                owner_name=owner_name,
+            )
             existing = all_vendors.get(name)
             if existing:
                 # Merge cookies + fill empty fields
                 for k, v_val in v.items():
                     if not existing.get(k) and v_val:
                         existing[k] = v_val