feat(vvt): recipient-type classification + 3-section VVT table

Per user request: BMW (and others) list their own services and external
vendors in the same cookie-policy widget. The VVT table (record of
processing activities) now groups them by Art. 30(1)(d) DSGVO recipient
category so the DSB (data protection officer) can act on the right
buckets:

  - INTERNAL: the owner processing for itself ('BMW AG — XYZ')
  - GROUP_COMPANY: same brand family, different legal entity ('BMW Bank')
  - PROCESSOR: processor (Auftragsverarbeiter) that requires a data
    processing agreement (AVV), e.g. Adobe, Akamai
  - CONTROLLER: independent or joint controller (Meta Pixel, Google Ads,
    LinkedIn; they run their own profiles)
  - AUTHORITY: government bodies (rare in cookies)
  - OTHER: fallback
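
For orientation, the six buckets above could be modelled as a string enum.
This is only a sketch: the class name RecipientType and the constant
DISPLAY_ORDER are hypothetical and are not taken from the commit.

```python
from enum import Enum


class RecipientType(str, Enum):
    """Hypothetical enum mirroring the six recipient buckets listed above."""

    INTERNAL = "INTERNAL"            # owner processing for itself
    GROUP_COMPANY = "GROUP_COMPANY"  # same brand family, different legal entity
    PROCESSOR = "PROCESSOR"          # processor, needs a data processing agreement
    CONTROLLER = "CONTROLLER"        # independent or joint controller
    AUTHORITY = "AUTHORITY"          # government bodies
    OTHER = "OTHER"                  # fallback


# Assumed display order: the order in which the buckets are listed above.
DISPLAY_ORDER = list(RecipientType)
```

Because the enum subclasses str, its members compare equal to the plain
strings stored in a vendor record, so records stay JSON-serializable.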

New module vendor_classifier.py:

  - owner_from_url(url): derives the site-owner token (bmw.de -> 'BMW',
    mercedes-benz.de -> 'Mercedes-Benz')
  - classify(vendor_name, category, owner_name): strict 5-tier heuristic:
      * INTERNAL: the vendor name's leading token is '<Owner>',
        '<Owner> AG', '<Owner> SE', '<Owner> GmbH' or
        '<Owner> AG & Co. KG'
      * GROUP_COMPANY: starts with '<Owner> ' but is not one of the
        INTERNAL forms such as '<Owner> AG'
      * CONTROLLER: matches a known joint-controller list (Meta, Google
        Ads, YouTube, LinkedIn Insight, TikTok, Pinterest, Taboola,
        Outbrain, Criteo, Twitter, Reddit, ...)
      * PROCESSOR: legal-form suffix in the name (GmbH, AG, Inc., A/S,
        B.V., S.A., Ltd., LLC, ...)
      * OTHER: anything else
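
The tiers above can be sketched roughly as follows. This is not the code
from vendor_classifier.py: the constant names, the shortened lists, the
first-match-wins ordering and the capitalisation rule in owner_from_url are
all assumptions made for illustration. The category parameter is kept only
to mirror the signature and is unused here.

```python
from urllib.parse import urlparse

# Assumed, abbreviated lists; the real module may use different ones.
OWNER_LEGAL_FORMS = ("AG & Co. KG", "AG", "SE", "GmbH")
KNOWN_CONTROLLERS = ("meta", "google ads", "youtube", "linkedin insight",
                     "tiktok", "pinterest", "taboola", "outbrain",
                     "criteo", "twitter", "reddit")
PROCESSOR_FORMS = {"GmbH", "AG", "Inc.", "A/S", "B.V.", "S.A.", "Ltd.", "LLC"}


def owner_from_url(url: str) -> str:
    """Derive an owner token from the label in front of the TLD (heuristic)."""
    host = urlparse(url if "://" in url else "//" + url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return ""
    label = labels[-2]  # naive: does not handle suffixes like .co.uk
    if len(label) <= 3:
        return label.upper()  # short labels are treated as acronyms
    return "-".join(part.capitalize() for part in label.split("-"))


def classify(vendor_name: str, category: str, owner_name: str) -> str:
    """Return a recipient type; the first matching tier wins (assumption)."""
    name = vendor_name.strip()
    lowered = name.lower()
    owner = owner_name.strip().lower()
    if owner and (lowered == owner or lowered.startswith(owner + " ")):
        rest = name[len(owner):].strip()
        if not rest or any(rest == form or rest.startswith(form + " ")
                           for form in OWNER_LEGAL_FORMS):
            return "INTERNAL"
        return "GROUP_COMPANY"
    if any(lowered == k or lowered.startswith(k + " ")
           for k in KNOWN_CONTROLLERS):
        return "CONTROLLER"
    if any(token.strip(",") in PROCESSOR_FORMS for token in name.split()):
        return "PROCESSOR"
    return "OTHER"
```

Checking CONTROLLER before PROCESSOR matters in this sketch: a name such as
'Pinterest Inc.' carries a legal-form suffix but should still land in the
controller bucket.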

vendor_extractor.extract_vendors_from_payloads now takes owner_name:

  - Passes it through to classify() for every extracted vendor record
  - The route derives owner_name via _company_name_from_url(doc_entries)
  - LLM-extracted vendors are classified the same way, directly in the
    route, so the V3 fallback also produces tagged records

agent_doc_check_extras.build_vvt_table_html rewritten:

  - Buckets vendors by recipient_type
  - Renders one section per non-empty bucket, in canonical order
    (RECIPIENT_TYPE_SECTIONS), each with a section header, vendor count,
    bad count and a nested table
  - Within each section, rows are sorted by compliance_score ascending
  - The response JSON field cmp_vendors includes recipient_type so the
    frontend can later import per category into the VVT module
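
The bucketing and ordering described above might look like the sketch below.
It leaves out the HTML rendering. SECTION_ORDER is a hypothetical stand-in
for RECIPIENT_TYPE_SECTIONS, whose actual contents are not shown in this
commit, and the defaults for missing keys are assumptions.

```python
from collections import defaultdict

# Hypothetical stand-in for RECIPIENT_TYPE_SECTIONS (assumed order).
SECTION_ORDER = ["INTERNAL", "GROUP_COMPANY", "PROCESSOR",
                 "CONTROLLER", "AUTHORITY", "OTHER"]


def bucket_vendors(vendors):
    """Group vendor dicts by recipient_type and sort each group by score."""
    buckets = defaultdict(list)
    for vendor in vendors:
        rtype = vendor.get("recipient_type", "OTHER")
        if rtype not in SECTION_ORDER:
            rtype = "OTHER"  # unknown types fall back to OTHER
        buckets[rtype].append(vendor)

    sections = []
    for rtype in SECTION_ORDER:
        rows = buckets.get(rtype)
        if not rows:
            continue  # empty buckets produce no section
        rows = sorted(rows, key=lambda v: v.get("compliance_score", 0))
        sections.append((rtype, rows))
    return sections


sample = [
    {"name": "Adobe", "recipient_type": "PROCESSOR", "compliance_score": 80},
    {"name": "BMW AG", "recipient_type": "INTERNAL", "compliance_score": 90},
    {"name": "Akamai", "recipient_type": "PROCESSOR", "compliance_score": 40},
    {"name": "Unknown"},
]
sections = bucket_vendors(sample)
```

With the sample data this yields three sections (INTERNAL, PROCESSOR,
OTHER), and Akamai precedes Adobe inside PROCESSOR because of its lower
score.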

Expected BMW result: ~60 INTERNAL rows (BMW AG's own services),
~25 PROCESSOR rows (Adobe, Adform, Akamai, AWS, ...), and ~5 CONTROLLER
rows (Meta Pixel, Google, LinkedIn, Pinterest, Outbrain, Taboola).

@@ -390,8 +390,13 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
             cookie_payloads.extend(e["cmp_payloads"])
         if e.get("text"):
             cookie_text = e["text"]
+    # Site-owner derived from the submitted URLs — drives the
+    # INTERNAL/GROUP_COMPANY classification of vendor records.
+    owner_name = _company_name_from_url(doc_entries) or ""
     if cookie_payloads:
-        cmp_vendors = extract_vendors_from_payloads(cookie_payloads)
+        cmp_vendors = extract_vendors_from_payloads(
+            cookie_payloads, owner_name=owner_name,
+        )
     # V3 fallback: no named CMP captured but we have substantive
     # cookie text → ask Qwen/OVH to extract vendor list from the text.
     # Skip on very short text (likely navigation) to save LLM cost.
@@ -399,8 +404,17 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
         from compliance.services.vendor_llm_extractor import (
             extract_vendors_via_llm,
         )
+        from compliance.services.vendor_classifier import classify
         _update(check_id, "Vendor-Liste per LLM extrahieren...", 94)
         cmp_vendors = await extract_vendors_via_llm(cookie_text)
+        # LLM path doesn't run through extract_vendors_from_payloads,
+        # so classify here.
+        for v in cmp_vendors:
+            v["recipient_type"] = classify(
+                vendor_name=v.get("name", ""),
+                category=v.get("category", ""),
+                owner_name=owner_name,
+            )
     if cmp_vendors:
         logger.info("VVT: %d vendors extracted, validating links",
                     len(cmp_vendors))