fix: derive intake flags from DETECTED SERVICES, not from text content
Fundamental architecture fix: data processing happens through APIs, scripts, and cookies, NOT through visible page text. A news site about healthcare does NOT process health data.

Before: Qwen reads website text → guesses "health_data: true" (WRONG)
After:  Google Analytics detected → tracking: true (CORRECT, deterministic)

New flow: detect services from HTML → map service categories to flags → feed flags into UCCA assessment. No LLM needed for flag extraction.

SERVICE_TO_FLAGS maps categories to flags:
  tracking → tracking
  marketing → marketing + third_party_sharing
  payment → payment_data
  heatmap → profiling
  etc.

SPECIFIC_SERVICE_FLAGS adds per-service flags for Klarna (Art. 22), Stripe (US transfer), etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -15,7 +15,7 @@ from fastapi import APIRouter
 from pydantic import BaseModel
 
 from compliance.services.smtp_sender import send_email
-from compliance.services.intake_extractor import extract_intake_flags, flags_to_ucca_intake
+from compliance.services.intake_extractor import extract_intake_flags_from_services, flags_to_ucca_intake
 from compliance.services.relevance_filter import filter_controls
 from compliance.services.website_compliance_checks import (
     check_website_compliance as _check_website_compliance,
@@ -85,10 +85,18 @@ async def analyze_url(req: AnalyzeRequest):
     # Step 2: Classify via SDK LLM
     classification = await _classify(client, text)
 
-    # Step 3: Extract intake flags via LLM (better than keyword matching)
-    intake_flags = await extract_intake_flags(text)
+    # Step 3: Detect services from HTML (deterministic, no LLM needed)
+    from compliance.services.service_registry import SERVICE_REGISTRY
+    detected_services = []
+    html_lower = raw_html.lower()
+    for pattern, meta in SERVICE_REGISTRY.items():
+        if re.search(pattern, html_lower):
+            detected_services.append(meta)
 
-    # Step 4: Assess via UCCA with LLM-extracted flags
+    # Step 4: Derive intake flags from DETECTED SERVICES (not from text!)
+    intake_flags = extract_intake_flags_from_services(detected_services)
+
+    # Step 5: Assess via UCCA with service-derived flags
     assessment = await _assess(client, text, classification, intake_flags)
 
     # Step 5: Determine role
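
Illustrative sketch (not part of this commit): the diff shows the detection loop but not the mapping described in the commit message, so the following shows one way the "detect services → map categories to flags" flow could fit together. Everything here is an assumption made for illustration: the registry entries, the `name`/`category` metadata keys, the flag names `automated_decision` and `us_transfer`, and the `detect_services` helper. The real `service_registry` and `intake_extractor` modules may be shaped differently.

```python
import re

# Hypothetical registry: lowercase regex pattern -> service metadata.
# The metadata keys ("name", "category") are assumed, not taken from the repo.
SERVICE_REGISTRY = {
    r"google-analytics\.com|googletagmanager\.com": {"name": "Google Analytics", "category": "tracking"},
    r"connect\.facebook\.net": {"name": "Meta Pixel", "category": "marketing"},
    r"js\.stripe\.com": {"name": "Stripe", "category": "payment"},
    r"static\.hotjar\.com": {"name": "Hotjar", "category": "heatmap"},
}

# Category -> flags, following the mapping listed in the commit message.
SERVICE_TO_FLAGS = {
    "tracking": ["tracking"],
    "marketing": ["marketing", "third_party_sharing"],
    "payment": ["payment_data"],
    "heatmap": ["profiling"],
}

# Per-service extras; the flag names are placeholders for "Art. 22" and "US transfer".
SPECIFIC_SERVICE_FLAGS = {
    "Klarna": ["automated_decision"],
    "Stripe": ["us_transfer"],
}


def detect_services(raw_html: str) -> list[dict]:
    """Return metadata for every registry pattern found in the HTML."""
    html_lower = raw_html.lower()
    return [
        meta
        for pattern, meta in SERVICE_REGISTRY.items()
        if re.search(pattern, html_lower)
    ]


def extract_intake_flags_from_services(detected_services: list[dict]) -> dict[str, bool]:
    """Derive intake flags from detected services only, never from page text."""
    flags: dict[str, bool] = {}
    for service in detected_services:
        for flag in SERVICE_TO_FLAGS.get(service["category"], []):
            flags[flag] = True
        for flag in SPECIFIC_SERVICE_FLAGS.get(service["name"], []):
            flags[flag] = True
    return flags
```

Under these assumptions, a page that embeds a Google Analytics script yields `tracking`, while a healthcare article with no third-party scripts yields no flags at all, which is the behaviour the commit message argues for. Because the HTML is lowercased before matching, the registry patterns must be written in lowercase.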