fix: derive intake flags from DETECTED SERVICES, not from text content

Fundamental architecture fix: data processing happens through APIs/scripts/cookies, NOT through visible page text. A news site about healthcare does NOT process health data.

Before: Qwen reads website text → guesses "health_data: true" (WRONG)
After: Google Analytics detected → tracking: true (CORRECT, deterministic)

New flow: detect services from HTML → map service categories to flags → feed flags into UCCA assessment. No LLM needed for flag extraction.

SERVICE_TO_FLAGS maps categories: tracking→tracking, marketing→marketing+third_party_sharing, payment→payment_data, heatmap→profiling, etc.
SPECIFIC_SERVICE_FLAGS for Klarna (Art.22), Stripe (US transfer), etc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
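
The flow described in the commit message (detect services from HTML, then map service categories to flags) could look roughly like the sketch below. It is illustrative only, not the code in this commit. The registry patterns, the `name`/`category` metadata keys, the helper names `detect_services` and `flags_from_services`, and the flag names `automated_decision` and `third_country_transfer` are assumptions. Only the category-to-flag pairs in `SERVICE_TO_FLAGS` come from the message itself.

```python
import re

# Hypothetical registry: regex pattern -> service metadata.
# Patterns and metadata keys are assumptions made for this sketch.
SERVICE_REGISTRY = {
    r"google-analytics\.com|googletagmanager\.com": {
        "name": "Google Analytics", "category": "tracking"},
    r"connect\.facebook\.net": {"name": "Meta Pixel", "category": "marketing"},
    r"js\.stripe\.com": {"name": "Stripe", "category": "payment"},
    r"static\.hotjar\.com": {"name": "Hotjar", "category": "heatmap"},
}

# Category -> flags, as listed in the commit message.
SERVICE_TO_FLAGS = {
    "tracking": ["tracking"],
    "marketing": ["marketing", "third_party_sharing"],
    "payment": ["payment_data"],
    "heatmap": ["profiling"],
}

# Per-service extras; the flag names here are hypothetical.
SPECIFIC_SERVICE_FLAGS = {
    "Klarna": ["automated_decision"],      # Art. 22 per the commit message
    "Stripe": ["third_country_transfer"],  # US transfer per the commit message
}


def detect_services(raw_html: str) -> list[dict]:
    """Return metadata for every registry pattern found in the raw HTML."""
    return [
        meta
        for pattern, meta in SERVICE_REGISTRY.items()
        if re.search(pattern, raw_html, re.IGNORECASE)
    ]


def flags_from_services(services: list[dict]) -> dict[str, bool]:
    """Derive intake flags from detected services only, never from page text."""
    flags: dict[str, bool] = {}
    for service in services:
        for flag in SERVICE_TO_FLAGS.get(service["category"], []):
            flags[flag] = True
        for flag in SPECIFIC_SERVICE_FLAGS.get(service["name"], []):
            flags[flag] = True
    return flags


# A healthcare news page that only embeds Google Analytics.
news_html = (
    "<html><body><h1>Healthcare news</h1>"
    '<script src="https://www.google-analytics.com/analytics.js"></script>'
    "</body></html>"
)
news_flags = flags_from_services(detect_services(news_html))
```

In this sketch the healthcare page yields only a `tracking` flag, because flags depend on which scripts are embedded and not on the words in the article. Matching with `re.IGNORECASE` against the raw HTML avoids relying on every registry pattern being lowercase.
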
2026-05-02 08:37:51 +02:00
parent 0f3ec9061e
commit c5b22e0c99
2 changed files with 141 additions and 86 deletions
@@ -15,7 +15,7 @@ from fastapi import APIRouter
 from pydantic import BaseModel

 from compliance.services.smtp_sender import send_email
-from compliance.services.intake_extractor import extract_intake_flags, flags_to_ucca_intake
+from compliance.services.intake_extractor import extract_intake_flags_from_services, flags_to_ucca_intake
 from compliance.services.relevance_filter import filter_controls
 from compliance.services.website_compliance_checks import (
    check_website_compliance as _check_website_compliance,
@@ -85,10 +85,18 @@ async def analyze_url(req: AnalyzeRequest):
        # Step 2: Classify via SDK LLM
        classification = await _classify(client, text)

-        # Step 3: Extract intake flags via LLM (better than keyword matching)
-        intake_flags = await extract_intake_flags(text)
+        # Step 3: Detect services from HTML (deterministic, no LLM needed)
+        from compliance.services.service_registry import SERVICE_REGISTRY
+        detected_services = []
+        html_lower = raw_html.lower()
+        for pattern, meta in SERVICE_REGISTRY.items():
+            if re.search(pattern, html_lower):
+                detected_services.append(meta)

-        # Step 4: Assess via UCCA with LLM-extracted flags
+        # Step 4: Derive intake flags from DETECTED SERVICES (not from text!)
+        intake_flags = extract_intake_flags_from_services(detected_services)
+
+        # Step 5: Assess via UCCA with service-derived flags
        assessment = await _assess(client, text, classification, intake_flags)

        # Step 5: Determine role