docs: add Phase 4 (Website Scan) to Control Relevance Filter plan

Multi-page crawl: scan 5-10 strategic pages (start page, footer links) for chatbot widgets, AI text mentions, and tracking services. Feed the results into the relevance filter to reduce false positives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| Migration | `relevance_conditions` column |
| `control-pipeline/` | Batch seeding job (Phase 3) |
## Phase 4: Website Scan (Multi-Page Crawl)

### Problem

Currently we analyze only ONE URL (e.g. `/datenschutz/`). But relevant indicators
of AI, chatbots, automated decision-making, or tracking may sit on OTHER
pages of the website:

- Chatbot widget on the start page (not on the privacy page)
- "Powered by ChatGPT" in the footer
- AI-powered product recommendations on the shop page
- Cookie scripts that load tracking services (Google Analytics, Meta Pixel, etc.)
- Chatbot providers such as Intercom, Drift, Zendesk, or Tidio in the HTML
### Solution: Lightweight Website Scan

Not a full crawl (too slow, too invasive), but a targeted scan of 5-10
strategic pages:

```
Input: https://www.opodo.de/datenschutz/

Automatically scanned pages:
1. Start page: https://www.opodo.de/
2. Privacy policy (already fetched): https://www.opodo.de/datenschutz/
3. Legal notice: https://www.opodo.de/impressum/ (from footer links)
4. Terms of service: https://www.opodo.de/agb/ (from footer links)
5. Cookie policy: https://www.opodo.de/cookies/ (if present)
```
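Since the input may point at any subpage, the scan first reduces it to the site root before discovering further pages. A minimal sketch with Python's standard library (the helper name `base_url_of` is illustrative, not from the plan):

```python
from urllib.parse import urlparse

def base_url_of(url: str) -> str:
    """Reduce any page URL to the site root (scheme + host)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(base_url_of("https://www.opodo.de/datenschutz/"))
# → https://www.opodo.de/
```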
### Scan logic

**Step 1: Fetch the start page and extract footer links**

```python
# Extract the typical footer links from the start page:
footer_patterns = [
    r'href="([^"]*(?:impressum|imprint|legal)[^"]*)"',
    r'href="([^"]*(?:datenschutz|privacy|dsgvo)[^"]*)"',
    r'href="([^"]*(?:agb|terms|nutzungsbedingungen)[^"]*)"',
    r'href="([^"]*(?:cookie|cookies)[^"]*)"',
    r'href="([^"]*(?:kontakt|contact)[^"]*)"',
]
```
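Applied to the raw HTML, these patterns yield `href` candidates that still need resolving against the base URL. A hedged sketch of that step (the function name `extract_footer_links` is an assumption; two patterns shown for brevity):

```python
import re
from urllib.parse import urljoin

footer_patterns = [
    r'href="([^"]*(?:impressum|imprint|legal)[^"]*)"',
    r'href="([^"]*(?:datenschutz|privacy|dsgvo)[^"]*)"',
]

def extract_footer_links(html: str, base_url: str) -> list[str]:
    """Collect footer-link candidates and resolve them to absolute URLs."""
    links: list[str] = []
    for pattern in footer_patterns:
        for href in re.findall(pattern, html, flags=re.IGNORECASE):
            absolute = urljoin(base_url, href)
            if absolute not in links:  # de-duplicate, keep discovery order
                links.append(absolute)
    return links

html = '<a href="/impressum/">Impressum</a> <a href="/datenschutz/">Datenschutz</a>'
print(extract_footer_links(html, "https://www.opodo.de/"))
# → ['https://www.opodo.de/impressum/', 'https://www.opodo.de/datenschutz/']
```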
**Step 2: Scan each page for AI/chatbot/tracking indicators**

```python
AI_INDICATORS = {
    # Chatbot widgets (JavaScript embeds)
    "chatbot_widgets": [
        r"intercom",          # Intercom (AI-powered)
        r"drift\.com",        # Drift chatbot
        r"tidio",             # Tidio Chat
        r"zendesk",           # Zendesk Chat
        r"crisp\.chat",       # Crisp Chat
        r"livechat",          # LiveChat
        r"hubspot.*chat",     # HubSpot Chat
        r"tawk\.to",          # Tawk.to
        r"freshchat",         # Freshworks
        r"dialogflow",        # Google Dialogflow
        r"watson.*assistant", # IBM Watson
        r"chatgpt|openai",    # OpenAI/ChatGPT
        r"anthropic|claude",  # Anthropic/Claude
    ],
    # AI mentions in the page text
    "ai_text_mentions": [
        r"k(?:ue|ü)nstliche.?intelligenz",
        r"artificial.?intelligence",
        r"machine.?learning",
        r"maschinelles.?lernen",
        r"KI.?gest(?:ue|ü)tzt",
        r"AI.?powered",
        r"algorithm",
        r"automatisierte.?entscheidung",
        r"automated.?decision",
        r"profiling",
        r"personalisier",  # "Personalisierung" and variants
    ],
    # Tracking services
    "tracking_services": [
        r"google.?analytics|gtag|UA-\d+|G-\w+",
        r"facebook.?pixel|fbq\(",
        r"meta.?pixel",
        r"hotjar",
        r"segment\.com",
        r"mixpanel",
        r"amplitude",
        r"matomo|piwik",
        r"plausible",
    ],
}
```
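A page can then be checked against each indicator category with a simple regex pass. A minimal sketch (the helper name `detect_indicators` is an assumption, and the indicator dict is abbreviated for the example):

```python
import re

AI_INDICATORS = {
    "chatbot_widgets": [r"intercom", r"drift\.com"],
    "tracking_services": [r"google.?analytics|gtag", r"hotjar"],
}

def detect_indicators(html: str) -> dict[str, list[str]]:
    """Return, per category, the indicator patterns that match this page."""
    lowered = html.lower()
    hits: dict[str, list[str]] = {}
    for category, patterns in AI_INDICATORS.items():
        matched = [p for p in patterns if re.search(p, lowered)]
        if matched:
            hits[category] = matched
    return hits

page = '<script src="https://widget.intercom.io/x"></script><script>gtag("js")</script>'
print(detect_indicators(page))
# → {'chatbot_widgets': ['intercom'], 'tracking_services': ['google.?analytics|gtag']}
```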
**Step 3: Aggregate the results**

```python
scan_result = {
    "pages_scanned": 5,
    "chatbot_detected": True,        # e.g. Intercom on the start page
    "chatbot_provider": "intercom",  # identified provider
    "ai_mentions_found": False,      # no explicit AI wording
    "tracking_services": ["google_analytics", "facebook_pixel"],
    "tracking_count": 2,
}
```
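Folding the per-page hits into this shape could look like the following sketch (the input shape and the helper name `aggregate` are assumptions; the plan only fixes the output dict):

```python
def aggregate(per_page: dict[str, dict[str, list[str]]]) -> dict:
    """Fold per-page indicator hits into a single scan_result dict."""
    chatbots: list[str] = []
    tracking: list[str] = []
    ai_mentions = False
    for hits in per_page.values():
        chatbots += [p for p in hits.get("chatbot_widgets", []) if p not in chatbots]
        tracking += [p for p in hits.get("tracking_services", []) if p not in tracking]
        ai_mentions = ai_mentions or bool(hits.get("ai_text_mentions"))
    return {
        "pages_scanned": len(per_page),
        "chatbot_detected": bool(chatbots),
        "chatbot_provider": chatbots[0] if chatbots else None,
        "ai_mentions_found": ai_mentions,
        "tracking_services": tracking,
        "tracking_count": len(tracking),
    }

result = aggregate({
    "https://www.opodo.de/": {"tracking_services": ["gtag"]},
    "https://www.opodo.de/datenschutz/": {},
})
print(result["tracking_count"])  # → 1
```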
**Step 4: Feed the scan result into the relevance check**

- Chatbot detected → C_TRANSPARENCY becomes relevant (even without AI wording)
- Tracking detected → C_EXPLICIT_CONSENT becomes relevant
- No AI evidence anywhere on the website → C_TRANSPARENCY is dropped
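These rules translate directly into a small mapping step. A hedged sketch (the function name `relevant_controls` is illustrative; the control IDs are the plan's own):

```python
def relevant_controls(scan_result: dict) -> set[str]:
    """Map the website-scan result to the set of relevant controls."""
    controls: set[str] = set()
    if scan_result.get("chatbot_detected") or scan_result.get("ai_mentions_found"):
        controls.add("C_TRANSPARENCY")
    if scan_result.get("tracking_count", 0) > 0:
        controls.add("C_EXPLICIT_CONSENT")
    return controls

print(sorted(relevant_controls({"chatbot_detected": True, "tracking_count": 2})))
# → ['C_EXPLICIT_CONSENT', 'C_TRANSPARENCY']
```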
### Implementation

**New file:** `backend-compliance/compliance/services/website_scanner.py` (~200 LOC)

```python
class WebsiteScanner:
    async def scan(self, base_url: str) -> ScanResult:
        """Scan 5-10 pages for AI, chatbot, and tracking indicators."""
        pages = await self._discover_pages(base_url)
        indicators = {}
        for page_url in pages[:10]:
            html = await self._fetch(page_url)
            indicators[page_url] = self._detect_indicators(html)
        return self._aggregate(indicators)
```
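The scan loop can be exercised without network access by injecting a stub fetcher. A simplified, runnable sketch of the class above (constructor injection, the single-page discovery, and the plain-dict result are assumptions made for the example):

```python
import asyncio
import re

class WebsiteScanner:
    def __init__(self, fetch):
        self._fetch = fetch  # injected async fetcher, easy to stub in tests

    async def scan(self, base_url: str) -> dict:
        pages = [base_url]  # discovery simplified to the start page only
        indicators = {}
        for page_url in pages[:10]:
            html = await self._fetch(page_url)
            indicators[page_url] = bool(re.search(r"intercom", html.lower()))
        return {
            "pages_scanned": len(indicators),
            "chatbot_detected": any(indicators.values()),
        }

async def fake_fetch(url: str) -> str:
    # Stub: pretend every page embeds an Intercom widget
    return '<script src="https://widget.intercom.io/x"></script>'

result = asyncio.run(WebsiteScanner(fake_fetch).scan("https://example.com/"))
print(result)  # → {'pages_scanned': 1, 'chatbot_detected': True}
```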
**Integration into the agent workflow:**

- Runs between step 1 (fetch) and step 3 (UCCA assess)
- The scan result feeds into the intake flags AND into the relevance filter
- The scan result is returned in the response (transparency)
**Frontend extension:**

- "Extended analysis" toggle: single page only vs. website scan
- Scan result as a collapsible section: "5 pages scanned, chatbot detected on start page"
### Effort

| Component | LOC | Time |
|-----------|-----|------|
| `website_scanner.py` | ~200 | 0.5 days |
| Integration into `agent_analyze_routes.py` | ~50 | 2h |
| Frontend: display scan result | ~80 | 2h |
| Tests | ~100 | 2h |
### Example: Opodo with website scan

```
Pages scanned: 5
- https://www.opodo.de/ → Didomi cookie consent, Google Analytics
- https://www.opodo.de/datenschutz/ → privacy policy
- https://www.opodo.de/impressum/ → 404 (FINDING!)
- https://www.opodo.de/agb/ → terms of service present
- https://www.opodo.de/cookies/ → cookie policy

Chatbot detected: no
AI mentions: no
Tracking: Google Analytics (G-03F834EHLM), Didomi CMP

→ C_TRANSPARENCY: NOT relevant (no AI evidence anywhere on the website)
→ C_EXPLICIT_CONSENT: relevant (Google Analytics + Didomi = tracking active)
→ Legal-notice finding: 404 on /impressum/ (violation of §5 TMG)
```
## Risks

| Risk | Mitigation |