feat(compliance-check): auto-discover missing doc types from homepage

When the user leaves some doc-type rows empty, the tool now actively
searches the website for them and marks a type as 'not found' only as a
last resort.

Flow:
1. User submits N URLs (e.g. just DSI)
2. For each canonical doc_type with no submitted URL/text, the route
   identifies the most common base (scheme://netloc) among the
   submitted URLs
3. Calls consent-tester /dsi-discovery on the homepage with
   max_documents=15 (180s timeout)
4. Classifies every discovered doc into a canonical doc_type via
   title/URL keyword rules (_DISCOVERY_RULES, covering cookie/widerruf/
   social_media/agb/nutzungsbedingungen/dsb/impressum/dse)
5. Fills matching empty entries with the discovered text and marks them
   auto_discovered=True and discovery_attempted=True

Padding now differentiates:
- 'Auf der Website nicht gefunden' ("not found on the website"):
  discovery was attempted, no doc matched. Amber badge, friendly hint
  to add the URL manually.
- 'Nicht eingereicht — Quelle nicht angegeben' ("not submitted, no
  source given"): the user gave NO URLs at all, so there is nothing to
  crawl from. Grey badge.

Email + frontend:
- Status labels: NICHT GEFUNDEN (amber) vs NICHT EINGEREICHT (grey)
- The 'Gepruefte Quellen' ("sources checked") table tags auto-discovered
  URLs with a small blue 'auto-entdeckt' ("auto-discovered") badge so
  the GF sees what the tool found vs what the user submitted.

Discovery only runs when ≥1 URL was submitted (otherwise there is no
base to crawl from). It adds 30-90s for unsubmitted types but avoids
the 'just say nicht gefunden' anti-pattern.
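
Steps 2 and 4 of the flow can be sketched roughly as follows. This is an illustration only: the route code and the real contents of _DISCOVERY_RULES are not part of this diff, so the keyword table EXAMPLE_RULES and the helper names most_common_base and classify are assumptions, not the committed implementation.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlparse

# Hypothetical keyword table. Only the category names come from the commit
# message; the keywords themselves are illustrative guesses.
# Order matters: "dsb" is checked before "dse" because
# "datenschutzbeauftragt" also contains "datenschutz".
EXAMPLE_RULES: list[tuple[str, tuple[str, ...]]] = [
    ("cookie", ("cookie",)),
    ("widerruf", ("widerruf",)),
    ("social_media", ("social-media", "social media")),
    ("agb", ("agb", "geschaeftsbedingungen")),
    ("nutzungsbedingungen", ("nutzungsbedingungen",)),
    ("dsb", ("datenschutzbeauftragt",)),
    ("impressum", ("impressum",)),
    ("dse", ("datenschutz", "privacy")),
]


def most_common_base(urls: list[str]) -> Optional[str]:
    """Return the most frequent scheme://netloc among urls, or None."""
    bases = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme and parts.netloc:
            bases.append(f"{parts.scheme}://{parts.netloc}")
    if not bases:
        return None  # nothing submitted, so there is no base to crawl from
    return Counter(bases).most_common(1)[0][0]


def classify(title: str, url: str) -> Optional[str]:
    """Map a discovered document to a doc_type via substring keywords."""
    haystack = f"{title} {url}".lower()
    for doc_type, keywords in EXAMPLE_RULES:
        if any(keyword in haystack for keyword in keywords):
            return doc_type
    return None  # no rule matched; the entry stays empty
```

A first-match-wins rule list keeps the classification easy to audit, at the cost of being sensitive to rule order and to short keywords such as "agb" matching inside unrelated words.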
@@ -30,15 +30,21 @@ def build_scanned_urls_html(doc_entries: list[dict]) -> str:
         seen.add(url)
         label = _doc_type_label(entry.get("doc_type", ""))
         words = entry.get("word_count") or 0
+        auto = entry.get("auto_discovered")
         try:
             netloc = urlparse(url).netloc.lower().lstrip("www.")
             if netloc:
                 domains.setdefault(netloc, []).append(label)
         except Exception:
             pass
+        badge = ('<span style="display:inline-block;margin-left:6px;'
+                 'background:#dbeafe;color:#1e40af;font-size:10px;'
+                 'padding:1px 6px;border-radius:8px;font-family:sans-serif">'
+                 'auto-entdeckt</span>') if auto else ""
         rows.append(
             f'<tr>'
-            f'<td style="padding:3px 12px 3px 0;color:#475569;font-size:12px">{label}</td>'
+            f'<td style="padding:3px 12px 3px 0;color:#475569;font-size:12px">'
+            f'{label}{badge}</td>'
             f'<td style="padding:3px 12px 3px 0;font-size:12px;'
             f'font-family:ui-monospace,monospace;color:#1e293b;word-break:break-all">'
             f'<a href="{url}" style="color:#2563eb;text-decoration:none">{url}</a></td>'
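
A note on the unchanged context line that computes netloc: str.lstrip("www.") removes any leading run of the characters "w" and ".", not the literal prefix "www.", so hosts whose name starts with "w" can be truncated. Below is a minimal sketch of a prefix-safe alternative, assuming Python 3.9+ for str.removeprefix. The helper name normalize_host is hypothetical and not part of this commit.

```python
from urllib.parse import urlparse


def normalize_host(url: str) -> str:
    """Return the lower-cased host of url without a leading 'www.' label."""
    host = urlparse(url).netloc.lower()
    # removeprefix drops the exact prefix once, or returns host unchanged.
    return host.removeprefix("www.")


# lstrip treats its argument as a set of characters to strip:
print("www.web.de".lstrip("www."))                        # eb.de
print(normalize_host("https://www.web.de/datenschutz"))   # web.de
```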
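
The row template interpolates label and url directly into the markup. Auto-discovered URLs come from crawled pages, so escaping them before interpolation may be worth considering. The sketch below uses html.escape from the standard library. The function render_source_row and its simplified styling are assumptions for illustration, not code from this commit.

```python
from html import escape


def render_source_row(label: str, url: str, auto_discovered: bool = False) -> str:
    """Build one table row, escaping label and url for HTML output."""
    # The badge is a fixed literal, so it is safe to insert unescaped.
    badge = " <span>auto-entdeckt</span>" if auto_discovered else ""
    safe_label = escape(label)
    safe_url = escape(url, quote=True)  # quote=True also escapes " and '
    return (
        f"<tr><td>{safe_label}{badge}</td>"
        f'<td><a href="{safe_url}">{safe_url}</a></td></tr>'
    )
```

Escaping prevents markup injection through attribute or text content, but it does not validate the URL scheme. Restricting links to http and https would be a separate check.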