When the user leaves some doc-type rows empty, the tool now actively
searches the website for them and marks them 'not found' only as a
last resort.
Flow:
1. User submits N URLs (e.g. just DSI)
2. For each canonical doc_type with no submitted URL/text, the route
identifies the most-common base (scheme://netloc) from submitted URLs
3. Calls consent-tester /dsi-discovery on the homepage with
max_documents=15 (180s timeout)
4. Classifies every discovered doc into a canonical doc_type via
title/URL keyword rules (_DISCOVERY_RULES, which covers cookie/
widerruf/social_media/agb/nutzungsbedingungen/dsb/impressum/dse)
5. Fills matching empty entries with the discovered text, marks
auto_discovered=True and discovery_attempted=True
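Steps 2, 4 and 5 can be sketched roughly as follows. This is only an
illustration, not the actual route code: the keyword table, the entry
and document dict shapes, and the helper names most_common_base,
classify and fill_empty_entries are assumptions made for the example.
The real _DISCOVERY_RULES may use different keywords and ordering.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical keyword table; the real _DISCOVERY_RULES may differ.
# Order matters: "dsb" is checked before "dse" because
# "datenschutzbeauftragte" also contains "datenschutz".
DISCOVERY_RULES_SKETCH = [
    ("cookie", ("cookie",)),
    ("widerruf", ("widerruf",)),
    ("social_media", ("social-media", "social_media", "socialmedia")),
    ("agb", ("agb", "geschaeftsbedingungen")),
    ("nutzungsbedingungen", ("nutzungsbedingungen",)),
    ("dsb", ("datenschutzbeauftragte",)),
    ("impressum", ("impressum",)),
    ("dse", ("datenschutz", "privacy")),
]


def most_common_base(urls):
    """Return the most frequent scheme://netloc among urls, or None."""
    bases = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme and parts.netloc:
            bases.append(f"{parts.scheme}://{parts.netloc}")
    if not bases:
        return None
    return Counter(bases).most_common(1)[0][0]


def classify(title, url):
    """Map a discovered doc to a canonical doc_type via keyword match."""
    haystack = f"{title} {url}".lower()
    for doc_type, keywords in DISCOVERY_RULES_SKETCH:
        if any(keyword in haystack for keyword in keywords):
            return doc_type
    return None


def fill_empty_entries(entries, discovered):
    """Fill entries that have neither url nor text from discovered docs."""
    for entry in entries:
        if entry.get("url") or entry.get("text"):
            continue  # submitted by the user, leave untouched
        entry["discovery_attempted"] = True
        for doc in discovered:
            if classify(doc["title"], doc["url"]) == entry["doc_type"]:
                entry.update(
                    url=doc["url"], text=doc["text"], auto_discovered=True
                )
                break
    return entries
```

Plain substring matching is deliberately naive here; a short keyword
such as "agb" can match unrelated words, so a real rule set would
likely need word boundaries or path-segment checks.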
Padding now differentiates:
- 'Auf der Website nicht gefunden' (not found on the website):
discovery was attempted but no doc matched. Amber badge, friendly
hint to add the URL manually.
- 'Nicht eingereicht — Quelle nicht angegeben' (not submitted, no
source given): the user gave no URLs at all, so there is nothing to
crawl from. Grey badge.
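The distinction between the two padding cases could look like the
sketch below. The function name padding_status, the entry dict shape
and the (label, colour) return value are assumptions for illustration;
only the two label strings and badge colours come from this change.

```python
NOT_FOUND = ("Auf der Website nicht gefunden", "amber")
NOT_SUBMITTED = ("Nicht eingereicht — Quelle nicht angegeben", "grey")


def padding_status(entry, submitted_url_count):
    """Return (label, badge colour) for an empty entry, else None."""
    if entry.get("text"):
        return None  # entry has content, no padding needed
    # Amber only when a crawl actually ran and found nothing for this type.
    if submitted_url_count >= 1 and entry.get("discovery_attempted"):
        return NOT_FOUND
    # No URLs were submitted, so there was nothing to crawl from.
    return NOT_SUBMITTED
```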
Email + frontend:
- Status labels: NICHT GEFUNDEN (amber) vs NICHT EINGEREICHT (grey)
- 'Gepruefte Quellen' (checked sources) table tags auto-discovered URLs
with a small blue 'auto-entdeckt' (auto-discovered) badge so the GF
sees what the tool found vs. what the user submitted.
Discovery only runs when at least one URL was submitted (otherwise
there is no base to crawl from). It adds 30-90s when doc types are
left unsubmitted, but avoids the 'just say nicht gefunden'
anti-pattern.