feat(compliance-check): auto-discover missing doc types from homepage

When the user leaves some doc-type rows empty, the tool now actively
searches the website for them and marks them 'not found' only as a
last resort.

Flow:
1. User submits N URLs (e.g. just DSI)
2. For each canonical doc_type with no submitted URL/text, the route
   identifies the most-common base (scheme://netloc) from submitted URLs
3. Calls consent-tester /dsi-discovery on the homepage with
   max_documents=15 (180s timeout)
4. Classifies every discovered doc into a canonical doc_type via
   title/URL keyword rules (_DISCOVERY_RULES — covers cookie/widerruf/
   social_media/agb/nutzungsbedingungen/dsb/impressum/dse)
5. Fills matching empty entries with the discovered text, marks
   auto_discovered=True and discovery_attempted=True
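
The steps above can be sketched roughly as follows. This is an
illustration, not the code in this commit: the rule table, the entry
and document dict shapes, and the function names are all assumptions,
and the HTTP call to /dsi-discovery is left out because its
request/response format is not shown here.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical, shortened stand-in for _DISCOVERY_RULES.
# The first rule whose keyword occurs in title or URL wins.
DISCOVERY_RULES_SKETCH = [
    ("cookie", ("cookie",)),
    ("widerruf", ("widerruf",)),
    ("agb", ("agb", "geschaeftsbedingungen")),
    ("impressum", ("impressum", "imprint")),
    ("dse", ("datenschutz", "privacy")),
]


def most_common_base(urls):
    """Return the most frequent scheme://netloc among urls, or None."""
    bases = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme and parts.netloc:
            bases.append(f"{parts.scheme}://{parts.netloc}")
    if not bases:
        return None  # nothing to crawl from
    return Counter(bases).most_common(1)[0][0]


def classify(title, url):
    """Map a discovered document to a doc_type via keyword match."""
    haystack = f"{title} {url}".lower()
    for doc_type, keywords in DISCOVERY_RULES_SKETCH:
        if any(keyword in haystack for keyword in keywords):
            return doc_type
    return None


def fill_missing(entries, discovered):
    """Fill entries that have neither URL nor text from discovered docs."""
    for entry in entries:
        if entry.get("url") or entry.get("text"):
            continue  # submitted by the user, leave untouched
        entry["discovery_attempted"] = True
        for doc in discovered:
            if classify(doc["title"], doc["url"]) == entry["doc_type"]:
                entry.update(url=doc["url"], text=doc["text"],
                             auto_discovered=True)
                break
    return entries
```

Plain substring matching is the simplest reading of "title/URL keyword
rules"; short keywords such as "agb" can match unrelated words, so a
real rule set may need word boundaries or a priority order.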

Padding now distinguishes two cases:
- 'Auf der Website nicht gefunden' ("not found on the website"):
  discovery was attempted, but no doc matched. Amber badge, friendly
  hint to add the URL manually.
- 'Nicht eingereicht — Quelle nicht angegeben' ("not submitted, source
  not specified"): the user gave NO URLs at all, so there is nothing
  to crawl from. Grey badge.
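
A minimal sketch of the padding decision, assuming each entry is a
dict carrying the discovery_attempted flag described above. The
function name and the dict shape are hypothetical; the two message
strings are the ones quoted in this message.

```python
NOT_FOUND = "Auf der Website nicht gefunden"
NOT_SUBMITTED = "Nicht eingereicht — Quelle nicht angegeben"


def padding_error(entry):
    """Return the error text for an entry that is still empty, else None."""
    if entry.get("text"):
        return None  # entry was submitted or auto-discovered
    if entry.get("discovery_attempted"):
        return NOT_FOUND  # crawled the site, nothing matched
    return NOT_SUBMITTED  # no URLs at all, so no crawl happened
```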

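Email + frontend:
- Status labels: NICHT GEFUNDEN ("not found", amber) vs NICHT
  EINGEREICHT ("not submitted", grey)
- The 'Gepruefte Quellen' ("checked sources") table tags
  auto-discovered URLs with a small blue 'auto-entdeckt'
  ("auto-discovered") badge so the GF (managing director) can see what
  the tool found versus what the user submitted.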
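
For the email side, the label choice might look like the sketch
below. The function names and return shapes are assumptions; only the
prefixes, labels, and colours come from this message. The more
specific 'not found' prefix is checked first, mirroring the order of
the frontend ternary.

```python
def status_badge(error):
    """Map an entry's error text to (label, colour), or None if no badge."""
    if error and error.startswith("Auf der Website nicht gefunden"):
        return ("NICHT GEFUNDEN", "amber")
    if error and error.startswith("Nicht eingereicht"):
        return ("NICHT EINGEREICHT", "grey")
    return None


def source_tag(entry):
    """Return the badge text for a row in the sources table, or None."""
    return "auto-entdeckt" if entry.get("auto_discovered") else None
```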

Discovery only runs when at least one URL was submitted; otherwise
there is no base to crawl from. It adds 30-90s when doc types are left
unsubmitted, but avoids the 'just say nicht gefunden' anti-pattern.
Benjamin Admin
2026-05-17 01:14:05 +02:00
parent 79efa54898
commit 525038359a
4 changed files with 193 additions and 19 deletions
@@ -167,7 +167,11 @@ export function ChecklistView({ results }: { results: DocResult[] }) {
 </div>
 </div>
 <div className="flex items-center gap-3 shrink-0 ml-3">
-{r.error && r.error.startsWith("Nicht eingereicht") ? (
+{r.error && r.error.startsWith("Auf der Website nicht gefunden") ? (
+<span className="text-xs text-amber-700 font-medium px-2 py-0.5 bg-amber-100 rounded-full whitespace-nowrap">
+Nicht gefunden
+</span>
+) : r.error && r.error.startsWith("Nicht eingereicht") ? (
 <span className="text-xs text-gray-500 font-medium px-2 py-0.5 bg-gray-100 rounded-full whitespace-nowrap">
 Nicht eingereicht
 </span>