fix(compliance-check): respect auto-discovery 'not found' verdict; DSB not canonical

Two related bugs in the BMW test result:

1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB:
   - Auto-discovery correctly returned 'not found' for AGB (no link on
     bmw.de matches AGB keywords).
   - But auto_fill_from_dsi then found the substring 'AGB' in a section
     of the DSI and pseudo-filled the AGB entry with a 264-word DSI
     fragment.
   - cross_search_documents would have done the same.
   - Both now skip entries where discovery_attempted=True AND
     auto_discovered=False — the 'not found' verdict stands.

2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words,
   i.e. the entire DSI text:
   - GDPR practice: the DSB (data protection officer) is named *inside*
     the DSI as an email or contact block (Art. 13(1)(b)), not as a
     stand-alone page.
   - cross_search_documents had been assigning the full DSI to the DSB
     row because it matched 'datenschutzbeauftragte' keywords.
   - DSB removed from _ALL_DOC_TYPES — no longer canonical, no longer
     padded as missing, no longer auto-discovered. The frontend row
     remains so a tenant with a separate DSB page can still submit one.
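
A minimal sketch of the guard described in bug 1, for illustration only.
The flags discovery_attempted and auto_discovered come from this commit;
the dict-shaped entries, the "doc_type" and "text" keys, and both helper
names are assumptions, not the actual code in auto_fill_from_dsi or
cross_search_documents.

```python
def discovery_found_nothing(entry: dict) -> bool:
    """True when auto-discovery ran for this entry and came back empty."""
    return bool(entry.get("discovery_attempted")) and not entry.get("auto_discovered")


def fallback_candidates(entries: list[dict]) -> list[dict]:
    """Entries that a DSI-based fallback may still try to fill.

    Skips entries that already have text, and entries whose 'not found'
    verdict from auto-discovery must stand.
    """
    return [
        e for e in entries
        if not e.get("text") and not discovery_found_nothing(e)
    ]


entries = [
    # Discovery ran and found no link: must stay empty.
    {"doc_type": "agb", "text": "", "discovery_attempted": True, "auto_discovered": False},
    # Discovery never ran: fallback is still allowed.
    {"doc_type": "widerruf", "text": "", "discovery_attempted": False, "auto_discovered": False},
    # Already filled: nothing to do.
    {"doc_type": "dse", "text": "Datenschutz ...", "discovery_attempted": True, "auto_discovered": True},
]
fillable = [e["doc_type"] for e in fallback_candidates(entries)]
```

With these sample entries only the widerruf row remains eligible for a
fallback fill; the agb row keeps its 'not found' verdict.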

After this fix BMW should render:
- DSE: OK
- Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately)
- Cookie-Richtlinie: OK
- Social Media: NICHT GEFUNDEN (bmw.de does not link to it)
- AGB: NICHT GEFUNDEN (correct — BMW has no public AGB)
- Nutzungsbedingungen: NICHT GEFUNDEN
- Widerruf: NICHT GEFUNDEN
Benjamin Admin
2026-05-17 01:53:09 +02:00
parent c4be077c5d
commit b090662524
2 changed files with 21 additions and 5 deletions
@@ -890,13 +890,19 @@ _DOC_TYPE_LABELS = {
     "dsb": "DSB-Kontakt",
 }
-# Canonical 8 doc types in the same order as the frontend ComplianceCheckTab.
+# Canonical doc types in the same order as the frontend ComplianceCheckTab.
 # The route pads `results` to always contain an entry for each — even if
 # the user did not submit a URL — so the email + frontend always show
 # the complete checklist (missing rows marked as 'Nicht eingereicht').
+#
+# DSB-Kontakt is intentionally NOT canonical: per GDPR practice the DSB is
+# named *inside* the DSI/datenschutz document (email or contact block), not
+# as a separate page. We check 'DSB benannt' as a sub-check of the DSE
+# instead. If a tenant insists on a separate DSB document, they can still
+# submit one — it just won't appear as a missing checklist row.
 _ALL_DOC_TYPES = [
     "dse", "impressum", "social_media", "cookie",
-    "agb", "nutzungsbedingungen", "widerruf", "dsb",
+    "agb", "nutzungsbedingungen", "widerruf",
 ]
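
The padding behaviour described in the comment above could look roughly
like the following sketch. It is an illustration under stated assumptions,
not the route's actual code: pad_results, the dict-shaped results, and the
"doc_type" and "status" keys are hypothetical; only the doc type list and
the 'Nicht eingereicht' label are taken from this diff.

```python
# Canonical doc types after this commit (dsb removed).
ALL_DOC_TYPES = [
    "dse", "impressum", "social_media", "cookie",
    "agb", "nutzungsbedingungen", "widerruf",
]


def pad_results(results: list[dict]) -> list[dict]:
    """Return one row per canonical doc type, followed by any extra rows.

    Canonical types without a submitted result are padded as
    'Nicht eingereicht'. Non-canonical rows such as a separately
    submitted 'dsb' document are kept, but never padded as missing.
    """
    by_type = {r["doc_type"]: r for r in results}
    padded = [
        by_type.get(t, {"doc_type": t, "status": "Nicht eingereicht"})
        for t in ALL_DOC_TYPES
    ]
    extras = [r for r in results if r["doc_type"] not in ALL_DOC_TYPES]
    return padded + extras


without_dsb = pad_results([{"doc_type": "dse", "status": "OK"}])
with_dsb = pad_results([
    {"doc_type": "dse", "status": "OK"},
    {"doc_type": "dsb", "status": "OK"},
])
```

In this sketch a tenant that submits no DSB page gets seven rows and no
missing DSB row, while a tenant that does submit one gets it appended
after the canonical rows.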