fix(compliance-check): respect auto-discovery 'not found' verdict; DSB not canonical
Two related bugs in the BMW test result:
1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB:
   - Auto-discovery correctly returned 'not found' for AGB (no link on
     bmw.de matches AGB keywords).
   - But auto_fill_from_dsi then found the substring 'AGB' in a section
     of the DSI and pseudo-filled the AGB entry with a 264-word DSI
     fragment.
   - cross_search_documents would have done the same.
   - Both now skip entries where discovery_attempted=True AND
     auto_discovered=False, so the 'not found' verdict stands.
2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words,
   i.e. the entire DSI text:
   - GDPR practice: the DSB is named *inside* the DSI as an email or
     contact block (Art. 13(1)(b)), not as a stand-alone page.
   - cross_search_documents had been assigning the full DSI to the DSB
     row because it matched 'datenschutzbeauftragte' keywords.
   - DSB removed from _ALL_DOC_TYPES: it is no longer canonical, no
     longer padded as missing, and no longer auto-discovered. The
     frontend row remains so a tenant with a separate DSB page can
     still submit one.
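
The guard from item 1 can be sketched in isolation. This is a minimal illustration, not the code from this commit: the helper names discovery_ruled_not_found and entries_eligible_for_fill are hypothetical, and only the entry keys (text, url, discovery_attempted, auto_discovered) are taken from the diff.

```python
def discovery_ruled_not_found(entry: dict) -> bool:
    """True when auto-discovery ran for this entry and found nothing."""
    # Hypothetical helper: wraps the condition both functions now check.
    return bool(entry.get("discovery_attempted")) and not entry.get("auto_discovered")


def entries_eligible_for_fill(doc_entries: list[dict]) -> list[dict]:
    """Return the entries a DSI-based fallback fill may still touch."""
    eligible = []
    for entry in doc_entries:
        if entry.get("text") or entry.get("url"):
            continue  # Already has content
        if discovery_ruled_not_found(entry):
            continue  # The 'not found' verdict stands
        eligible.append(entry)
    return eligible


entries = [
    # Discovery ran and found nothing: must stay empty.
    {"doc_type": "agb", "discovery_attempted": True, "auto_discovered": False},
    # Discovery never ran: a fallback fill is still allowed.
    {"doc_type": "widerruf"},
    # Already has text: never refilled.
    {"doc_type": "dse", "text": "Datenschutzerklaerung", "discovery_attempted": True,
     "auto_discovered": True},
]
eligible_types = [e["doc_type"] for e in entries_eligible_for_fill(entries)]
print(eligible_types)
```

Only the entry for which discovery never ran stays eligible; the AGB entry keeps its empty state and therefore its 'not found' verdict.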
After this fix BMW should render:
- DSE: OK
- Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately)
- Cookie-Richtlinie: OK
- Social Media: NICHT GEFUNDEN (bmw.de does not link to it)
- AGB: NICHT GEFUNDEN (correct — BMW has no public AGB)
- Nutzungsbedingungen: NICHT GEFUNDEN
- Widerruf: NICHT GEFUNDEN
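
Why DSB no longer shows up as a missing row can be sketched as follows. This assumes the route pads results roughly like the hypothetical pad_results below; the function name, the "status" key, and the choice to append non-canonical submissions at the end are illustrative assumptions, not taken from the code base.

```python
_ALL_DOC_TYPES = [
    "dse", "impressum", "social_media", "cookie",
    "agb", "nutzungsbedingungen", "widerruf",
]


def pad_results(results: list[dict]) -> list[dict]:
    """Return one row per canonical doc type, plus any non-canonical extras."""
    by_type = {r["doc_type"]: r for r in results}
    # Canonical types always get a row; absent ones get a placeholder.
    padded = [
        by_type.get(t, {"doc_type": t, "status": "missing"})
        for t in _ALL_DOC_TYPES
    ]
    # Non-canonical submissions (e.g. a separate DSB page) are kept,
    # but never padded in when absent.
    extras = [r for r in results if r["doc_type"] not in _ALL_DOC_TYPES]
    return padded + extras


without_dsb = pad_results([{"doc_type": "dse", "status": "OK"}])
with_dsb = pad_results([
    {"doc_type": "dse", "status": "OK"},
    {"doc_type": "dsb", "status": "OK"},
])
print([r["doc_type"] for r in without_dsb])
print([r["doc_type"] for r in with_dsb])
```

With this shape, a tenant who submits no DSB page gets seven rows and no DSB placeholder, while a submitted DSB page is still carried through.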
@@ -890,13 +890,19 @@ _DOC_TYPE_LABELS = {
     "dsb": "DSB-Kontakt",
 }
 
-# Canonical 8 doc types in the same order as the frontend ComplianceCheckTab.
+# Canonical doc types in the same order as the frontend ComplianceCheckTab.
 # The route pads `results` to always contain an entry for each — even if
 # the user did not submit a URL — so the email + frontend always show
 # the complete checklist (missing rows marked as 'Nicht eingereicht').
+#
+# DSB-Kontakt is intentionally NOT canonical: per GDPR practice the DSB is
+# named *inside* the DSI/datenschutz document (email or contact block), not
+# as a separate page. We check 'DSB benannt' as a sub-check of the DSE
+# instead. If a tenant insists on a separate DSB document, they can still
+# submit one — it just won't appear as a missing checklist row.
 _ALL_DOC_TYPES = [
     "dse", "impressum", "social_media", "cookie",
-    "agb", "nutzungsbedingungen", "widerruf", "dsb",
+    "agb", "nutzungsbedingungen", "widerruf",
 ]
 
 
@@ -199,6 +199,12 @@ def auto_fill_from_dsi(doc_entries: list[dict]) -> None:
     for entry in doc_entries:
         if entry.get("text") or entry.get("url"):
             continue  # Already has content
+        # Auto-discovery already tried + decided: skip. Don't override its
+        # 'NICHT GEFUNDEN' verdict with a pseudo-match from DSI sections
+        # (which produces false MANGELHAFT findings for genuinely missing
+        # docs like BMW's AGB).
+        if entry.get("discovery_attempted") and not entry.get("auto_discovered"):
+            continue
 
         doc_type = entry["doc_type"]
         section_text = _find_section_for_type(sections, doc_type)
@@ -267,8 +273,10 @@ def cross_search_documents(doc_entries: list[dict]) -> list[dict]:
         return findings
 
     # For each entry, check if:
-    # a) It's empty → search other texts
-    # b) It has text but the text doesn't match the doc_type → search other texts
+    # a) It has text but the text doesn't match the doc_type → search other texts
+    # (Empty entries from auto-discovery 'not found' are NOT pseudo-filled
+    # from other docs — that would silently revive a 'NICHT GEFUNDEN' verdict
+    # as a misleading MANGELHAFT row.)
     for entry in doc_entries:
         target_type = entry["doc_type"]
         keywords = _DOC_TYPE_KEYWORDS.get(target_type, [])
@@ -278,13 +286,15 @@ def cross_search_documents(doc_entries: list[dict]) -> list[dict]:
         has_text = entry.get("text") and len(entry["text"].split()) > 50
         text_matches = False
         if has_text:
             # Check if the current text actually contains this doc_type's content
             entry_lower = entry["text"].lower()
             match_score = sum(1 for kw in keywords if kw in entry_lower)
             text_matches = match_score >= 2
 
         if has_text and text_matches:
             continue  # Text present AND matches doc_type → skip
+        # Skip empty entries the auto-discovery has already ruled on.
+        if not has_text and entry.get("discovery_attempted") and not entry.get("auto_discovered"):
+            continue
 
         # Search all other texts for this doc_type's keywords
         best_match: dict | None = None