fix: 4 bugs from IHK Konstanz scan validation

1. DSE-Matcher: fix Google/YouTube false match. The provider-name fallback
   now requires at least 2 matching words, so "Google" alone no longer
   matches the YouTube section
2. AGB/Widerrufsbelehrung: new only_ecommerce flag skips these checks on
   non-shop websites (shops are detected via payment providers and cart
   keywords)
3. DSE-internal link following — scanner now discovers links WITHIN the
   privacy policy and scans those too (finds regional DSE sub-pages)
4. Expanded keyword synonyms for DSE mandatory checks:
   - "Zweck und Rechtsgrundlage" now also matches "zwecke"
   - "behoerdlichen datenschutzbeauftragt" now matches the DSB check
   - "aufsichtsbehörde" spelled with an umlaut now matches
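
One way the synonym and umlaut handling from item 4 could work is sketched below. This is only an illustration: the `SYNONYMS` table, the check names, and the helpers `normalize` and `found_checks` are hypothetical and not taken from the changed files. The sketch assumes both the keywords and the policy text are transliterated (ä to ae, ö to oe, ü to ue) before comparison, so either spelling matches.

```python
import re

# Hypothetical synonym table; the scanner's real keyword lists are not shown
# in this commit.
SYNONYMS = {
    "purpose_and_legal_basis": ["zweck und rechtsgrundlage", "zwecke"],
    "data_protection_officer": [
        "datenschutzbeauftragt",
        "behoerdlichen datenschutzbeauftragt",
    ],
    "supervisory_authority": ["aufsichtsbehörde"],
}

# Transliteration table so "aufsichtsbehörde" and "aufsichtsbehoerde" compare equal.
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize(text: str) -> str:
    """Lowercase, transliterate umlauts, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(_UMLAUTS)).strip()


def found_checks(dse_text: str) -> set[str]:
    """Return the names of all checks with at least one keyword in the text."""
    haystack = normalize(dse_text)
    return {
        check
        for check, keywords in SYNONYMS.items()
        if any(normalize(keyword) in haystack for keyword in keywords)
    }
```

Normalizing both sides keeps the keyword list short, since each synonym only has to be written once regardless of how the website spells it.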

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-04-29 17:57:19 +02:00
parent 0f3ba9c207
commit fff47cc52e
3 changed files with 70 additions and 7 deletions
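
The `only_ecommerce` behaviour from item 2 of the commit message might look roughly like the sketch below. Only the flag name comes from the commit message; the keyword lists, the dictionary shape of a check, and the functions `is_ecommerce` and `applicable_checks` are assumptions made for illustration.

```python
# Hypothetical indicator lists; the real detection rules are not part of this hunk.
PAYMENT_PROVIDERS = ("paypal", "klarna", "stripe")
CART_KEYWORDS = ("warenkorb", "zur kasse", "checkout")


def is_ecommerce(page_text: str) -> bool:
    """Treat a site as a shop if any payment provider or cart keyword appears."""
    text = page_text.lower()
    return any(keyword in text for keyword in PAYMENT_PROVIDERS + CART_KEYWORDS)


def applicable_checks(checks: list[dict], page_text: str) -> list[dict]:
    """Drop checks flagged only_ecommerce when the site is not a shop."""
    shop = is_ecommerce(page_text)
    return [c for c in checks if shop or not c.get("only_ecommerce", False)]


checks = [
    {"name": "Impressum"},
    {"name": "AGB", "only_ecommerce": True},
    {"name": "Widerrufsbelehrung", "only_ecommerce": True},
]
# A chamber-of-commerce landing page has no cart, so only "Impressum" remains.
remaining = applicable_checks(checks, "Willkommen bei der IHK")
```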
@@ -64,8 +64,21 @@ def match_service_to_dse(
)
    # Step 2: Search for provider name (e.g., "Google" for "Google Analytics")
    # But only if the provider name is specific enough: avoid "Google" matching YouTube
    provider = service_name.split()[0] if " " in service_name else service_name
    if len(provider) < 4 or provider.lower() in ("the", "a", "an"):
        provider = service_name  # Too short/generic, use full name
    section = find_section_by_content(sections, provider)
    # Verify: the section must actually be about THIS service, not just mention the provider
    if section and provider.lower() != service_name.lower():
        # Check if the full service name or a close variant is in the section
        content_lower = section.content.lower()
        service_words = service_name.lower().split()
        # At least 2 words of the service name must match (not just "Google")
        matching_words = sum(1 for w in service_words if w in content_lower)
        if matching_words < 2 and service_name.lower() not in content_lower:
            section = None  # False match: provider name found but wrong context
    if section:
        original = _extract_relevant_paragraph(section.content, provider)
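
To see the effect of the new verification step in isolation, the acceptance rule from the hunk above can be lifted into a small standalone function. The helper name `is_plausible_section` and the sample texts are hypothetical; in the actual code the check only runs when the provider differs from the full service name.

```python
def is_plausible_section(service_name: str, section_content: str) -> bool:
    """Accept a section if it contains the full service name or at least
    two of its words (same rule as the hunk above, as a standalone helper)."""
    content_lower = section_content.lower()
    full_name = service_name.lower()
    if full_name in content_lower:
        return True
    matching_words = sum(1 for w in full_name.split() if w in content_lower)
    return matching_words >= 2


# Made-up policy snippets for illustration.
youtube_section = (
    "YouTube: Wir binden Videos von YouTube ein, "
    "einem Dienst der Google Ireland Ltd."
)
analytics_section = "Wir nutzen Google Analytics zur Reichweitenmessung."

# Only "google" matches in the YouTube section, so it is rejected.
rejected = is_plausible_section("Google Analytics", youtube_section)
# The full service name appears, so the section is accepted.
accepted = is_plausible_section("Google Analytics", analytics_section)
```

Because the word test is a substring check, word order does not matter: a section saying "Der Tag Manager von Google" would still be accepted for "Google Tag Manager".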