fix: Eliminate GA false positive + handle short DSI documents

Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services
  (e.g. the IHK DSE describes etracker and mentions 'google analytics' in its text)
- Technical patterns (with regex chars) still checked in full HTML
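
A minimal sketch of the scoping rule above, not the code from this commit:
the detection file is not shown in this hunk, so SERVICE_PATTERNS,
is_technical() and detect_services() are hypothetical names, and the rule
used to classify a pattern as "technical" is an assumption.

```python
import re
from html.parser import HTMLParser


class ScriptAndLinkCollector(HTMLParser):
    """Collects inline <script> bodies plus every src/href attribute value."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.parts.append(data)


# Hypothetical pattern table: plain strings and regex-style "technical" patterns.
SERVICE_PATTERNS = {
    "Google Analytics": ["google analytics", "googletagmanager.com", r"gtag\s*\("],
    "etracker": ["etracker.com"],
}

# Assumption: a pattern counts as technical if it contains one of these characters.
_REGEX_CHARS = re.compile(r"[\\()\[\]{}*+?^$|]")


def is_technical(pattern):
    return bool(_REGEX_CHARS.search(pattern))


def detect_services(html):
    """Return the set of service names detected in the given HTML string."""
    collector = ScriptAndLinkCollector()
    collector.feed(html)
    collector.close()
    scoped = "\n".join(collector.parts).lower()

    found = set()
    for service, patterns in SERVICE_PATTERNS.items():
        for pattern in patterns:
            if is_technical(pattern):
                # Technical patterns are still checked against the full HTML.
                hit = re.search(pattern, html, re.IGNORECASE) is not None
            else:
                # Plain patterns only match script bodies and src/href values.
                hit = pattern.lower() in scoped
            if hit:
                found.add(service)
                break
    return found
```

With this scoping, a policy paragraph that merely names a service yields no
hit, while a script tag loading it does.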

Short documents:
- DSE documents with < 200 words are flagged as 'Kurzhinweis' instead of
  'MANGELHAFT', since they are too short for an Art. 13 completeness check
- Prevents 96-word navigation pages from showing 8 missing fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-05 18:21:37 +02:00
parent 8d6959e8b2
commit 8fb2061e9b
2 changed files with 35 additions and 5 deletions
@@ -188,6 +188,23 @@ def check_document_completeness(
        })
        return findings

    # Short documents (< 200 words) are likely navigation snippets or
    # introductory pages, not full Art. 13 documents; flag but don't check
    word_count = len(text.split())
    if word_count < 200 and doc_type == "dse":
        findings.append({
            "code": f"DSI-SCORE-{doc_type.upper()}",
            "severity": "LOW",
            "text": (
                f"'{doc_title}': Kurzhinweis ({word_count} Woerter) — zu kurz fuer "
                f"eine vollstaendige Art. 13 DSGVO Pruefung. Kein eigenstaendiges DSI-Dokument."
            ),
            "doc_title": doc_title,
            "doc_url": doc_url,
            "doc_type": doc_type,
        })
        return findings

    # Select checklist based on document type
    if doc_type in ("dse", "datenschutz", "privacy"):
        checklist = ART13_CHECKLIST
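
The short-document guard in this hunk can be tried in isolation with the
sketch below. It is an illustration under assumptions, not part of the commit:
short_document_finding() and SHORT_DOC_WORD_LIMIT are hypothetical names, and
the message text is shortened. Words are counted as whitespace-separated
tokens, as in the hunk.

```python
SHORT_DOC_WORD_LIMIT = 200  # threshold from the commit; constant name is hypothetical


def short_document_finding(text, doc_title, doc_url, doc_type):
    """Return one LOW finding for a short 'dse' document, otherwise None."""
    word_count = len(text.split())
    if word_count >= SHORT_DOC_WORD_LIMIT or doc_type != "dse":
        return None
    return {
        "code": f"DSI-SCORE-{doc_type.upper()}",
        "severity": "LOW",
        "text": f"'{doc_title}': Kurzhinweis ({word_count} Woerter)",
        "doc_title": doc_title,
        "doc_url": doc_url,
        "doc_type": doc_type,
    }


# A 96-word navigation page yields a single LOW finding instead of a
# full Art. 13 checklist run.
nav_page = " ".join(["wort"] * 96)
finding = short_document_finding(nav_page, "Datenschutz", "https://example.org/dse", "dse")
```

Note that the guard only fires for doc_type == "dse"; short documents typed
"datenschutz" or "privacy" still go through the Art. 13 checklist.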