fix(impressum): P9 - 7 false-positive fixes in the mandatory-disclosure checks

#1 Provider name: a \b word boundary keeps "ag" from matching inside
   "samstag"; "aktiengesellschaft" now also counts as a full match.
#2 Authorized representatives: the parenthesized-list pattern now
   recognizes the BMW format "Vorstand (Milan Nedeljkovic, Jochen
   Goller, ...)" as well as "Vorsitzender des Aufsichtsrats: Name".
#3 V.i.S.d.P.: was already INFO, OK.
#4 ODR platform (OS-Plattform)/VSBG: with no_direct_sales=True (OEM
   pattern) the check is now skipped as "Nicht anwendbar" (not
   applicable) instead of counting as a 0/1 fail. The profile is now
   passed through check_document_completeness -> runner.
#5 Competent chamber: IHK, Handwerkskammer and Tieraerztekammer added
   to the pattern; severity LOW -> INFO (conditional).
#6 Share capital: was already INFO, OK.
#7 Link disclaimer: new check property "invert"=True. An anti-pattern
   check passes when the pattern is NOT found and fails when it is
   found. Previously the finding always fired; now it fires only when
   the text contains an illegal disclaimer.
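
Fix #1 can be illustrated with a minimal sketch. The two patterns below are hypothetical stand-ins, since the real check patterns are not part of this diff; the sketch only shows why a bare "ag" matches inside "samstag" while a \b-anchored pattern does not.

```python
import re

# Hypothetical patterns for illustration only (not the project's real ones).
NAIVE = re.compile(r"ag")
BOUNDED = re.compile(r"\bag\b|\baktiengesellschaft\b")


def mentions_legal_form(text: str, pattern: re.Pattern) -> bool:
    """Return True if the pattern matches the lowercased text."""
    return pattern.search(text.lower()) is not None


opening_hours = "Geöffnet Montag bis Samstag"  # no legal form present
short_form = "Muster AG, München"              # abbreviated legal form
long_form = "Muster Aktiengesellschaft"        # spelled-out legal form

# The naive pattern yields a false positive on weekday names.
print(mentions_legal_form(opening_hours, NAIVE))    # True
# The bounded pattern only matches "ag" as a standalone word.
print(mentions_legal_form(opening_hours, BOUNDED))  # False
print(mentions_legal_form(short_form, BOUNDED))     # True
print(mentions_legal_form(long_form, BOUNDED))      # True
```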

Also: L2 INFO checks (e.g. profession_chamber) no longer count toward
correctness-pct and no longer produce DSI-DETAIL findings. This is
consistent with the P8 model: INFO = "check yourself", not "fail".
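
A rough sketch of this scoring rule, assuming each L2 result is a plain dict with "id", "severity" and "passed" keys. That shape, the helper name score_l2, the check IDs other than profession_chamber, and the choice to report 100 when nothing is scored are all assumptions made for illustration, not the project's actual structures.

```python
from __future__ import annotations


def score_l2(results: list[dict]) -> tuple[int, list[str]]:
    """Return (correctness percentage, DSI-DETAIL codes) ignoring INFO checks."""
    # INFO checks are advisory: they neither enter the denominator
    # nor produce a fail finding.
    scored = [r for r in results if r["severity"] != "INFO"]
    passed = sum(1 for r in scored if r["passed"])
    # Assumption: with nothing to score, report 100 instead of dividing by zero.
    pct = round(100 * passed / len(scored)) if scored else 100
    codes = [f"DSI-DETAIL-{r['id'].upper()}" for r in scored if not r["passed"]]
    return pct, codes


results = [
    {"id": "profession_chamber", "severity": "INFO", "passed": False},
    {"id": "register_number", "severity": "MEDIUM", "passed": True},  # hypothetical ID
    {"id": "vat_id", "severity": "LOW", "passed": False},             # hypothetical ID
]

pct, codes = score_l2(results)
print(pct, codes)  # 50 ['DSI-DETAIL-VAT_ID']
```

The unmatched INFO check changes neither the percentage nor the list of findings, which is the behavior the paragraph above describes.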

Verified against the BMW Impressum text; all 7 cases are classified correctly:
  name=passed, representative_person=passed, profession_chamber=INFO,
  illegal_disclaimer=passed (no disclaimer in the text),
  dispute_resolution=skipped (no_direct_sales),
  editorial_visdp=INFO, share_capital=INFO.
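
The skip and invert outcomes in this list can be reproduced in simplified form. The sketch below uses hypothetical patterns and sample texts (neither the BMW text nor the project's real patterns) and collapses the result of one L1 check to a single status string; classify_l1 is an illustrative helper, not a function from the codebase.

```python
from __future__ import annotations

import re


def classify_l1(check: dict, text: str, profile: dict | None = None) -> str:
    """Return 'skipped', 'passed' or 'failed' for one L1 check (simplified)."""
    # Profile-based skip, mirroring the no_direct_sales rule from this commit.
    if (profile or {}).get("no_direct_sales") and check["id"] == "dispute_resolution":
        return "skipped"
    found = any(re.search(p, text.lower()) for p in check["patterns"])
    # invert=True: the pattern describes something that must NOT appear.
    passed = (not found) if check.get("invert") else found
    return "passed" if passed else "failed"


# Hypothetical check definitions; the pattern strings are assumptions.
ILLEGAL_DISCLAIMER = {
    "id": "illegal_disclaimer",
    "patterns": [r"keine haftung für (externe )?links"],
    "invert": True,
}
DISPUTE_RESOLUTION = {
    "id": "dispute_resolution",
    "patterns": [r"ec\.europa\.eu/consumers/odr"],
}

clean_text = "Muster AG, Vorstand (Erika Mustermann, Max Mustermann)"
bad_text = clean_text + " Wir übernehmen keine Haftung für externe Links."
oem_profile = {"no_direct_sales": True}

print(classify_l1(ILLEGAL_DISCLAIMER, clean_text, oem_profile))  # passed
print(classify_l1(ILLEGAL_DISCLAIMER, bad_text, oem_profile))    # failed
print(classify_l1(DISPUTE_RESOLUTION, clean_text, oem_profile))  # skipped
print(classify_l1(DISPUTE_RESOLUTION, clean_text))               # failed
```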

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-19 00:52:03 +02:00
parent 575644c9c5
commit 0d37822b7c
3 changed files with 62 additions and 13 deletions
@@ -5,6 +5,8 @@ Pass 1: Run all L1 checks ("Is it mentioned?")
 Pass 2: Run L2 checks only where their L1 parent passed ("Is it correct?")
 """

+from __future__ import annotations
+
 import logging
 import re
@@ -83,6 +85,7 @@ def check_document_completeness(
     doc_type: str,
     doc_title: str,
     doc_url: str,
+    business_profile: dict | None = None,
 ) -> list[dict]:
     """Check a legal document against its type-specific requirements.
@@ -90,9 +93,20 @@ def check_document_completeness(
     L1 — Is the mandatory field mentioned at all?
     L2 — Is it correct/complete? (only checked if L1 parent passed)

+    business_profile (optional) is used to mark checks that do not apply
+    to the specific company as 'skipped' (e.g. OS-Plattform/VSBG when
+    no_direct_sales=True).
+
     Returns a list of findings (summary + missing items).
     """
     findings = []
+    no_direct_sales = bool((business_profile or {}).get("no_direct_sales"))
+    # P9: check IDs that are obsolete for the OEM configurator pattern.
+    skip_check_ids: set[str] = set()
+    if no_direct_sales:
+        skip_check_ids.update([
+            "dispute_resolution",  # OS-Plattform / VSBG: B2C direct sellers only
+        ])
     # Strip soft hyphens (&shy; / \xad) that CMS tools insert for word-breaking
     # — they break regex matches on compound words like "Datenübertragbarkeit"
     text_clean = text.replace("\xad", "").replace("&shy;", "")
@@ -135,8 +149,25 @@ def check_document_completeness(
     for check in l1_checks:
         is_info = check.get("severity") == "INFO"
+        # P9: profile-based skip (OEM pattern -> OS-Plattform check dropped)
+        if check["id"] in skip_check_ids:
+            all_checks.append({
+                "id": check["id"], "label": check["label"],
+                "passed": False, "severity": "INFO",
+                "matched_text": "", "level": 1, "parent": None,
+                "skipped": True,
+                "hint": "Nicht anwendbar: Unternehmen betreibt keinen "
+                        "Direkt-Vertrieb an Verbraucher (OEM-Konfigurator-Pattern).",
+            })
+            continue
         match = _match_patterns(check["patterns"], text_lower)
-        passed = match is not None
+        # P9: "invert"=True marks an anti-pattern (e.g. an illegal link
+        # disclaimer): passed when NOT found, fail when found.
+        if check.get("invert"):
+            passed = match is None
+            match = None if passed else match
+        else:
+            passed = match is not None
         if passed:
             passed_l1_ids.add(check["id"])
             if not is_info:
@@ -168,18 +199,26 @@ def check_document_completeness(
     for check in l2_checks:
         parent = check.get("parent")
+        is_info = check.get("severity") == "INFO"
         skipped = parent not in passed_l1_ids
         passed = False
         matched_text = ""
         if not skipped:
-            l2_total += 1
             match = _match_patterns(check["patterns"], text_lower)
             passed = match is not None
-            if passed:
+            # P9: INFO L2 checks (conditional, e.g. chamber) do NOT count
+            # toward correctness-pct and do not appear as fail findings.
+            if is_info:
+                if passed:
+                    matched_text = _extract_context(text_lower, match)
+                # neither l2_total++ nor findings.append: no fail entry
+            else:
+                l2_total += 1
+            if passed and not is_info:
                 l2_passed += 1
                 matched_text = _extract_context(text_lower, match)
-            else:
+            elif not passed and not is_info:
                 findings.append({
                     "code": f"DSI-DETAIL-{check['id'].upper()}",
                     "severity": check.get("severity", "MEDIUM"),