5c5d676f01
Three related mechanisms for DSE provability + URL hygiene (DSE = Datenschutzerklärung, the privacy policy).
Plan B + PDF: version-provability MCs (dse_checks.py):
- mc-dse_version_date (HIGH): a visible "Stand"/version date is
  mandatory. 12 regex patterns: "Stand: April 2024", ISO date,
  "Letzte Aktualisierung", "Version 3.2", English
  variants ("Last updated", "Effective date as of …").
  Norm: Art. 7(1) GDPR (demonstrability of consent).
- mc-dse_version_proof (MED): PDF download or
  versioned archive URL. A pure HTML DSE without a snapshot is
  legally fragile. 8 patterns: .pdf, download hint,
  web.archive.org, /dse-vNNN.html.
  Norm: DSK-Orientierungshilfe 2024.
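As an illustration of the mc-dse_version_date check, here is a minimal sketch of pattern-based date detection. The six patterns and the name `has_version_date` are assumptions made for this example; the actual 12 patterns live in dse_checks.py and are not shown in this commit view.

```python
import re

# Hypothetical subset of patterns, modelled on the examples named in the
# commit message. The real list in dse_checks.py may differ.
_MONTHS_DE = (
    r"(?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)"
)
VERSION_DATE_PATTERNS = [
    # "Stand: April 2024" or "Stand: 1. April 2024"
    re.compile(rf"\bStand\s*:?\s*(?:\d{{1,2}}\.\s*)?{_MONTHS_DE}\s+\d{{4}}", re.I),
    # ISO date such as 2024-04-01
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    re.compile(r"\bLetzte\s+Aktualisierung\b", re.I),
    # "Version 3.2"
    re.compile(r"\bVersion\s+\d+(?:\.\d+)*\b", re.I),
    re.compile(r"\bLast\s+updated\b", re.I),
    re.compile(r"\bEffective\s+date\b", re.I),
]


def has_version_date(text: str) -> bool:
    """Return True if any version/date marker appears in the policy text."""
    return any(p.search(text) for p in VERSION_DATE_PATTERNS)
```

A check built this way only proves that a marker is visible, not that the date is correct, which is presumably why the second MC asks for a PDF or archive snapshot as well.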
Plan A: legacy URL discovery (legacy_url_discovery.py + B20):
Four complementary sources:
A.1 Parse /sitemap.xml + sub-sitemaps, filter for
    compliance-relevant slugs
A.2 archive.org/wayback/available per slug: if Wayback
    shows a snapshot ≥18 months old AND the page still
    returns 200 today AND it is not in the footer → legacy suspicion
A.3 Slug permutations: 6 doc_types × 6 slug variants ×
    5 language prefixes × 4 brand parameters
A.4 Banner modal links (via the consent-tester stage 4 tour)
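The permutation source (A.3) can be sketched as a Cartesian product. The function name `permute_urls`, the example host, and the sample slugs, prefixes, and brand parameters below are assumptions for illustration; the real lists in legacy_url_discovery.py are not shown here. With the full grid from the commit message, the upper bound is 6 × 6 × 5 × 4 = 720 probe URLs per site.

```python
from itertools import product


def permute_urls(base_url, slugs_by_doc_type, lang_prefixes, brand_params):
    """Return (doc_type, url) pairs for every slug x language x brand combination."""
    base = base_url.rstrip("/")
    seen = set()
    out = []
    for doc_type, slugs in slugs_by_doc_type.items():
        for slug, lang, brand in product(slugs, lang_prefixes, brand_params):
            url = f"{base}{lang}/{slug}{brand}"
            if url not in seen:  # de-duplicate while keeping order
                seen.add(url)
                out.append((doc_type, url))
    return out


# Hypothetical, much smaller grid than the 6 x 6 x 5 x 4 one in the commit.
candidates = permute_urls(
    "https://example.com/",
    {"dse": ["datenschutz", "datenschutzerklaerung"], "impressum": ["impressum"]},
    ["", "/de", "/en"],           # language prefixes
    ["", "?brand=elli"],          # brand parameters
)
```

Here three slugs, three prefixes, and two brand parameters yield 18 candidate URLs, each of which would then be probed for its HTTP status.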
Mail block "🗂️ Legacy-URL-Inventar" with a table: URL · HTTP ·
Wayback age · Footer · Recommendation (301/offline/keep).
The engine does NOT decide what counts as legacy; it presents the
inventory and the customer chooses.
Real-world smoke test, Elli:
  /en/cookies → HTTP 200, Wayback snapshot 69 months old, not in footer
    → "Legacy-Verdacht, 301 setzen" (legacy suspicion, set a 301)
  /en/impressum → HTTP 302, redirected → "behalten" (keep)
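The A.2 rule and the smoke-test result can be reproduced with a small sketch. The helper names and the example dates are assumptions; only the three conditions (snapshot at least 18 months old, HTTP 200 today, not linked in the footer) come from the commit message. The timestamp format is the 14-digit `YYYYMMDDhhmmss` form used by the Wayback Machine.

```python
from datetime import date
from typing import Optional

LEGACY_MIN_AGE_MONTHS = 18  # threshold named in the commit message


def wayback_age_months(timestamp: str, today: date) -> int:
    """Whole months between a Wayback 'YYYYMMDDhhmmss' timestamp and today."""
    snap = date(int(timestamp[0:4]), int(timestamp[4:6]), int(timestamp[6:8]))
    months = (today.year - snap.year) * 12 + (today.month - snap.month)
    if today.day < snap.day:
        months -= 1  # the current month is not complete yet
    return max(months, 0)


def is_legacy_suspect(
    status: Optional[int], age_months: Optional[int], in_footer: bool
) -> bool:
    """Apply the three A.2 conditions; a missing snapshot never qualifies."""
    return (
        status == 200
        and age_months is not None
        and age_months >= LEGACY_MIN_AGE_MONTHS
        and not in_footer
    )


# Hypothetical dates chosen so that the age matches the 69 months above.
age = wayback_age_months("20200115000000", date(2025, 10, 15))
cookies_suspect = is_legacy_suspect(200, age, in_footer=False)
impressum_suspect = is_legacy_suspect(302, None, in_footer=False)
```

Keeping the predicate separate from the recommendation text fits the stated design: the engine only flags a suspicion, and the customer decides what happens to the URL.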
Plan C: multi-version DSE analysis (multi_version_dse.py):
If ≥2 DSE URLs are reachable: per variant, extract the DSB name
(DSB = Datenschutzbeauftragter, data protection officer) + date +
word count + SHA-256, and flag inconsistencies
(date_divergent, dsb_divergent, no_date_count).
Mail block "📑 Mehrere DSE-Versionen erkannt" with a
comparison table + red notice "Nur eine Version kann
gültig sein" (only one version can be valid). Example Elli:
/de/datenschutz (Mollstr-DSB, 2022) vs
/de/datenschutzerklaerung?brand=elli (Proliance, no date).
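A minimal sketch of the comparison step, assuming each fetched page has already been reduced to a dict with `url`, `text`, `dsb`, and `date`. The function name `compare_dse_versions`, the input shape, and the exact flag semantics are assumptions: here a flag is set only when two or more distinct non-empty values exist, and missing dates are counted separately. The placeholder texts are invented.

```python
from __future__ import annotations

import hashlib


def compare_dse_versions(pages: list[dict]) -> dict:
    """Fingerprint each DSE variant and flag divergent dates or DSB names."""
    versions = []
    for page in pages:
        text = page["text"]
        versions.append({
            "url": page["url"],
            "dsb": page.get("dsb"),
            "date": page.get("date"),
            "words": len(text.split()),
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    dates = {v["date"] for v in versions if v["date"]}
    dsbs = {v["dsb"] for v in versions if v["dsb"]}
    return {
        "versions": versions,
        "date_divergent": len(dates) > 1,   # assumption: ignores missing dates
        "dsb_divergent": len(dsbs) > 1,
        "no_date_count": sum(1 for v in versions if not v["date"]),
    }


# The Elli example from the commit message, with placeholder page texts.
info = compare_dse_versions([
    {"url": "/de/datenschutz", "text": "placeholder text A",
     "dsb": "Mollstr-DSB", "date": "2022"},
    {"url": "/de/datenschutzerklaerung?brand=elli", "text": "placeholder text B two",
     "dsb": "Proliance", "date": None},
])
```

Under these assumptions the Elli pair is flagged through `dsb_divergent` and `no_date_count`, while `date_divergent` stays false because only one variant carries a date.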
API response extended with legacy_url_inventory +
html_blocks.legacy_urls + multi_version_dse_html in the V2 layout.
ENV override: LEGACY_URL_DISABLED=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
122 lines
4.7 KiB
Python
"""B20 wiring — Legacy-URL-Discovery + Mail-Block."""

from __future__ import annotations

import html
import logging
import os

from compliance.services.legacy_url_discovery import discover_legacy_urls
from compliance.services.multi_version_dse import (
    analyze_multiple_dse_versions, render_multi_version_block,
)

logger = logging.getLogger(__name__)


# Kill switch, read once at import time: LEGACY_URL_DISABLED=1 (or
# "true"/"yes") skips B20 entirely.
_DISABLED = os.environ.get("LEGACY_URL_DISABLED", "").lower() in (
    "1", "true", "yes",
)


async def run_b20(state: dict) -> None:
    """Run legacy URL discovery (Plan A) and multi-version DSE analysis (Plan C).

    Results are written into ``state``; failures are logged, never raised.
    """
    if _DISABLED:
        return
    try:
        result = await discover_legacy_urls(state)
    except Exception as e:
        logger.warning("legacy-url-discovery failed: %s", e)
        return
    candidates = result.get("candidates") or []
    state["legacy_url_inventory"] = result
    if candidates:
        state["legacy_url_html"] = _render(result)
    logger.info(
        "B20 legacy-url: %d candidates of %d probed",
        len(candidates), result.get("probed", 0),
    )

    # Plan C: multi-version DSE analysis. If legacy discovery yields
    # additional DSE URLs AND at least 2 are reachable, analyse them in
    # parallel and build the comparison block.
    try:
        mv_info = await analyze_multiple_dse_versions(state)
        if mv_info.get("versions") and len(mv_info["versions"]) >= 2:
            state["multi_version_dse_info"] = mv_info
            state["multi_version_dse_html"] = render_multi_version_block(
                mv_info,
            )
            logger.info(
                "B20-C multi-version-dse: %d versions, date_div=%s dsb_div=%s",
                len(mv_info["versions"]),
                mv_info.get("date_divergent"),
                mv_info.get("dsb_divergent"),
            )
    except Exception as e:
        logger.warning("multi-version-dse analysis failed: %s", e)


def _render(result: dict) -> str:
    """Render the legacy URL inventory as an HTML mail block (max. 25 rows)."""
    candidates = result.get("candidates") or []
    if not candidates:
        return ""
    rows = []
    for c in candidates[:25]:
        st = c["status"]
        # Red for legacy suspicion, amber for 404/410, grey otherwise.
        sev_color = (
            "#dc2626" if "Legacy-Verdacht" in (c.get("recommendation") or "")
            else "#f59e0b" if st in (404, 410) else "#64748b"
        )
        age = c.get("age_months")
        age_disp = f"{age} Mo." if age is not None else "—"
        rec = c.get("recommendation") or "—"
        rows.append(
            f"<tr>"
            f"<td style='padding:5px 8px;font-family:monospace;color:#475569;"
            f"font-size:11px;max-width:380px;word-break:break-all;'>"
            f"<a href='{html.escape(c['url'])}' "
            f"style='color:{sev_color};'>{html.escape(c['url'][:120])}</a>"
            f"</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"<strong style='color:{sev_color};'>{st or '?'}</strong></td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{age_disp}</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{'✓' if c.get('in_footer') else '—'}</td>"
            f"<td style='padding:5px 8px;font-size:11px;color:#475569;'>"
            f"{html.escape(rec)}</td>"
            f"</tr>"
        )
    # Note for candidates beyond the first 25, which appear only in the CSV.
    rest = ""
    if len(candidates) > 25:
        rest = (
            f"<p style='font-size:12px;color:#64748b;margin-top:6px;'>"
            f"<em>… und {len(candidates)-25} weitere — vollständig in "
            f"<code>legacy-urls.csv</code> im ZIP-Anhang.</em></p>"
        )
    return (
        "<div style='margin:24px 0;padding:16px;border-left:4px solid #0f766e;"
        "background:#f0fdfa;border-radius:4px;'>"
        "<h2 style='margin:0 0 8px;color:#134e4a;font-size:16px;'>"
        f"🗂️ Legacy-URL-Inventar ({len(candidates)} Kandidaten von "
        f"{result.get('probed', '?')} geprüft)"
        "</h2>"
        "<p style='margin:0 0 8px;font-size:12px;color:#475569;'>"
        "Quellen: /sitemap.xml + Wayback-Machine + Slug-Permutations. "
        "Wir <strong>entscheiden nicht</strong> ob eine URL Legacy ist — "
        "wir präsentieren das Inventar mit Status und Empfehlung. Der "
        "Kunde entscheidet."
        "</p>"
        "<table style='font-size:11px;width:100%;border-collapse:collapse;"
        "background:#fff;border-radius:4px;'>"
        "<thead><tr style='background:#ccfbf1;'>"
        "<th style='padding:6px 8px;text-align:left;'>URL</th>"
        "<th style='padding:6px 8px;'>HTTP</th>"
        "<th style='padding:6px 8px;'>Wayback-Alter</th>"
        "<th style='padding:6px 8px;'>Footer</th>"
        "<th style='padding:6px 8px;text-align:left;'>Empfehlung</th>"
        "</tr></thead><tbody>" + "".join(rows) + "</tbody></table>"
        + rest +
        "</div>"
    )