breakpilot-compliance/backend-compliance/compliance/api/agent_check/_b20_wiring.py
Benjamin Admin 5c5d676f01
feat: Plan B + A + C: DSE version MCs + legacy URL + multi-version
Three related mechanisms for provable DSE (Datenschutzerklärung, privacy
policy) versions and URL hygiene.

Plan B + PDF: version-provability MCs (dse_checks.py):
  - mc-dse_version_date (HIGH): a visible "Stand" (as-of) or version
    date is mandatory. 12 regex patterns: "Stand: April 2024", ISO date,
    "Letzte Aktualisierung", "Version 3.2", English
    variants ("Last updated", "Effective date as of …").
    Legal basis: Art. 7(1) GDPR (demonstrability of consent).
  - mc-dse_version_proof (MED): PDF download or
    versioned archive URL. An HTML-only DSE without a snapshot is
    legally fragile. 8 patterns: .pdf, download hint,
    web.archive.org, /dse-vNNN.html.
    Reference: DSK-Orientierungshilfe 2024.
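
A minimal sketch of how a check like mc-dse_version_date could detect a visible version date. The pattern list below is a hypothetical subset built only from the examples named above; the 12 patterns in dse_checks.py are not shown in this commit message, and the names VERSION_DATE_PATTERNS and has_version_date are assumptions.

```python
import re

# Assumption: illustrative subset only, not the 12 patterns from dse_checks.py.
_MONTHS_DE = (
    "Januar|Februar|März|April|Mai|Juni|Juli|August|"
    "September|Oktober|November|Dezember"
)

VERSION_DATE_PATTERNS = [
    # "Stand: April 2024"
    re.compile(rf"\bStand:?\s+(?:{_MONTHS_DE})\s+\d{{4}}", re.IGNORECASE),
    # ISO date such as 2024-04-01
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # "Letzte Aktualisierung"
    re.compile(r"\bLetzte\s+Aktualisierung\b", re.IGNORECASE),
    # "Version 3.2"
    re.compile(r"\bVersion\s+\d+(?:\.\d+)+", re.IGNORECASE),
    # English variants
    re.compile(r"\bLast\s+updated\b", re.IGNORECASE),
    re.compile(r"\bEffective\s+date\b", re.IGNORECASE),
]


def has_version_date(text: str) -> bool:
    """Return True if any pattern finds a version or as-of marker in text."""
    return any(p.search(text) for p in VERSION_DATE_PATTERNS)
```

A check built this way would raise a HIGH finding whenever has_version_date returns False for the extracted DSE text.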

Plan A: legacy URL discovery (legacy_url_discovery.py + B20):
  Four complementary sources:
    A.1 Parse /sitemap.xml + sub-sitemaps, filter for
        compliance-relevant slugs
    A.2 archive.org/wayback/available per slug: if Wayback shows a
        snapshot ≥18 months old AND the page still returns 200 today
        AND it is not linked in the footer → legacy suspicion
    A.3 Slug permutations: 6 doc_types × 6 slug variants ×
        5 language prefixes × 4 brand parameters
    A.4 Banner modal links (via consent-tester stage 4 tour)
  Mail block "🗂️ Legacy-URL-Inventar" with table: URL · HTTP ·
  Wayback age · Footer · recommendation (301/offline/keep).
  The engine does NOT decide what counts as legacy: it presents the
  inventory, the customer chooses.
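
The A.3 permutation step can be pictured as a Cartesian product over slug variants, language prefixes and brand parameters. The sketch below uses a reduced, hypothetical table (2 doc types instead of 6, and so on); the slugs and the ?brand=elli parameter are taken from the examples in this message, and everything else, including the name candidate_paths, is an assumption.

```python
from itertools import product

# Assumption: reduced illustrative tables, not the real ones from
# legacy_url_discovery.py (which uses 6 doc_types x 6 slugs x 5 langs x 4 brands).
SLUGS = {
    "dse": ["datenschutz", "datenschutzerklaerung"],
    "impressum": ["impressum", "imprint"],
}
LANG_PREFIXES = ["", "/de", "/en"]
BRAND_PARAMS = ["", "?brand=elli"]


def candidate_paths(slugs=SLUGS, langs=LANG_PREFIXES, brands=BRAND_PARAMS):
    """Build (doc_type, path) pairs for every slug/lang/brand combination."""
    paths = []
    for doc_type, variants in slugs.items():
        for slug, lang, brand in product(variants, langs, brands):
            paths.append((doc_type, f"{lang}/{slug}{brand}"))
    return paths
```

With the full dimensions named above, the product yields 6 × 6 × 5 × 4 = 720 paths per site, so a real implementation would probably need to rate-limit or parallelise the probes.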

  Real-world smoke test, Elli:
    /en/cookies → HTTP 200, Wayback snapshot 69 months old, not in footer
                  → "Legacy-Verdacht, 301 setzen" (legacy suspicion, set a 301)
    /en/impressum → HTTP 302, redirected → "behalten" (keep)
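
The A.2 rule and the two smoke-test outcomes can be condensed into a small decision function. This is a sketch under stated assumptions: the function name recommend, its signature, the redirect status set and the empty-string fallback are hypothetical; only the 18-month threshold and the two recommendation strings come from this message.

```python
from typing import Optional

LEGACY_MIN_AGE_MONTHS = 18  # threshold named in A.2


def recommend(status: Optional[int], age_months: Optional[int],
              in_footer: bool) -> str:
    """Map a probed URL to a recommendation string (hypothetical helper)."""
    # Redirecting URLs are already handled by the site: keep them.
    if status in (301, 302, 307, 308):
        return "behalten"
    # Old Wayback snapshot + still live + not linked in footer.
    if (
        status == 200
        and not in_footer
        and age_months is not None
        and age_months >= LEGACY_MIN_AGE_MONTHS
    ):
        return "Legacy-Verdacht, 301 setzen"
    # Assumption: all other cases are outside the scope of this sketch.
    return ""
```

Because the result is only a recommendation string, the final decision stays with the customer, in line with the design stated above.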

Plan C: multi-version DSE analysis (multi_version_dse.py):
  If ≥2 DSE URLs are reachable: extract DSB name (Datenschutzbeauftragter,
  data protection officer) + date + word count + SHA-256 per variant
  and flag inconsistencies
  (date_divergent, dsb_divergent, no_date_count).
  Mail block "📑 Mehrere DSE-Versionen erkannt" (multiple DSE versions
  detected) with comparison table + red notice "Nur eine Version kann
  gültig sein" (only one version can be valid). Example Elli:
  /de/datenschutz (Mollstr-DSB, 2022) vs
  /de/datenschutzerklaerung?brand=elli (Proliance, no date).
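
A rough sketch of the comparison step, assuming the DSB name and date have already been extracted per variant. The helper names summarize_version and compare_versions are hypothetical; only the flag names date_divergent, dsb_divergent and no_date_count come from this message, and how multi_version_dse.py treats a missing date is an assumption here (it is counted in no_date_count, not as a divergence).

```python
import hashlib


def summarize_version(url, text, date=None, dsb=None) -> dict:
    """Collect the per-variant facts: DSB, date, word count, SHA-256."""
    return {
        "url": url,
        "date": date,
        "dsb": dsb,
        "word_count": len(text.split()),
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


def compare_versions(versions: list) -> dict:
    """Flag inconsistencies across two or more DSE variants."""
    dates = {v["date"] for v in versions if v["date"]}
    dsbs = {v["dsb"] for v in versions if v["dsb"]}
    return {
        "versions": versions,
        "date_divergent": len(dates) > 1,   # two different stated dates
        "dsb_divergent": len(dsbs) > 1,     # two different DSB names
        "no_date_count": sum(1 for v in versions if not v["date"]),
    }
```

Applied to the Elli example, this would report dsb_divergent as true and no_date_count as 1, which is enough to trigger the comparison table.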

API response extended with legacy_url_inventory +
html_blocks.legacy_urls + multi_version_dse_html in the V2 layout.

ENV override: LEGACY_URL_DISABLED=1 (also "true" or "yes") skips B20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 10:04:14 +02:00

122 lines
4.7 KiB
Python

"""B20 wiring — Legacy-URL-Discovery + Mail-Block."""
from __future__ import annotations

import html
import logging
import os

from compliance.services.legacy_url_discovery import discover_legacy_urls
from compliance.services.multi_version_dse import (
    analyze_multiple_dse_versions,
    render_multi_version_block,
)

logger = logging.getLogger(__name__)

# Kill switch: LEGACY_URL_DISABLED=1 (or "true"/"yes") skips B20 entirely.
_DISABLED = os.environ.get("LEGACY_URL_DISABLED", "").lower() in (
    "1", "true", "yes",
)


async def run_b20(state: dict) -> None:
    """Run legacy-URL discovery and store inventory + HTML blocks in state."""
    if _DISABLED:
        return
    try:
        result = await discover_legacy_urls(state)
    except Exception as e:
        # Discovery is best-effort: log and leave the state untouched.
        logger.warning("legacy-url-discovery failed: %s", e)
        return
    candidates = result.get("candidates") or []
    state["legacy_url_inventory"] = result
    if candidates:
        state["legacy_url_html"] = _render(result)
        logger.info(
            "B20 legacy-url: %d candidates of %d probed",
            len(candidates), result.get("probed", 0),
        )
    # Plan C, multi-version DSE analysis: if legacy discovery yields
    # additional DSE URLs AND at least 2 are reachable, analyse them and
    # build the comparison block.
    try:
        mv_info = await analyze_multiple_dse_versions(state)
        if mv_info.get("versions") and len(mv_info["versions"]) >= 2:
            state["multi_version_dse_info"] = mv_info
            state["multi_version_dse_html"] = render_multi_version_block(
                mv_info,
            )
            logger.info(
                "B20-C multi-version-dse: %d versions, date_div=%s dsb_div=%s",
                len(mv_info["versions"]),
                mv_info.get("date_divergent"),
                mv_info.get("dsb_divergent"),
            )
    except Exception as e:
        logger.warning("multi-version-dse analysis failed: %s", e)


def _render(result: dict) -> str:
    """Render the legacy-URL inventory as an HTML mail block."""
    candidates = result.get("candidates") or []
    if not candidates:
        return ""
    rows = []
    # The table is capped at 25 rows; the remainder is summarised below it.
    for c in candidates[:25]:
        st = c["status"]
        # Red: legacy suspicion; amber: gone (404/410); grey: everything else.
        sev_color = (
            "#dc2626" if "Legacy-Verdacht" in (c.get("recommendation") or "")
            else "#f59e0b" if st in (404, 410) else "#64748b"
        )
        age = c.get("age_months")
        age_disp = f"{age} Mo." if age is not None else ""
        rec = c.get("recommendation") or ""
        # Assumption: the source shows an empty string for both branches of
        # the footer marker; "ja"/"nein" are placeholder values.
        footer_disp = "ja" if c.get("in_footer") else "nein"
        rows.append(
            f"<tr>"
            f"<td style='padding:5px 8px;font-family:monospace;color:#475569;"
            f"font-size:11px;max-width:380px;word-break:break-all;'>"
            f"<a href='{html.escape(c['url'])}' "
            f"style='color:{sev_color};'>{html.escape(c['url'][:120])}</a>"
            f"</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"<strong style='color:{sev_color};'>{st or '?'}</strong></td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{age_disp}</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{footer_disp}</td>"
            f"<td style='padding:5px 8px;font-size:11px;color:#475569;'>"
            f"{html.escape(rec)}</td>"
            f"</tr>"
        )
    rest = ""
    if len(candidates) > 25:
        rest = (
            f"<p style='font-size:12px;color:#64748b;margin-top:6px;'>"
            f"<em>… und {len(candidates)-25} weitere — vollständig in "
            f"<code>legacy-urls.csv</code> im ZIP-Anhang.</em></p>"
        )
    return (
        "<div style='margin:24px 0;padding:16px;border-left:4px solid #0f766e;"
        "background:#f0fdfa;border-radius:4px;'>"
        "<h2 style='margin:0 0 8px;color:#134e4a;font-size:16px;'>"
        f"🗂️ Legacy-URL-Inventar ({len(candidates)} Kandidaten von "
        f"{result.get('probed', '?')} geprüft)"
        "</h2>"
        "<p style='margin:0 0 8px;font-size:12px;color:#475569;'>"
        "Quellen: /sitemap.xml + Wayback-Machine + Slug-Permutations. "
        "Wir <strong>entscheiden nicht</strong> ob eine URL Legacy ist — "
        "wir präsentieren das Inventar mit Status und Empfehlung. Der "
        "Kunde entscheidet."
        "</p>"
        "<table style='font-size:11px;width:100%;border-collapse:collapse;"
        "background:#fff;border-radius:4px;'>"
        "<thead><tr style='background:#ccfbf1;'>"
        "<th style='padding:6px 8px;text-align:left;'>URL</th>"
        "<th style='padding:6px 8px;'>HTTP</th>"
        "<th style='padding:6px 8px;'>Wayback-Alter</th>"
        "<th style='padding:6px 8px;'>Footer</th>"
        "<th style='padding:6px 8px;text-align:left;'>Empfehlung</th>"
        "</tr></thead><tbody>" + "".join(rows) + "</tbody></table>"
        + rest +
        "</div>"
    )