feat: Plan B + A + C — DSE-Versions-MCs + Legacy-URL + Multi-Version
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / loc-budget (push) Failing after 11s
CI / python-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 28s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 10s
CI / go-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
Three related mechanisms for DSE (Datenschutzerklärung, i.e. privacy policy) provability and URL hygiene.
Plan B + PDF: version-provability MCs (dse_checks.py):
- mc-dse_version_date (HIGH): a visible "Stand"/version date is
  mandatory. 12 regex patterns: "Stand: April 2024", ISO date,
  "Letzte Aktualisierung", "Version 3.2", English
  variants ("Last updated", "Effective date as of …").
  Legal basis: Art. 7(1) GDPR (demonstrability of consent).
- mc-dse_version_proof (MED): PDF download or
  versioned archive URL. An HTML-only DSE without a snapshot is
  legally fragile. 8 patterns: .pdf, download notice,
  web.archive.org, /dse-vNNN.html.
  Basis: DSK-Orientierungshilfe 2024 (DSK guidance paper).
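To illustrate the kind of check mc-dse_version_date describes, here is a minimal sketch. The pattern list is a hypothetical subset written for this example; the actual 12 patterns in dse_checks.py are not shown in this commit, and has_version_date is an assumed name.

```python
import re

# Hypothetical subset of date/version patterns (the real list has 12 entries
# and lives in dse_checks.py, which this commit message does not reproduce).
VERSION_DATE_PATTERNS = [
    re.compile(r"\bStand\s*:\s*\w+\s+\d{4}", re.IGNORECASE),   # "Stand: April 2024"
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                      # ISO date
    re.compile(r"\bLetzte\s+Aktualisierung\b", re.IGNORECASE),
    re.compile(r"\bVersion\s+\d+(?:\.\d+)*", re.IGNORECASE),   # "Version 3.2"
    re.compile(r"\bLast\s+updated\b", re.IGNORECASE),
    re.compile(r"\bEffective\s+date\b", re.IGNORECASE),
]


def has_version_date(text: str) -> bool:
    """Return True if any pattern finds a visible version or date marker."""
    return any(p.search(text) for p in VERSION_DATE_PATTERNS)
```

A page containing "Stand: April 2024" or "Version 3.2" would pass this sketch; a page with no date or version marker would fail it.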
Plan A: legacy URL discovery (legacy_url_discovery.py + B20):
Four complementary sources:
  A.1 Parse /sitemap.xml and sub-sitemaps, filter for
      compliance-relevant slugs
  A.2 archive.org/wayback/available per slug: if Wayback
      shows a snapshot ≥18 months old AND the page still
      returns 200 today AND it is not in the footer → legacy suspicion
  A.3 Slug permutations: 6 doc_types × 6 slug variants ×
      5 language prefixes × 4 brand parameters
  A.4 Banner modal links (via the consent-tester stage 4 tour)
Mail block "🗂️ Legacy-URL-Inventar" with a table: URL · HTTP ·
Wayback age · Footer · Recommendation (301/offline/keep).
The engine does NOT decide what counts as legacy; it presents the
inventory and the customer chooses.
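Taken at face value, the A.3 product yields 6 × 6 × 5 × 4 = 720 candidate paths per site. A minimal sketch of that enumeration follows; all four lists and the name candidate_paths are hypothetical placeholders, not the values used in legacy_url_discovery.py.

```python
from itertools import product

# Hypothetical placeholder values; only the list sizes (6, 6, 5, 4) come
# from the commit message.
DOC_TYPES = ["datenschutz", "impressum", "cookies", "agb", "widerruf", "barrierefreiheit"]
SLUG_TEMPLATES = ["{t}", "{t}.html", "{t}/", "legal/{t}", "{t}-alt", "{t}-old"]
LANG_PREFIXES = ["", "/de", "/en", "/fr", "/nl"]
BRAND_PARAMS = ["", "?brand=a", "?brand=b", "?brand=c"]


def candidate_paths() -> list[str]:
    """Enumerate every doc type / slug / language / brand combination."""
    paths = []
    for t, tpl, lang, brand in product(
        DOC_TYPES, SLUG_TEMPLATES, LANG_PREFIXES, BRAND_PARAMS
    ):
        paths.append(f"{lang}/{tpl.format(t=t)}{brand}")
    return paths
```

With these placeholder lists the function returns 720 distinct paths, including "/en/cookies".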
Real-world smoke test, Elli:
  /en/cookies → HTTP 200, Wayback snapshot 69 months old, not in footer
      → "Legacy-Verdacht, 301 setzen" (legacy suspicion, set a 301)
  /en/impressum → HTTP 302, redirected → "behalten" (keep)
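The A.2 rule can be written as a small predicate. This is a sketch under assumptions: is_legacy_suspect and LEGACY_MIN_AGE_MONTHS are names invented here, and treating a missing Wayback age as "not suspect" is a choice made for the example, not something the commit states.

```python
from typing import Optional

LEGACY_MIN_AGE_MONTHS = 18  # threshold named in A.2


def is_legacy_suspect(status: int, age_months: Optional[int], in_footer: bool) -> bool:
    """Apply the three A.2 conditions: old snapshot, still 200, not in footer."""
    return (
        age_months is not None  # assumption: unknown age never triggers suspicion
        and age_months >= LEGACY_MIN_AGE_MONTHS
        and status == 200
        and not in_footer
    )


# Values from the Elli smoke test for /en/cookies.
print(is_legacy_suspect(status=200, age_months=69, in_footer=False))
```

For /en/cookies all three conditions hold, which matches the "Legacy-Verdacht" result above. A 302 response such as /en/impressum fails the status condition regardless of snapshot age.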
Plan C: multi-version DSE analysis (multi_version_dse.py):
When ≥2 DSE URLs are reachable: extract the DSB (data protection
officer) name, date, word count and SHA-256 per variant and flag
inconsistencies (date_divergent, dsb_divergent, no_date_count).
Mail block "📑 Mehrere DSE-Versionen erkannt" (multiple DSE versions
detected) with a comparison table and a red notice "Nur eine Version
kann gültig sein" (only one version can be valid). Example Elli:
/de/datenschutz (Mollstr DSB, 2022) vs
/de/datenschutzerklaerung?brand=elli (Proliance, no date).
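A minimal sketch of such a comparison, assuming the DSB name and date have already been extracted per variant. DseVersion and compare_versions are hypothetical names, and the exact semantics of the three flags (for example, that versions without a date do not count toward date_divergent) are assumptions; multi_version_dse.py may define them differently.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class DseVersion:
    url: str
    text: str
    dsb_name: Optional[str]
    date: Optional[str]


def compare_versions(versions: list[DseVersion]) -> dict:
    """Build per-version rows and flag divergent dates or DSB names."""
    rows = [
        {
            "url": v.url,
            "dsb_name": v.dsb_name,
            "date": v.date,
            "word_count": len(v.text.split()),
            "sha256": hashlib.sha256(v.text.encode("utf-8")).hexdigest(),
        }
        for v in versions
    ]
    # Assumption: only versions that carry a value take part in divergence checks.
    dates = {v.date for v in versions if v.date}
    dsbs = {v.dsb_name for v in versions if v.dsb_name}
    return {
        "versions": rows,
        "date_divergent": len(dates) > 1,
        "dsb_divergent": len(dsbs) > 1,
        "no_date_count": sum(1 for v in versions if not v.date),
    }
```

Fed with the two Elli variants above (one dated 2022, one undated, different DSB names), this sketch would report dsb_divergent as true and no_date_count as 1.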
API response extended with legacy_url_inventory +
html_blocks.legacy_urls + multi_version_dse_html in the V2 layout.
ENV override: LEGACY_URL_DISABLED=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,121 @@
"""B20 wiring — Legacy-URL-Discovery + Mail-Block."""

from __future__ import annotations

import html
import logging
import os

from compliance.services.legacy_url_discovery import discover_legacy_urls
from compliance.services.multi_version_dse import (
    analyze_multiple_dse_versions, render_multi_version_block,
)

logger = logging.getLogger(__name__)


_DISABLED = os.environ.get("LEGACY_URL_DISABLED", "").lower() in (
    "1", "true", "yes",
)


async def run_b20(state: dict) -> None:
    if _DISABLED:
        return
    try:
        result = await discover_legacy_urls(state)
    except Exception as e:
        logger.warning("legacy-url-discovery failed: %s", e)
        return
    candidates = result.get("candidates") or []
    state["legacy_url_inventory"] = result
    if candidates:
        state["legacy_url_html"] = _render(result)
    logger.info(
        "B20 legacy-url: %d candidates of %d probed",
        len(candidates), result.get("probed", 0),
    )

    # Plan C (multi-version DSE analysis): if legacy discovery yields
    # additional DSE URLs AND at least 2 are reachable, analyze them in
    # parallel and render a comparison block.
    try:
        mv_info = await analyze_multiple_dse_versions(state)
        if mv_info.get("versions") and len(mv_info["versions"]) >= 2:
            state["multi_version_dse_info"] = mv_info
            state["multi_version_dse_html"] = render_multi_version_block(
                mv_info,
            )
            logger.info(
                "B20-C multi-version-dse: %d versions, date_div=%s dsb_div=%s",
                len(mv_info["versions"]),
                mv_info.get("date_divergent"),
                mv_info.get("dsb_divergent"),
            )
    except Exception as e:
        logger.warning("multi-version-dse analysis failed: %s", e)


def _render(result: dict) -> str:
    candidates = result.get("candidates") or []
    if not candidates:
        return ""
    rows = []
    for c in candidates[:25]:
        st = c["status"]
        sev_color = (
            "#dc2626" if "Legacy-Verdacht" in (c.get("recommendation") or "")
            else "#f59e0b" if st in (404, 410) else "#64748b"
        )
        age = c.get("age_months")
        age_disp = f"{age} Mo." if age is not None else "—"
        rec = c.get("recommendation") or "—"
        rows.append(
            f"<tr>"
            f"<td style='padding:5px 8px;font-family:monospace;color:#475569;"
            f"font-size:11px;max-width:380px;word-break:break-all;'>"
            f"<a href='{html.escape(c['url'])}' "
            f"style='color:{sev_color};'>{html.escape(c['url'][:120])}</a>"
            f"</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"<strong style='color:{sev_color};'>{st or '?'}</strong></td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{age_disp}</td>"
            f"<td style='padding:5px 8px;font-size:11px;text-align:center;'>"
            f"{'✓' if c.get('in_footer') else '—'}</td>"
            f"<td style='padding:5px 8px;font-size:11px;color:#475569;'>"
            f"{html.escape(rec)}</td>"
            f"</tr>"
        )
    rest = ""
    if len(candidates) > 25:
        rest = (
            f"<p style='font-size:12px;color:#64748b;margin-top:6px;'>"
            f"<em>… und {len(candidates)-25} weitere — vollständig in "
            f"<code>legacy-urls.csv</code> im ZIP-Anhang.</em></p>"
        )
    return (
        "<div style='margin:24px 0;padding:16px;border-left:4px solid #0f766e;"
        "background:#f0fdfa;border-radius:4px;'>"
        "<h2 style='margin:0 0 8px;color:#134e4a;font-size:16px;'>"
        f"🗂️ Legacy-URL-Inventar ({len(candidates)} Kandidaten von "
        f"{result.get('probed', '?')} geprüft)"
        "</h2>"
        "<p style='margin:0 0 8px;font-size:12px;color:#475569;'>"
        "Quellen: /sitemap.xml + Wayback-Machine + Slug-Permutations. "
        "Wir <strong>entscheiden nicht</strong> ob eine URL Legacy ist — "
        "wir präsentieren das Inventar mit Status und Empfehlung. Der "
        "Kunde entscheidet."
        "</p>"
        "<table style='font-size:11px;width:100%;border-collapse:collapse;"
        "background:#fff;border-radius:4px;'>"
        "<thead><tr style='background:#ccfbf1;'>"
        "<th style='padding:6px 8px;text-align:left;'>URL</th>"
        "<th style='padding:6px 8px;'>HTTP</th>"
        "<th style='padding:6px 8px;'>Wayback-Alter</th>"
        "<th style='padding:6px 8px;'>Footer</th>"
        "<th style='padding:6px 8px;text-align:left;'>Empfehlung</th>"
        "</tr></thead><tbody>" + "".join(rows) + "</tbody></table>"
        + rest +
        "</div>"
    )
@@ -30,6 +30,7 @@ from ._b16_wiring import run_b16
from ._b17_wiring import run_b17
from ._b18_wiring import run_b18
from ._b19_wiring import run_b19
from ._b20_wiring import run_b20
from ._constants import _compliance_check_jobs
from ._phase_a_resolve import run_phase_a
from ._phase_b_profile_check import run_phase_b
@@ -94,6 +95,7 @@ async def run_compliance_check(check_id: str, req) -> None:
    await run_b17(state) # Audit-Walk-Video (evidence recording)
    await run_b18(state) # Impressum-Specialist-Agent (Pattern+LLM)
    run_b19(state) # Cookie-Coherence (Salesforce-as-essential)
    await run_b20(state) # Legacy-URL-Discovery (Sitemap+Wayback)
    # Phase D-3 top/mid/bot: Step 5 HTML blocks
    await run_phase_d3_top(state)
    await run_phase_d3_mid(state)
@@ -90,7 +90,9 @@ def run_phase_f(state: dict) -> None:
            "ai_act": state.get("ai_act_html", ""),
            "impressum_agent": state.get("impressum_agent_html", ""),
            "cookie_coherence": state.get("cookie_coherence_html", ""),
            "legacy_urls": state.get("legacy_url_html", ""),
        },
        "legacy_url_inventory": state.get("legacy_url_inventory") or None,
    }

    _compliance_check_jobs[check_id]["status"] = "completed"