feat(crawl): completeness via Shadow-DOM/hidden links + interaction fixpoint + Wayback-CDX orphans
CI / test-python-backend (push) Successful in 30s
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 12s
CI / loc-budget (push) Successful in 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
So that the specialist agents work on complete website content:
A: _find_dsi_links now pierces Shadow DOM (web components such as Usercentrics/Mercedes)
recursively; hidden (display:none) links are captured and flagged as
coverage metadata.
B: _expand_to_fixpoint expands accordions/tabs/hover menus in a loop until the
DOM is stable (instead of a single pass); extended selectors;
coverage telemetry (rounds, expanded elements, DOM growth, shadow/hidden
links) → response + backend log.
C: legacy_url_cdx.cdx_enumerate uses the Wayback CDX API to list ALL URLs
ever archived for the domain → finds orphan/legacy pages that were never in
the slug grid (e.g. a /datenschutz page that is no longer linked but is still
reachable by direct URL). Flows through the existing legacy URL inventory.
Tests: test_legacy_url_cdx.py (6) + consent-tester/tests/test_dsi_discovery.py
(pure helpers + real-browser integration). All green, LOC gate green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
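
Item B of the commit message describes expanding the page repeatedly until the DOM stops changing. The body of _expand_to_fixpoint is not part of this diff, so the following is only a minimal sketch of that loop under stated assumptions: `expand_to_fixpoint`, `FixpointStats`, `FakePage` and the `max_rounds` cap are hypothetical names, and the browser is replaced by two plain callables (one that returns the DOM node count, one that clicks whatever is still collapsed and returns how many elements it clicked).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FixpointStats:
    """Hypothetical coverage telemetry: rounds run, elements expanded, DOM growth."""
    rounds: int = 0
    expanded: int = 0
    dom_growth: int = 0


def expand_to_fixpoint(
    dom_size: Callable[[], int],
    expand_once: Callable[[], int],
    max_rounds: int = 5,
) -> FixpointStats:
    """Call expand_once() until nothing is clicked or the DOM stops growing."""
    stats = FixpointStats()
    start = size = dom_size()
    for _ in range(max_rounds):  # the cap guards against pages that never settle
        clicked = expand_once()
        new_size = dom_size()
        stats.rounds += 1
        stats.expanded += clicked
        stable = clicked == 0 or new_size == size
        size = new_size
        if stable:
            break
    stats.dom_growth = size - start
    return stats


class FakePage:
    """Stand-in for a browser page: each expansion reveals the next nested layer."""

    def __init__(self, layers: List[int]) -> None:
        self.hidden = list(layers)  # node counts that are still collapsed
        self.nodes = 10

    def size(self) -> int:
        return self.nodes

    def expand(self) -> int:
        if not self.hidden:
            return 0
        self.nodes += self.hidden.pop(0)
        return 1


if __name__ == "__main__":
    page = FakePage([5, 3])
    print(expand_to_fixpoint(page.size, page.expand))
```

With two nested layers the loop runs three rounds: two that reveal content and a final one that confirms nothing is left to expand. A single-pass implementation would have stopped after the first layer.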
@@ -29,6 +29,8 @@ from urllib.parse import urljoin, urlparse
 
 import httpx
 
+from compliance.services.legacy_url_cdx import cdx_enumerate
+
 logger = logging.getLogger(__name__)
 
 
@@ -239,13 +241,24 @@ async def discover_legacy_urls(state: dict) -> dict:
         return {"candidates": [], "skipped": "no_origin"}
 
     candidates: set[str] = set()
-    # A.1 Sitemap
+    # A.1 Sitemap + A.3 Slug-Permutations
     for o in list(origins)[:2]:
         sitemap_urls = await _fetch_sitemap_urls(o)
         candidates.update(_filter_legal_urls(sitemap_urls))
-        # A.3 Slug-Permutations
         candidates.update(_build_slug_candidates(o))
 
+    # A.5 Wayback CDX: every URL ever archived for the domain; catches
+    # orphans that were never in the slug grid. Pairs are (url, cdx_timestamp);
+    # the timestamp serves as the legacy age (no second Wayback call needed).
+    cdx_pairs: list[tuple[str, str]] = []
+    for o in list(origins)[:2]:
+        cdx_pairs.extend(await cdx_enumerate(o))
+    cdx_legal_urls = set(_filter_legal_urls([u for u, _ in cdx_pairs]))
+    cdx_legal = [
+        (u, ts) for (u, ts) in cdx_pairs
+        if u in cdx_legal_urls and u not in candidates
+    ][:100]
+
     # Cap to avoid explosion
     cands = list(candidates)[:60]
 
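
The hunk above consumes `cdx_enumerate(o)` as a list of `(url, timestamp)` pairs, but the module itself is not shown in this diff. As a rough illustration of what such an enumeration could look like, the sketch below builds a query against the public Wayback CDX endpoint and parses its JSON output (a list of rows whose first row is the field header). The helper names `build_cdx_query` and `parse_cdx_rows`, the `matchType=domain` choice and the `limit` value are assumptions, not the commit's actual implementation.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode, urlparse

# Public CDX search endpoint of the Internet Archive.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(origin: str, limit: int = 500) -> str:
    """Build a CDX query URL for every capture under the origin's domain."""
    host = urlparse(origin).netloc or origin
    params = {
        "url": host,
        "matchType": "domain",       # assumption: include subdomains
        "output": "json",
        "fl": "original,timestamp",  # only the two fields the caller needs
        "collapse": "urlkey",        # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(body: str) -> List[Tuple[str, str]]:
    """Turn the JSON body into (url, timestamp) pairs; the first row is the header."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *data = rows
    i_url = header.index("original")
    i_ts = header.index("timestamp")
    return [(row[i_url], row[i_ts]) for row in data]


if __name__ == "__main__":
    print(build_cdx_query("https://example.com"))
    sample = '[["original","timestamp"],["http://example.com/datenschutz","20190101000000"]]'
    print(parse_cdx_rows(sample))
```

Keeping URL construction and response parsing as pure functions means both can be unit-tested without network access; the actual HTTP call (for example via `httpx`, which the module under change already imports) would sit between them.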
@@ -264,12 +277,32 @@ async def discover_legacy_urls(state: dict) -> dict:
             "age_months": age,
             "in_footer": in_footer,
             "recommendation": _recommend(status, age, False, in_footer),
+            "via": "sitemap/slug",
         }
 
-    results = await asyncio.gather(
-        *[_check(u) for u in cands], return_exceptions=True,
+    # CDX candidates: only probe liveness (the archive state is already known).
+    async def _check_cdx(url: str, ts: str) -> dict:
+        status, lm = await _probe_alive(url)
+        age = _months_since(ts)
+        in_footer = url.split("#")[0].split("?")[0] in footer_urls
+        return {
+            "url": url,
+            "status": status,
+            "last_modified": lm,
+            "wayback_snapshot": "",
+            "wayback_timestamp": ts,
+            "age_months": age,
+            "in_footer": in_footer,
+            "recommendation": _recommend(status, age, False, in_footer),
+            "via": "wayback-cdx",
+        }
+
+    gathered = await asyncio.gather(
+        *[_check(u) for u in cands],
+        *[_check_cdx(u, ts) for u, ts in cdx_legal],
+        return_exceptions=True,
     )
-    results = [r for r in results if isinstance(r, dict)]
+    results = [r for r in gathered if isinstance(r, dict)]
 
     # Filter: only show interesting ones (≥200 reachable + legacy-relevant)
     interesting: list[dict] = []
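
`_check_cdx` derives the legacy age directly from the CDX timestamp via `_months_since(ts)`, whose body is outside this diff. Wayback capture timestamps are 14-digit `YYYYMMDDhhmmss` strings in UTC, so one plausible way to compute whole months is sketched below. `months_since_cdx` is a hypothetical name, and returning `None` for unparseable input is an assumption about the desired behaviour.

```python
from datetime import datetime, timezone
from typing import Optional


def months_since_cdx(ts: str, now: Optional[datetime] = None) -> Optional[int]:
    """Whole months between a Wayback timestamp (YYYYMMDDhhmmss, UTC) and now."""
    try:
        then = datetime.strptime(ts[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # malformed or truncated timestamp
    now = now or datetime.now(timezone.utc)
    months = (now.year - then.year) * 12 + (now.month - then.month)
    if now.day < then.day:
        months -= 1  # the current month has not completed yet
    return max(months, 0)


if __name__ == "__main__":
    ref = datetime(2024, 3, 20, tzinfo=timezone.utc)
    print(months_since_cdx("20230115000000", ref))
    print(months_since_cdx("not-a-timestamp", ref))
```

Passing `now` explicitly keeps the helper deterministic in tests; in production the default falls back to the current UTC time.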
@@ -297,5 +330,6 @@ async def discover_legacy_urls(state: dict) -> dict:
         "candidates": interesting,
         "probed": len(results),
         "filtered_kept": len(interesting),
+        "cdx_candidates": len(cdx_legal),
         "origins": list(origins),
     }