fix: DSI self-extraction + banner L1/L2 check definitions
Build + Deploy / build-developer-portal (push) Successful in 1m26s
Build + Deploy / build-tts (push) Successful in 1m38s
Build + Deploy / build-document-crawler (push) Successful in 37s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / nodejs-build (push) Successful in 3m7s
CI / dep-audit (push) Has been skipped
Build + Deploy / build-admin-compliance (push) Successful in 2m22s
Build + Deploy / build-backend-compliance (push) Successful in 3m20s
Build + Deploy / build-ai-sdk (push) Successful in 54s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 45s
CI / test-python-document-crawler (push) Successful in 30s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 17s
Build + Deploy / trigger-orca (push) Successful in 3m37s
CI / sbom-scan (push) Has been skipped
1. DSI Discovery fix for direct-URL use case (e.g. example.com/datenschutz):
- Self-extraction: if the URL itself is a DSE page, extract its text
directly from the page body (main/article/content element)
- Remove "datenschutz" from NOISE_TITLES — it's a legitimate doc title
- Fixes safetykon.de/datenschutz returning 0 documents
2. Banner check definitions (36 checks: 6 L1 + 30 L2):
- consent-tester/checks/banner_checks.py with expert-level hints
- EDPB 3/2022, CNIL rulings, EuGH C-673/17, §25 TDDDG references
- check_key maps to existing consent_scanner check codes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
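The banner check definitions described in item 2 are not shown in the diff below. As an illustration only, a check table of the kind the commit message describes (a `check_key` mapping to an existing consent_scanner code, plus an expert-level hint and legal references) might look like the following sketch; the entry names, keys, and helper are assumptions, not the actual contents of `consent-tester/checks/banner_checks.py`:

```python
# Hypothetical shape of a banner check definition. The commit describes
# 36 such checks (6 for banner layer 1, 30 for layer 2); the two entries
# below are illustrative, not taken from the real file.
BANNER_CHECKS = [
    {
        "check_key": "reject_all_first_layer",  # assumed consent_scanner code
        "level": "L1",
        "hint": "A 'Reject all' option must be as easy to reach as 'Accept all'.",
        "references": ["EDPB Guidelines 3/2022", "CNIL cookie rulings"],
    },
    {
        "check_key": "no_preticked_boxes",  # assumed consent_scanner code
        "level": "L2",
        "hint": "Pre-ticked checkboxes do not constitute valid consent.",
        "references": ["EuGH C-673/17 (Planet49)", "§25 TDDDG"],
    },
]

def checks_by_level(level: str) -> list[dict]:
    """Filter check definitions by banner layer ("L1" or "L2")."""
    return [c for c in BANNER_CHECKS if c["level"] == level]
```

Keeping the definitions as plain data keyed by `check_key` lets the scanner look up hints and references without coupling the legal text to the scanning logic.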
@@ -220,6 +220,40 @@ async def discover_dsi_documents(
         await page.goto(url, wait_until="networkidle", timeout=60000)
         await page.wait_for_timeout(2000)
 
+        # Step 1b: Self-extraction — if the URL itself is a DSI page,
+        # extract its full text as the first document. This handles the
+        # case where the user provides the DSE URL directly (e.g.
+        # example.com/datenschutz) instead of the homepage.
+        current_url_path = urlparse(url).path.lower()
+        is_self_dsi, self_lang = _matches_dsi_keyword(current_url_path)
+        if not is_self_dsi:
+            # Also check the page title
+            page_title = await page.title() or ""
+            is_self_dsi, self_lang = _matches_dsi_keyword(page_title)
+        if is_self_dsi:
+            try:
+                self_text = await page.evaluate("""() => {
+                    const main = document.querySelector('main, article, [role="main"], .content, #content, .bodytext')
+                        || document.body;
+                    return main ? main.innerText : document.body.innerText;
+                }""")
+                self_wc = len(self_text.split()) if self_text else 0
+                if self_wc >= 100:
+                    page_title = await page.title() or url
+                    result.documents.append(DiscoveredDSI(
+                        title=page_title.strip(),
+                        url=url,
+                        source_url=url,
+                        language=self_lang or "de",
+                        doc_type="html_full_page",
+                        text=self_text.strip(),
+                        word_count=self_wc,
+                    ))
+                    seen_urls.add(url)
+                    logger.info("Self-extracted %d words from %s", self_wc, url)
+            except Exception as e:
+                logger.warning("Self-extraction failed for %s: %s", url, e)
+
         # Step 2: Find DSI links in current page
         links = await _find_dsi_links(page, base_domain)
         logger.info("Found %d DSI links on %s", len(links), url)
@@ -360,8 +394,9 @@ async def discover_dsi_documents(
     return result
 
 # Nav elements, not real documents
-NOISE_TITLES = {"drucken", "print", "nach oben", "back to top", "teilen", "share",
-                "kontakt", "contact", "suche", "search", "menü", "menu", "home", "datenschutz"}
+# NOTE: "datenschutz" was removed — it's a legitimate document title
+NOISE_TITLES = {"drucken", "print", "nach oben", "back to top", "teilen", "share",
+                "kontakt", "contact", "suche", "search", "menü", "menu", "home"}
 
 def _deduplicate_documents(docs: list[DiscoveredDSI]) -> list[DiscoveredDSI]:
     """Remove duplicate and noise documents."""