feat(audit): Screenshot + Tesseract OCR cookie extraction as vendor source C

Instead of fragile text regexes and LLM-cascade workarounds: a deterministic
pipeline. consent-tester takes a full-page screenshot of the cookie policy
(accepts the banner, expands accordions, burns in a timestamp). The backend
runs Tesseract OCR (deu, PSM 4) over it, and an anchor-based parser
extracts {name, category, purpose, duration, type} per cookie.

VW smoke test:
- Before (parse_flat): 60 cookies / 16 vendors
- Now (Tesseract): 79 cookies / 14 vendor records (~79% GT coverage)

Architecture:
- consent-tester: page_screenshot.py + /capture-evidence endpoint
- backend: cookie_screenshot_ocr.py with the Tesseract pipeline
- pipeline: runs after parse_flat as complementary stage C
- Dockerfile: tesseract-ocr + German language pack
- requirements: pytesseract

NO text correction on cookie names (awsalb stays awsalb).

The timestamp in the screenshot is legal evidence of what we actually saw
on the site at scan time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
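
The anchor-based parsing step described in the message could look roughly like the sketch below. cookie_screenshot_ocr.py itself is not part of the diff shown here, so the anchor labels (Name, Kategorie, Zweck, Dauer, Typ), the function name parse_cookie_blocks, and the sample text are all assumptions for illustration. In the real pipeline the input text would presumably come from a call such as pytesseract.image_to_string(image, lang="deu", config="--psm 4").

```python
import re
from typing import Optional

# Hypothetical anchor labels as they might appear in a German cookie policy,
# mapped to the field names listed in the commit message.
ANCHORS = {
    "Name": "name",
    "Kategorie": "category",
    "Zweck": "purpose",
    "Dauer": "duration",
    "Typ": "type",
}

_ANCHOR_RE = re.compile(r"^\s*(%s)\s*:\s*(.*)$" % "|".join(map(re.escape, ANCHORS)))


def parse_cookie_blocks(ocr_text: str) -> list:
    """Split OCR text into cookie records; each 'Name:' anchor starts a new record."""
    records = []
    current: Optional[dict] = None
    last_field: Optional[str] = None
    for line in ocr_text.splitlines():
        match = _ANCHOR_RE.match(line)
        if match:
            field = ANCHORS[match.group(1)]
            if field == "name":
                # Cookie names are stored verbatim, with no spelling correction.
                current = {"name": match.group(2).strip()}
                records.append(current)
            elif current is not None:
                current[field] = match.group(2).strip()
            last_field = field if current is not None else None
        elif current is not None and last_field not in (None, "name") and line.strip():
            # Wrapped line without an anchor: append it to the previous field.
            current[last_field] = (current[last_field] + " " + line.strip()).strip()
    return records


SAMPLE = """\
Name: awsalb
Kategorie: Funktional
Zweck: Lastverteilung zwischen
Servern
Dauer: 7 Tage
Typ: HTTP
Name: _ga
Kategorie: Statistik
Dauer: 2 Jahre
"""

cookies = parse_cookie_blocks(SAMPLE)
```

Keying records on a fixed anchor keeps the step deterministic: the same OCR text always yields the same records, and a missing field is simply absent instead of being guessed.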
@@ -16,6 +16,7 @@ from services.consent_scanner import run_consent_test, ConsentTestResult
 from services.authenticated_scanner import run_authenticated_test, AuthTestResult
 from services.playwright_scanner import scan_website_playwright
 from services.dsi_discovery import discover_dsi_documents, DSIDiscoveryResult
+from services.page_screenshot import capture_page_evidence
 from checks.banner_runner import map_scan_to_checks
 
 logging.basicConfig(level=logging.INFO, format="%(levelname)s:%(name)s: %(message)s")
@@ -365,6 +366,47 @@ async def dsi_discovery(req: DSIDiscoveryRequest):
     )
 
 
+# ── Evidence screenshot (full-page + timestamp) ─────────────────────
+
+class EvidenceRequest(BaseModel):
+    url: str
+    check_id: str = ""
+
+
+class EvidenceResponse(BaseModel):
+    url: str  # final URL after redirects
+    captured_at: str
+    width_px: int
+    height_px: int
+    accepted_banner: bool
+    expanded: int
+    png_b64: str
+    png_size: int
+
+
+@app.post("/capture-evidence", response_model=EvidenceResponse)
+async def capture_evidence(req: EvidenceRequest):
+    """Full-page screenshot with timestamp banner — for legal evidence.
+
+    Used by backend to capture the Cookie-Richtlinie + DSE pages so the
+    audit-mail ZIP-attachment contains the exact rendered DOM at scan time.
+    """
+    import base64 as _b64
+    logger.info("Capturing evidence screenshot for %s", req.url)
+    data = await capture_page_evidence(req.url, check_id=req.check_id)
+    png = data["png_bytes"]
+    return EvidenceResponse(
+        url=data["url"],
+        captured_at=data["captured_at"],
+        width_px=data["width_px"],
+        height_px=data["height_px"],
+        accepted_banner=data["accepted_banner"],
+        expanded=data["expanded"],
+        png_b64=_b64.b64encode(png).decode("ascii") if png else "",
+        png_size=len(png) if png else 0,
+    )
+
+
 # ── Admin: CMP discoveries (Phase E) ────────────────────────────────
 
 @app.get("/cmp-discoveries")
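
On the consuming side, the backend presumably has to decode png_b64 before running OCR or attaching the image to the audit mail. A minimal sketch of that step, assuming the JSON response has already been parsed into a dict with the EvidenceResponse fields; the helper name decode_evidence_png and its integrity checks are illustrative and not part of this commit.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte header of every PNG file


def decode_evidence_png(payload: dict) -> bytes:
    """Decode png_b64 from an EvidenceResponse-shaped dict and sanity-check it."""
    png = base64.b64decode(payload["png_b64"], validate=True)
    if len(png) != payload["png_size"]:
        raise ValueError("png_size does not match decoded length")
    # The endpoint returns an empty string when no screenshot was captured.
    if png and not png.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG")
    return png


# Stand-in payload: a PNG signature plus dummy bytes, not a real image.
fake_png = PNG_SIGNATURE + b"demo"
payload = {
    "png_b64": base64.b64encode(fake_png).decode("ascii"),
    "png_size": len(fake_png),
}
decoded = decode_evidence_png(payload)
```

Checking png_size against the decoded length is a cheap guard against truncated transfers, which matters when the image is meant to serve as evidence.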