1784b43d72
Replaces fragile text regexes and LLM-cascade workarounds with a deterministic
pipeline. consent-tester takes a full-page screenshot of the cookie policy
(accepts the banner, expands accordions, burns in a timestamp). The backend
runs Tesseract OCR (deu, PSM 4) over it, and an anchor-based parser
extracts {name, category, purpose, duration, type} per cookie.
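A minimal sketch of what an anchor-based parser over OCR text could look like. The anchor labels (`Name`, `Kategorie`, `Zweck`, `Dauer`, `Typ`), the `label: value` layout, and the function name `parse_cookie_blocks` are assumptions for illustration; the actual logic in cookie_screenshot_ocr.py is not shown here and may differ. Cookie names are kept verbatim, in line with the no-correction rule.

```python
import re
from typing import Dict, List, Optional

# Hypothetical anchor labels mapped to output fields. The real policy page
# (and the real parser) may use different labels or a different layout.
ANCHORS = {
    "name": "name",
    "kategorie": "category",
    "zweck": "purpose",
    "dauer": "duration",
    "typ": "type",
}
ANCHOR_RE = re.compile(
    r"^\s*(" + "|".join(ANCHORS) + r")\s*:\s*(.*)$", re.IGNORECASE
)


def parse_cookie_blocks(ocr_text: str) -> List[Dict[str, str]]:
    """Split OCR text into one record per cookie, keyed by anchor labels."""
    records: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    last_field: Optional[str] = None
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ANCHOR_RE.match(line)
        if match:
            field = ANCHORS[match.group(1).lower()]
            if field == "name":
                # A "Name" anchor opens a new record; the name is stored
                # exactly as OCR'd, with no spelling correction.
                current = {"name": match.group(2)}
                records.append(current)
            elif current is not None:
                current[field] = match.group(2)
            last_field = field if current is not None else None
        elif current is not None and last_field not in (None, "name"):
            # An unanchored line is treated as a wrapped continuation
            # of the previous field's value.
            current[last_field] += " " + line
    return records
```

In a setup like the one described, the input text would presumably come from something like `pytesseract.image_to_string(image, lang="deu", config="--psm 4")`; that wiring is an assumption, not taken from the commit.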
VW smoke test:
- Before (parse_flat): 60 cookies / 16 vendors
- Now (Tesseract): 79 cookies / 14 vendor records (~79% GT coverage)
Architecture:
- consent-tester: page_screenshot.py + /capture-evidence endpoint
- backend: cookie_screenshot_ocr.py with the Tesseract pipeline
- pipeline: runs after parse_flat as complementary stage C
- Dockerfile: tesseract-ocr + German language pack
- requirements: pytesseract
NO text correction is applied to cookie names (awsalb stays awsalb).
The timestamp in the screenshot is legal evidence of what we actually
saw on the site at scan time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
73 lines
2.0 KiB
Docker
# ---- Build stage ----
FROM python:3.12-slim-bookworm AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first for better caching
COPY requirements.txt requirements-reranker.txt ./

# Create virtual environment and install dependencies
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Only the optional reranker install may fail; the parentheses keep a
# failure of the pip upgrade or requirements.txt fatal for the build.
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt && \
    (pip install --no-cache-dir -r requirements-reranker.txt || \
    echo "WARNING: reranker dependencies not installed (torch/sentence-transformers)")

# ---- Runtime stage ----
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install runtime dependencies for WeasyPrint (PDF generation) + Tesseract OCR
# (cookie policy screenshot extraction via cookie_screenshot_ocr.py).
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libgdk-pixbuf-2.0-0 \
    libffi-dev \
    shared-mime-info \
    curl \
    tesseract-ocr \
    tesseract-ocr-deu \
    tesseract-ocr-eng \
    && rm -rf /var/lib/apt/lists/*

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Create non-root user + pre-create /data so volume mount inherits ownership
RUN useradd --create-home --shell /bin/bash appuser && \
    mkdir -p /data && chown appuser:appuser /data

# Copy application code
COPY --chown=appuser:appuser . .

# Switch to non-root user
USER appuser

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Expose port
EXPOSE 8002

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://127.0.0.1:8002/health || exit 1

# P83: build SHA for check-rebuild-needed.sh
ARG BUILD_SHA="unknown"
ENV BUILD_SHA=${BUILD_SHA}

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8002"]
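
Since the runtime image installs the `deu` and `eng` language packs, a small startup check can confirm that Tesseract actually sees them. The sketch below is an assumption-laden illustration, not part of the repository: the helper names are hypothetical, and the header line printed by `tesseract --list-langs` varies between versions, so it is filtered loosely.

```python
import shutil
import subprocess
from typing import Iterable, List, Set


def parse_langs(list_langs_output: str) -> Set[str]:
    """Turn the output of `tesseract --list-langs` into a set of codes."""
    lines = [ln.strip() for ln in list_langs_output.splitlines() if ln.strip()]
    # The first line is a human-readable header ("List of available
    # languages ..."); its exact wording is version dependent, so skip it
    # by prefix rather than by position.
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the required language codes that are not available, sorted."""
    return sorted(set(required) - set(available))


def check_tesseract(required: Iterable[str] = ("deu", "eng")) -> List[str]:
    """Run the tesseract CLI and report which required languages are missing."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams, since the listing may not always arrive on stdout.
    return missing_langs(required, parse_langs(proc.stdout + "\n" + proc.stderr))
```

Calling `check_tesseract()` inside the container should return an empty list when both language packs are installed; a non-empty list names the packs that are missing.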