Benjamin Admin
02879a2c3a
refactor: split cookie_screenshot_ocr.py (642 → 290 + 353 LOC)
...
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m19s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI hard-cap 500 LOC. cookie_screenshot_ocr.py war auf 642 gewachsen,
also gesplittet:
- cookie_screenshot_ocr_engines.py (353 LOC, NEU)
OCR-Engine-Funktionen: _slice_screenshot, Vision-LLM (qwen2.5vl),
PaddleOCR, Tesseract, parse_ocr_cookie_table, parse_vision_response,
Konstanten VISION_MODEL/OLLAMA_URL/VISION_PROMPT.
- cookie_screenshot_ocr.py (290 LOC, REWRITE)
Orchestration: capture_cookie_evidence_slices, _ocr_one_slice,
ocr_slices_extract_cookies, capture_cookie_screenshot,
extract_cookies_via_vision, cookies_to_vendor_records.
Re-Exports der Engine-Funktionen für Backward-Kompat.
Einziger externer Importer (_phase_d1_vendors_raw.py) braucht keinen
Code-Change — Public-API stabil.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-06-06 23:35:33 +02:00
Benjamin Admin
d2f26e70c6
perf(audit): parallel Tesseract OCR + Pipeline-Wire-In für Slicing
...
ocr_slices_extract_cookies nutzt jetzt ThreadPoolExecutor (4 workers).
Tesseract released die GIL, daher echtes parallelisieren möglich.
Sequenziell 32 slices ≈ 60s, parallel ~15s.
Pipeline in agent_compliance_check_routes.py: Step C ruft jetzt
capture_cookie_evidence_slices + ocr_slices_extract_cookies. Source
'tesseract_ocr' wird zu existing Vendors gemergt; neue Vendors als
eigenständige Records.
Final VW-Scan-Resultat:
- Cookies: 60 (parse_flat) → 128 (mit Tesseract) = +113%
- Vendors: 18 unique
- Adobe Analytics: 9 → 33 Cookies (Tesseract fand +24)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-23 06:36:16 +02:00
Benjamin Admin
efeef73f90
feat(audit): overlapping evidence-slices fuer lueckenlose Beweiskette
...
Statt EIN full-page screenshot: full-page wird per PIL in viewport-grosse
Slices geschnitten, jede ueberlappt die vorherige um overlap_px Pixel.
Jeder Cookie erscheint in mind. einer Slice, an Slice-Grenzen sogar in
zwei → Dedup nach Name eliminiert die Doppel.
Warum nicht direkt scroll-based slicing in Playwright? VW's
Cookie-Page nutzt scroll-snap / fixed-position — alle viewport-shots
kamen identisch zurueck (Header-Overlay). PIL-cut auf dem full-page
PNG bypasst das Problem voellig.
VW smoke-test (32 slices):
per-slice: [0, 0, 2, 5, 5, 3, 4, 7, 4, 3, 4, 5, ...]
103 raw cookies → 79 unique nach dedup
14 vendor records (Google 9, Adobe-Familie 17, etc.)
Jeder Slice hat eigenen Timestamp + SHA256 → ZIP-Anhang fuer
juristische Beweiskette.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-22 23:38:13 +02:00
Benjamin Admin
1784b43d72
feat(audit): Screenshot+Tesseract-OCR Cookie-Extract als Vendor-Quelle C
...
Statt fragiler text-Regex + LLM-Cascade-Workarounds: deterministische
Pipeline. consent-tester macht Full-Page-Screenshot der Cookie-Richtlinie
(akzeptiert Banner, klappt Accordions, brennt Timestamp ein). Backend
laesst Tesseract OCR (deu, PSM 4) drueber + anchor-basierter Parser
extrahiert {name, category, purpose, duration, type} pro Cookie.
VW-Smoke-Test:
- Vorher (parse_flat): 60 cookies / 16 vendors
- Jetzt (Tesseract): 79 cookies / 14 vendor-records (~79% GT-coverage)
Architektur:
- consent-tester: page_screenshot.py + /capture-evidence Endpoint
- backend: cookie_screenshot_ocr.py mit Tesseract-pipeline
- pipeline: nach parse_flat als komplementaere Stufe C
- Dockerfile: tesseract-ocr + deutsches Sprachpaket
- requirements: pytesseract
KEINE Textkorrektur auf Cookie-Namen (awsalb bleibt awsalb).
Timestamp im Screenshot = juristischer Beweis was wir zum Scan-Zeitpunkt
wirklich auf der Site gesehen haben.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-22 23:22:35 +02:00