Commit Graph

15 Commits

Author SHA1 Message Date
Benjamin Admin d45e08e25f fix: reduce Playwright timeout 180s→60s, increase poll limit 15→25min 2026-05-16 00:47:28 +02:00
Benjamin Admin 3dbf3aa34a feat: HTTP fallback for text extraction when Playwright times out
BMW Impressum/Cookie pages timeout in Playwright (>180s) because the
SPA has many sub-links to follow. But the HTML source already contains
the text (SSR). New fallback: direct HTTP GET + HTML tag stripping.

Order: 1. Consent-tester (Playwright, 180s) → 2. HTTP GET (30s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 23:16:10 +02:00
Benjamin Admin f34305c0a1 fix: increase dsi-discovery timeout 90s→300s, reduce max_documents 10→5 2026-05-15 14:21:13 +02:00
Benjamin Admin fca67c1f43 fix: accordion close bug + merge multi-page DSIs (BMW fix)
1. _expand_all_interactive(): Only click aria-expanded="false" buttons.
   Before: clicked ALL accordion buttons including open ones → BMW's
   pre-expanded accordions got CLOSED, reducing text from 1151 to 361w.

2. _fetch_text() + /extract-text: merge ALL documents found on a page
   (max_documents=10 instead of 1). BMW splits DSI across 5 sub-pages
   that the discovery finds as separate documents — now merged.

3. Tab panels: unhide hidden tabpanels instead of clicking tabs
   (clicking tabs can hide the currently visible panel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 13:32:04 +02:00
Benjamin Admin 9f87bc5a2c fix: include website/company name in compliance-check email subject 2026-05-15 10:15:34 +02:00
Benjamin Admin d72aa10691 feat: management summary for GF + batch GT test script
1. Management Summary (agent_doc_check_report.py):
   - Plain-language action items for Geschaeftsfuehrer
   - Maps technical checks to business actions ("Ihren DSB erwaehnen",
     "Beschwerderecht ergaenzen", "Loeschfristen dokumentieren")
   - Shows at top of compliance check email before detail report
   - Max 10 actions, max 3 per document

2. Batch GT Test (zeroclaw/scripts/batch_gt_test.py):
   - Runs all 10 GT websites through compliance-check API
   - Prints comparison table with L1 scores, word counts, services
   - Saves raw JSON results for analysis
   - Usage: python3 batch_gt_test.py --sites 1,6 --backend-url URL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 09:39:19 +02:00
Benjamin Admin 826ce2a1b8 fix(cross-doc): suppress false positives when regex checks already pass
Cross-search "not in text" findings are only shown when regex L1
completeness < 50%. This prevents false positives where the text IS
the right doc_type but doesn't contain the specific cross-search
keywords (e.g. Impressum passes 9/13 checks but lacks "§5 TMG").

Also: cross-search now checks entries with wrong text, not just empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 00:54:33 +02:00
Benjamin Admin 4e9043f26d feat(cross-doc): search all texts for all doc_types + misplacement finding
Cross-Document Intelligence: When a doc_type row is empty, searches
ALL other loaded documents for that content. If found (e.g. Widerruf
in AGB), extracts the section, runs the check, AND creates a finding:
"Widerrufsbelehrung in falschem Dokument gefunden — schwer auffindbar"

Keywords for: widerruf, cookie, social_media, impressum, agb, dsb.
Integrated as Step 1c in compliance check pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-14 23:19:39 +02:00
Benjamin Admin c702260ec1 fix: 5 regex bugs + text extraction scroll + GT update
Build + Deploy / build-admin-compliance (push) Successful in 13s
Build + Deploy / build-backend-compliance (push) Successful in 23s
Build + Deploy / build-ai-sdk (push) Successful in 13s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 15s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m26s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 39s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
   "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
   rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 01:20:55 +02:00
Benjamin Admin c867478791 feat(tcf-vendors): GVL cache + vendor extraction + VVT mapping
Build + Deploy / build-admin-compliance (push) Successful in 14s
Build + Deploy / build-backend-compliance (push) Successful in 16s
Build + Deploy / build-ai-sdk (push) Successful in 20s
Build + Deploy / build-developer-portal (push) Successful in 12s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 13s
Build + Deploy / build-dsms-node (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m49s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 38s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 23s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m23s
Phase 1-2 of the closed quality loop:
- GVL cache (consent-tester/services/gvl_cache.py): downloads and caches
  IAB Global Vendor List with 24h TTL, resolves vendor IDs to names,
  purposes, policy URLs, retention, country
- Vendor extraction (consent_interceptor.py): extract_tcf_vendors()
  reads __tcfapi after accept phase, resolves via GVL
- Scan response: tcf_vendors field added to /scan endpoint
- VVT mapper (vendor_vvt_mapper.py): maps TCF vendors to VVT format
  with purpose labels, Rechtsgrundlage, Drittland detection
- Vendor cross-check (banner_cookie_cross_check.py): checks all TCF
  vendors against DSI text — missing vendors, undocumented transfers
- Compliance check integrates Step 3d: TCF vendors vs DSI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 18:18:50 +02:00
Benjamin Admin 7be34552bb feat(compliance-check): profile extraction + scenario classification
Build + Deploy / build-admin-compliance (push) Successful in 15s
Build + Deploy / build-backend-compliance (push) Successful in 21s
Build + Deploy / build-ai-sdk (push) Successful in 46s
Build + Deploy / build-developer-portal (push) Successful in 12s
Build + Deploy / build-tts (push) Successful in 13s
Build + Deploy / build-document-crawler (push) Successful in 11s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m46s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 47s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 16s
Build + Deploy / trigger-orca (push) Successful in 2m29s
- New profile_extractor.py: extracts Company Profile fields (name,
  legal form, address, DPO, USt-IdNr) and Compliance Scope hints
  (Art. 9 data, third country, profiling) from document texts
- Scenario per document: regenerate (<30%), fix (30-95%), import (>95%)
- Widerruf for B2B: no longer skipped, instead all checks flagged as
  INFO with "not needed for B2B" hint
- Move _build_profile_html to report builder module
- DocCheckResult gets scenario field

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:34:33 +02:00
Benjamin Admin be9cfdc2d4 feat(compliance-check): skip Widerruf for B2B, limit MCs, fix industry
Build + Deploy / build-admin-compliance (push) Successful in 2m1s
Build + Deploy / build-backend-compliance (push) Successful in 4m20s
Build + Deploy / build-ai-sdk (push) Successful in 53s
Build + Deploy / build-developer-portal (push) Successful in 2m6s
Build + Deploy / build-tts (push) Successful in 2m48s
Build + Deploy / build-document-crawler (push) Successful in 52s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m17s
- Skip Widerrufsbelehrung check entirely for B2B/B2G businesses
- Limit MC checks to top 20 per doc_type (by severity) to reduce noise
  (e.g. 75 impressum MCs → 20, avoiding 55 irrelevant FAILs)
- Add consulting/manufacturing industry keywords (arbeitssicherheit,
  brandschutz, werkzeugbau, etc.)
- Lower industry detection threshold from 2 to 1 keyword hit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:03:57 +02:00
Benjamin Admin 74f00bbb0f feat(compliance-check): split shared URLs into sections per doc_type
Build + Deploy / build-admin-compliance (push) Successful in 2m4s
Build + Deploy / build-backend-compliance (push) Successful in 3m39s
Build + Deploy / build-ai-sdk (push) Successful in 50s
Build + Deploy / build-developer-portal (push) Successful in 1m12s
Build + Deploy / build-tts (push) Successful in 2m16s
Build + Deploy / build-document-crawler (push) Successful in 1m9s
Build + Deploy / build-dsms-gateway (push) Successful in 35s
Build + Deploy / build-dsms-node (push) Successful in 32s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m37s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 43s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m16s
When the same URL is used for multiple document types (e.g. /datenschutz
for DSI + Cookie + DSB), the section splitter now:
- Detects duplicate URLs and fetches text only once
- Splits text at classified headings (Cookie, Google Analytics, etc.)
- Assigns matching sections to each doc_type
- DSI always keeps the full text

Extracted to section_splitter.py (170 LOC) to keep routes under 500.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 12:49:57 +02:00
Benjamin Admin b6ad958b69 feat(compliance-check): integrate banner cross-check + extract to module
Build + Deploy / build-admin-compliance (push) Successful in 1m57s
Build + Deploy / build-backend-compliance (push) Successful in 3m20s
Build + Deploy / build-ai-sdk (push) Successful in 48s
Build + Deploy / build-developer-portal (push) Successful in 1m6s
Build + Deploy / build-tts (push) Successful in 1m43s
Build + Deploy / build-document-crawler (push) Successful in 44s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 18s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 16s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m40s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 47s
CI / test-python-backend (push) Successful in 38s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 20s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m26s
Add automatic banner check (Step 3b) and banner-vs-cookie cross-check
(Step 3c) to unified compliance check. Extract cross-check logic to
banner_cookie_cross_check.py to keep routes under 500 LOC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 00:08:47 +02:00
Benjamin Admin 0d0e705117 feat: Unified Compliance-Check — 8 document types in one form
New 3-tab structure: Website-Scan, Compliance-Check, Banner-Check.

Compliance-Check Tab (replaces Dokumenten-Pruefung + Impressum-Check):
- 8 document rows: DSI, Impressum, Social Media, Cookie, AGB,
  Nutzungsbedingungen, Widerruf, DSB-Kontakt
- Each row: URL input + "Text laden" + file upload + manual text
- "Text laden" extracts via consent-tester, shows in editable textarea
- User verifies/corrects text before checking
- Empty fields = "not present" → own finding

Business Profiler (business_profiler.py):
- Detects B2B/B2C/B2G from all documents together
- Recognizes regulated professions, online shops, editorial content
- Context-aware: INFO checks become PASS/FAIL based on profile

Backend: /compliance-check + /extract-text endpoints
Frontend: ComplianceCheckTab.tsx + DocumentRow.tsx
API proxies: compliance-check/route.ts + extract-text/route.ts

Also: Impressum regex fixes (Telefon, AG, Geschaeftsfuehrung)
and INFO severity for context-dependent checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 20:56:10 +02:00