Commit Graph

20 Commits

Author SHA1 Message Date
Benjamin Admin 189918b043 fix(cmp): stricter heuristic + only replace DOM when CMP is strictly larger
Two bugs observed in BMW BMW test run:

1. Generic JSON heuristic captured /de-de/login/bmw/api/flyout/data (4KB,
   user login fly-out data) and reconstruct_generic produced 56 words of
   noise. The CMP-prefer logic then 'replaced' the 185-word imprint DOM
   extraction with those 56 words because self_wc(185) < 300 — even
   though cmp_wc(56) < self_wc(185).

2. The strict prefilter list was too short. Login/auth/cart endpoints
   often have category-shaped JSON without being cookie policies.

Fixes:
- dsi_discovery: replace DOM with CMP only when cmp_wc > self_wc AND
  meets one of the existing conditions. Tiny captures can no longer
  silently destroy a bigger DOM extraction.
- cmp_extractor: skip non-cookie URLs (/login, /auth, /user, /session,
  /cart, /checkout, /search, /flyout, /menu, /nav, /translation, /i18n,
  /locale, /feature-flag).
- cmp_extractor: require ≥5KB payload size — real CMP policies are
  always larger (BMW ePaaS is ~393KB). Tiny matches drop out before
  reconstruction.
2026-05-17 10:50:19 +02:00
Benjamin Admin ea4dbb223f feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)
When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now
extract structured vendor records, probe their opt-out + privacy URLs,
score each vendor (0-100), and append a 'VVT-Vorschlag' table to the
compliance email — one row per vendor, sortable by compliance score.

consent-tester:
- DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers
- DSIDiscoveryResponse: new cmp_payloads field
- discover_dsi_documents sets cmp_payloads from cmp_capture
- cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning
  list[VendorRecord]

backend:
- _fetch_text() now returns (text, cmp_payloads) tuple
- doc_entries store cmp_payloads per doc (mostly cookie)
- _autodiscover_missing forwards homepage payloads to the cookie entry
- New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic
  schemas; dedupes vendors across multiple payloads
- cookie_link_validator.py extended with validate_vendor_urls(vendors)
  and score_vendors(vendors) — 0-100 score per vendor based on name,
  purpose, country, opt-out reachable, privacy URL reachable, cookies
  with names + expiry
- agent_doc_check_extras.build_vvt_table_html: renders the table
- Route appends VVT HTML after the provider list, before the
  document-by-document report
- Response JSON gains cmp_vendors for future frontend rendering

Example for BMW: ~30 ePaaS providers → table with Name | Kategorie |
Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Sorted by
score ascending so the worst-compliant vendors are at the top.
2026-05-17 09:50:11 +02:00
Benjamin Admin 5f2da1de88 feat(consent-tester): Phase E — self-improving CMP library
cmp_discovery_log.py:
- sqlite log at /data/cmp_discoveries.db: every LLM-discovered CMP
  pattern recorded with domain, strategy, value, sample text
- Auto-promote (user-chosen 'voll automatisch' mode): when LLM returns
  strategy=url AND extracted text >= 800 words, write a new module
  /data/auto_cmp/auto_<slug>.py with derived regex matcher + reconstruct
- record_discovery() called from dsi_discovery._try_llm_cascade on success

cmp_library/_registry.py:
- Loads both hand-written modules from services/cmp_library/ AND
  auto-promoted modules from /data/auto_cmp/ (CMP_AUTO_DIR env)
- Auto modules use importlib.util.spec_from_file_location, no package
  install needed; restart consent-tester to pick up new ones

dsi_discovery.py:
- _try_llm_cascade now calls record_discovery() on every successful
  LLM analysis (cached AND fresh)

main.py:
- GET /cmp-discoveries — admin endpoint listing all logged discoveries
- DELETE /cmp-discoveries/{id} — rollback (unlinks auto_*.py)

This closes the self-improving loop: first encounter with a new CMP fires
the LLM (cost) → discovery is auto-promoted → all future runs against the
same vendor pattern hit Phase B (Named CMP) at <50ms with no LLM call.
2026-05-16 23:09:23 +02:00
Benjamin Admin 2400aa6a9e feat(consent-tester): Phase C+D — LLM cascade fallback (Qwen → OVH)
New module consent-tester/services/cmp_llm_fallback.py:
- LLMCookieExtractor: single-endpoint adapter (Ollama OR OpenAI-compat)
- LLMCascade: tries Qwen (local Mac Mini Ollama) first; falls through to
  OVH (managed 120B) when Qwen returns no usable strategy
- LLMCascade.from_env(): reads OLLAMA_URL/CMP_LLM_MODEL + OVH_LLM_URL/
  OVH_LLM_KEY/OVH_LLM_MODEL from environment
- LLM returns JSON {strategy: url|selector|text, value: ...}
- Valkey-backed cache per netloc (cmp:hint:<netloc>, 7-day TTL) — next run
  against the same domain skips the LLM entirely

dsi_discovery.py:
- Wired network_log collector (URL/status/content-type/size of every JSON
  response on the page) — passed to LLM prompt as observation
- After Named CMP (Phase B) + Heuristic (Phase A) both fail AND DOM
  < 300 words: invoke LLMCascade.analyze(...)
- _apply_llm_hint executes the LLM's strategy: refetch URL via Playwright
  request context, query DOM selector, or use text directly
- Cache HIT path: apply cached hint, only fall back to LLM if cache is stale

docker-compose.yml:
- consent-tester gets env vars + cmp-data volume (for Phase E)
- All LLM endpoints configurable via env, sensible defaults

consent-tester/requirements.txt:
- redis>=5.0 (asyncio client, Valkey-compatible)
- httpx>=0.27
2026-05-16 23:06:05 +02:00
Benjamin Admin 17a93bc694 fix(consent-tester): prefer CMP-JSON over thin DOM extraction
Previous threshold (DOM < 300 words) missed the BMW case where Playwright
extracted 346 words of pure site navigation. The CMP JSON had 1673 words
of real policy content but was discarded.

New heuristic: prefer CMP when ANY of:
  - DOM < 300 words (existing)
  - CMP text >= 1000 words (authoritative at scale)
  - CMP text >1.5x longer than DOM
2026-05-16 20:56:11 +02:00
Benjamin Admin 1792c6f896 fix(consent-tester): capture CMP JSON to extract dynamically-loaded cookie policies
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.

New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
  categories + providers (BMW: 1673 words reconstructed vs. 603 noise)

dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
  payload, prefer the CMP-reconstructed text. This bypasses the empty
  '.cookie-policy' div problem entirely.
2026-05-16 20:50:15 +02:00
Benjamin Admin e61e9d9e2a feat(agent): progress_pct + 6 BMW-Run Verbesserungen
Backend (agent_compliance_check_routes.py):
- progress_pct (0-100%) im Job-State, ueber alle Phasen verteilt
  (Laden 0-30, Profil 35-40, Pruefen 40-80, Banner 80-92, Report 95-100)
- Status-Texte vereinheitlicht ("Texte laden X/N", "Pruefen X/N")
- Firmenname fuer Email-Subject jetzt aus URL abgeleitet
  (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") statt
  unzuverlaessigem extracted_profile.companyName (matchte oft juris.de)
- E-Mail-Report enthaelt jetzt Banner+TCF-Vendor-Liste (build_provider_list_html)

Backend (agent_doc_check_extras.py — neu):
- build_scanned_urls_html: gepruefte URLs als Tabelle oben im Report
  (transparent fuer GF, welche Quellen wirklich gezogen wurden)
- Cross-Domain-Hinweis bei >1 netloc (BMW: bmw.de / bmwgroup.com /
  bmwgroup.jobs — Auffindbarkeit nach Art. 12 DSGVO)
- build_provider_list_html: Banner-Box + TCF-Vendor-Tabelle mit Spalten
  Name | Kategorie | Zweck | Drittland | Rechtsgrundlage

Backend (business_profiler.py):
- §34d-GewO Versicherungsvermittler-Hinweise zaehlen nicht mehr als
  "finance"-Industrie (BMW wurde dadurch falsch als B2B/finance erkannt)
- Neue Industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette)
- B2B-Keywords: generische Begriffe wie "unternehmen", "beratung",
  "consulting" entfernt (matchten in jedem Konzerntext)
- B2C-Fallback: bei Verbraucher-Signalen ("widerruf", "kunde",
  redaktioneller Inhalt) tendiert auf b2c statt b2b

Frontend (ComplianceCheckTab.tsx):
- Progress-Balken mit Width-% und XX%-Anzeige rechts
- liest data.progress_pct aus Polling-Response

Consent-Tester (dsi_discovery.py):
- Cookie-Policy-Extraktion kritisch fixt: wait_for_function bis
  body.innerText > 500 chars (BMW SPA-Rendering brauchte mehr Zeit)
- _extract_text_robust: 3-Strategien-Extraktion (Selektoren -> Body-
  Cleanup -> P/LI/TD-Tags)
- _extract_text_from_iframes: liest OneTrust/Sourcepoint/Usercentrics
  Iframe-Inhalte (manche Cookie-Policies leben dort)

Adressiert alle Findings aus dem BMW-Ground-Truth-Vergleich.
2026-05-16 17:53:14 +02:00
Benjamin Admin fca67c1f43 fix: accordion close bug + merge multi-page DSIs (BMW fix)
1. _expand_all_interactive(): Only click aria-expanded="false" buttons.
   Before: clicked ALL accordion buttons including open ones → BMW's
   pre-expanded accordions got CLOSED, reducing text from 1151 to 361w.

2. _fetch_text() + /extract-text: merge ALL documents found on a page
   (max_documents=10 instead of 1). BMW splits DSI across 5 sub-pages
   that the discovery finds as separate documents — now merged.

3. Tab panels: unhide hidden tabpanels instead of clicking tabs
   (clicking tabs can hide the currently visible panel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 13:32:04 +02:00
Benjamin Admin 5e317d2f0f fix: text extraction 50k char limit was root cause of all Spiegel FNs
Build + Deploy / build-admin-compliance (push) Successful in 18s
Build + Deploy / build-backend-compliance (push) Successful in 12s
Build + Deploy / build-ai-sdk (push) Successful in 10s
Build + Deploy / build-developer-portal (push) Successful in 10s
Build + Deploy / build-tts (push) Successful in 10s
Build + Deploy / build-document-crawler (push) Successful in 9s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 15s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m46s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 41s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 13s
Build + Deploy / trigger-orca (push) Successful in 2m13s
ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars.
Spiegel DSI has 107,720 chars (13,705 words) — only 47% was extracted.
DSB, Art. 77, Betroffenenrechte were all in the truncated portion.

Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions,
   and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:22:38 +02:00
Benjamin Admin b2c1f0ae84 fix(consent): add Sourcepoint iframe handler + banner_detector fallback
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m1s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 57s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 28s
CI / test-python-dsms-gateway (push) Successful in 25s
CI / validate-canonical-controls (push) Successful in 15s
Root cause: Spiegel DSI text was truncated because Sourcepoint consent
wall was not dismissed — dsi_helpers.py had no Sourcepoint handler.

Fixes:
1. Add Sourcepoint iframe click (frame_locator + .sp_choice_type_11)
2. Add banner_detector fallback (reuses 30 CMP selectors from scanner)
3. After banner dismiss, wait and re-navigate if page redirected
4. Add "Zustimmen und weiter" to generic text button list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 10:12:50 +02:00
Benjamin Admin c702260ec1 fix: 5 regex bugs + text extraction scroll + GT update
Build + Deploy / build-admin-compliance (push) Successful in 13s
Build + Deploy / build-backend-compliance (push) Successful in 23s
Build + Deploy / build-ai-sdk (push) Successful in 13s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 15s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m26s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 39s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
   "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
   rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 01:20:55 +02:00
Benjamin Admin fde2f551d7 fix: Add impressum keywords to dsi_discovery.py inline DSI_KEYWORDS
The inline DSI_KEYWORDS in dsi_discovery.py was missing 'impressum'.
This caused self-extraction to skip impressum pages, returning
datenschutz text instead. Added: impressum, anbieterkennzeichnung,
imprint, legal notice, site notice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 14:43:47 +02:00
Benjamin Admin 686834cea0 feat: 4 remaining tasks — EU institutions, banner integration, JS-sites, Caritas fixes
Build + Deploy / build-admin-compliance (push) Successful in 8s
Build + Deploy / build-backend-compliance (push) Successful in 8s
Build + Deploy / build-ai-sdk (push) Failing after 36s
Build + Deploy / build-developer-portal (push) Successful in 8s
Build + Deploy / build-tts (push) Successful in 7s
Build + Deploy / build-document-crawler (push) Successful in 7s
Build + Deploy / build-dsms-gateway (push) Successful in 8s
Build + Deploy / build-dsms-node (push) Successful in 8s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m14s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 43s
CI / test-python-document-crawler (push) Successful in 29s
CI / test-python-dsms-gateway (push) Successful in 30s
CI / validate-canonical-controls (push) Successful in 16s
1. EU Institution Checks (Verordnung 2018/1725):
   - New doc_type "eu_institution" with 9 L1 + 15 L2 checks
   - Both German + English patterns (EU institutions are multilingual)
   - Auto-detection via "2018/1725", "EDSB", "EDPS" keywords
   - Correct article references (Art. 15 instead of 13, Art. 5 instead of 6)

2. Banner Check Integration:
   - banner_runner.py maps scan results to 36 L1/L2 structured checks
   - BannerCheckTab shows hierarchical ChecklistView with hints
   - 3-phase summary (cookies/scripts before/after consent)
   - /scan endpoint now includes structured_checks in response

3. JS-heavy Website Fixes (dm, Zalando, HWK):
   - dsi_helpers.py: goto_resilient (networkidle→domcontentloaded fallback)
   - try_dismiss_consent_banner before text extraction
   - PDF redirect detection (dm.de redirects to GCS PDF)

4. Caritas False Positive Fixes:
   - Phone regex allows parentheses: +49 (0)761 → now matches
   - "Recht auf Widerspruch" (3 words) + §23 KDG → matches Art. 21
   - Church authorities: "Katholisches Datenschutzzentrum" recognized

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-08 01:10:10 +02:00
Benjamin Admin 608fb7faf5 fix: DSI self-extraction + banner L1/L2 check definitions
Build + Deploy / build-admin-compliance (push) Successful in 2m22s
Build + Deploy / build-backend-compliance (push) Successful in 3m20s
Build + Deploy / build-ai-sdk (push) Successful in 54s
Build + Deploy / build-developer-portal (push) Successful in 1m26s
Build + Deploy / build-tts (push) Successful in 1m38s
Build + Deploy / build-document-crawler (push) Successful in 37s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 11s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 18s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m7s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 46s
CI / test-python-backend (push) Successful in 45s
CI / test-python-document-crawler (push) Successful in 30s
CI / test-python-dsms-gateway (push) Successful in 27s
CI / validate-canonical-controls (push) Successful in 17s
Build + Deploy / trigger-orca (push) Successful in 3m37s
1. DSI Discovery fix for direct-URL use case (e.g. example.com/datenschutz):
   - Self-extraction: if the URL itself is a DSE page, extract its text
     directly from the page body (main/article/content element)
   - Remove "datenschutz" from NOISE_TITLES — it's a legitimate doc title
   - Fixes safetykon.de/datenschutz returning 0 documents

2. Banner check definitions (36 checks: 6 L1 + 30 L2):
   - consent-tester/checks/banner_checks.py with expert-level hints
   - EDPB 3/2022, CNIL rulings, EuGH C-673/17, §25 TDDDG references
   - check_key maps to existing consent_scanner check codes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-07 20:53:13 +02:00
Benjamin Admin e494cf62bb fix: Increase page load timeouts — IHK site needs >30s for networkidle
- Initial page.goto timeout: 30s → 60s (IHK loads many JS resources)
- Per-page navigation timeout: 20s → 45s (heavy JS sites)
- Reduced extra wait from 3s+1s back to 2s+0.5s (goto timeout handles slow loads)
- Playwright scanner page timeout: 20s → 45s

Root cause: IHK website has heavy JavaScript that takes >30s to reach
'networkidle' state, causing DSI discovery to fail immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 13:10:59 +02:00
Benjamin Admin d547e63663 fix: DSI dedup prefers 'Datenschutzinformation*' titles + better JS content extraction
Bug 1 fix: When merging documents with identical word_count, prefer
titles starting with 'Datenschutzinformation' over generic section
headings like 'Zweck und Rechtsgrundlage'. This restores the main
'Datenschutzinformationen zum Internetangebot' document.

Bug 2 fix: After navigating to a document page, wait 3s (was 2s) for
JS content loading, then try 10+ content selectors before falling back
to body text (with nav/header/footer removed). Handles IHK-style JS
navigation where content loads after page.goto() completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 12:26:42 +02:00
Benjamin Admin 6c5e086356 fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (same page, different title)
- URL-only titles filtered

False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:41:07 +02:00
Benjamin Admin 7c7513525e feat: Document-centric scan results + DSI deduplication
DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
  parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents

Backend (agent_scan_routes):
- ScanFinding gets doc_title field linking each finding to its document
- doc_title set when creating DSI findings for document attribution

Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to "Allgemeine Findings" section

Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- No more flat mixed list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 09:56:29 +02:00
Benjamin Admin a846bd8910 fix: Exhaustive crawl — no arbitrary page/document limits
Both scanners now search until done, not until a counter runs out:

playwright_scanner.py:
- Default max_pages raised from 15 to 50
- Added 3-minute timeout as safety net
- Recursive link discovery on EVERY visited page (not just DSE pages)
- Stops when: all links visited OR max_pages OR timeout

dsi_discovery.py:
- Default max_documents raised from 30 to 100
- Added 5-minute timeout as safety net
- Recursive: on each visited page, searches for MORE DSI links
- Processes ALL discovered links exhaustively
- Stops when: no more pending links OR max_documents OR timeout

The scanners now behave like a real user: they follow every relevant
link they find, and on each new page they look for more links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:21:57 +02:00
Benjamin Admin 4e63a6050d feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords

Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)

API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview

Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:55 +02:00