f19a75d83dcf7d647fa2e69753787bea9e7aad11
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
525038359a |
feat(compliance-check): auto-discover missing doc types from homepage
When the user leaves some doc-type rows empty, the tool now actively searches the website for them and only marks them 'not found' as a last resort.
Flow:
1. User submits N URLs (e.g. just DSI)
2. For each canonical doc_type with no submitted URL/text, the route identifies the most-common base (scheme://netloc) from submitted URLs
3. Calls consent-tester /dsi-discovery on the homepage with max_documents=15 (180s timeout)
4. Classifies every discovered doc into a canonical doc_type via title/URL keyword rules (_DISCOVERY_RULES: covers cookie/widerruf/social_media/agb/nutzungsbedingungen/dsb/impressum/dse)
5. Fills matching empty entries with the discovered text, marks auto_discovered=True and discovery_attempted=True
Padding now differentiates:
- 'Auf der Website nicht gefunden': discovery was attempted, no doc matched. Amber badge, friendly hint to add URL manually.
- 'Nicht eingereicht — Quelle nicht angegeben': user gave NO URLs at all, nothing to crawl from. Grey badge.
Email + frontend:
- Status labels: NICHT GEFUNDEN (amber) vs NICHT EINGEREICHT (grey)
- 'Gepruefte Quellen' table tags auto-discovered URLs with a small blue 'auto-entdeckt' badge so the GF sees what the tool found vs what the user submitted.
Implementation only runs when ≥1 URL was submitted (no base to crawl from otherwise). Adds 30-90s for unsubmitted types but avoids the 'just say nicht gefunden' anti-pattern. |
||
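Steps 2 and 4 of the flow in commit 525038359a can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names `most_common_base` and `classify_discovered` are hypothetical, and the keyword lists in `DISCOVERY_RULES` are assumptions, since the message names `_DISCOVERY_RULES` but does not show its contents.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed keyword rules; the real _DISCOVERY_RULES are not shown in the commit.
# Order matters: specific types are checked before the generic 'dse'.
DISCOVERY_RULES = [
    ("cookie", ("cookie",)),
    ("widerruf", ("widerruf",)),
    ("impressum", ("impressum", "imprint")),
    ("dse", ("datenschutz", "privacy")),
]


def most_common_base(urls):
    """Return the most frequent scheme://netloc among the URLs, or None."""
    bases = []
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme and parts.netloc:
            bases.append(f"{parts.scheme}://{parts.netloc}")
    if not bases:
        return None  # nothing to crawl from
    return Counter(bases).most_common(1)[0][0]


def classify_discovered(title, url):
    """Map a discovered document to a doc_type via title/URL keywords."""
    haystack = f"{title} {url}".lower()
    for doc_type, keywords in DISCOVERY_RULES:
        if any(keyword in haystack for keyword in keywords):
            return doc_type
    return None  # no rule matched; the entry stays empty
```

Returning `None` from `most_common_base` mirrors the note that discovery only runs when at least one URL was submitted.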
|
|
bc21480a2a |
fix(compliance-check): always render 8 doc types + 4 BMW GT-gap fixes
Always-show-8 (user-requested):
- agent_compliance_check_routes.py: _pad_results_with_missing pads the results list to always include all 8 canonical doc_types in canonical order. Missing types get a placeholder DocCheckResult with error='Nicht eingereicht' + scenario='missing'.
- agent_doc_check_report.py: NICHT EINGEREICHT status label (neutral), friendly grey body block instead of red error.
- ChecklistView.tsx: 'Nicht eingereicht' chip (neutral grey, not red 'Fehler'); SCENARIO_LABELS adds missing entry + header chip counter.
Impressum-Regression fix (#18):
- _fetch_text(url, doc_type): cookie/dse/social_media -> max_documents=1 (CMP capture authoritative, sub-pages dilute). Other types -> max_documents=3 (Impressum needs Versicherungsvermittler, Aufsicht, Berufsrecht sub-pages). 15s networkidle bail keeps timing safe.
ODR/Verbraucherstreitbeilegung filter (#19):
- _apply_profile_filter: when profile.needs_odr=True (B2C), override the check's default B2B-oriented hint with action-oriented B2C guidance pointing at Art. 14 EU-VO 524/2013 + §36 VSBG. Previously the check contradicted itself: 'profile says B2C' + hint 'only relevant for B2C online vendors'.
Registergericht regex (#20):
- impressum_checks.py: accept colon/dot/dash between keyword and city (BMW writes 'registergericht: münchen hrb 42243'). Add 'sitz und registergericht: X' as separate pattern.
Industry detection (#21):
- business_profiler.py: 'automotive' keywords broadened (antriebs, motor, leasing, werkstatt, probefahrt, plus brand names BMW/Mercedes/Audi/VW/Porsche/Opel). 'it_services' keywords narrowed: software/cloud/hosting are mentioned in every privacy policy and were biasing the result toward IT for any tech-aware company. |
||
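The always-show-8 padding from commit bc21480a2a can be illustrated with a short sketch. It is written under assumptions: the `DocCheckResult` below is a simplified stand-in for the real class, and the order of `CANONICAL_DOC_TYPES` is a guess (the eight names come from the doc types listed in commit 525038359a, but the canonical order is not stated).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed order; only the eight names are taken from the commit messages.
CANONICAL_DOC_TYPES = [
    "dse", "impressum", "cookie", "agb",
    "widerruf", "nutzungsbedingungen", "social_media", "dsb",
]


@dataclass
class DocCheckResult:
    """Simplified stand-in for the real result class."""
    doc_type: str
    error: Optional[str] = None
    scenario: Optional[str] = None


def pad_results_with_missing(results):
    """Return one result per canonical doc_type, in canonical order."""
    by_type = {result.doc_type: result for result in results}
    padded = []
    for doc_type in CANONICAL_DOC_TYPES:
        existing = by_type.get(doc_type)
        if existing is not None:
            padded.append(existing)
        else:
            # Placeholder for a doc type that was never submitted.
            padded.append(
                DocCheckResult(doc_type, error="Nicht eingereicht", scenario="missing")
            )
    return padded
```

Because the output always follows the canonical list, the report and the frontend can render the same eight rows regardless of what was submitted.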
|
|
e61e9d9e2a |
feat(agent): progress_pct + 6 BMW-run improvements
Backend (agent_compliance_check_routes.py):
- progress_pct (0-100%) in the job state, distributed across all phases (load 0-30, profile 35-40, check 40-80, banner 80-92, report 95-100)
- Status texts unified ("Texte laden X/N", "Pruefen X/N")
- Company name for the email subject is now derived from the URL (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") instead of the unreliable extracted_profile.companyName (which often matched juris.de)
- Email report now includes the banner + TCF vendor list (build_provider_list_html)
Backend (agent_doc_check_extras.py, new):
- build_scanned_urls_html: checked URLs as a table at the top of the report (shows the GF which sources were actually fetched)
- Cross-domain notice when >1 netloc (BMW: bmw.de / bmwgroup.com / bmwgroup.jobs; discoverability under Art. 12 GDPR)
- build_provider_list_html: banner box + TCF vendor table with columns Name | Kategorie | Zweck | Drittland | Rechtsgrundlage (name, category, purpose, third country, legal basis)
Backend (business_profiler.py):
- §34d GewO insurance-intermediary (Versicherungsvermittler) notices no longer count as "finance" industry (they caused BMW to be misclassified as B2B/finance)
- New industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette)
- B2B keywords: generic terms such as "unternehmen", "beratung", "consulting" removed (they matched in every corporate text)
- B2C fallback: with consumer signals ("widerruf", "kunde", editorial content) the result now leans toward b2c instead of b2b
Frontend (ComplianceCheckTab.tsx):
- Progress bar with width in % and an XX% display on the right
- Reads data.progress_pct from the polling response
Consent-Tester (dsi_discovery.py):
- Critical fix for cookie-policy extraction: wait_for_function until body.innerText > 500 chars (BMW's SPA rendering needed more time)
- _extract_text_robust: three-strategy extraction (selectors -> body cleanup -> P/LI/TD tags)
- _extract_text_from_iframes: reads OneTrust/Sourcepoint/Usercentrics iframe content (some cookie policies live there)
Addresses all findings from the BMW ground-truth comparison.
|
||
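One way to spread progress_pct across the phases named in commit e61e9d9e2a is linear interpolation inside each phase's range. The sketch below is an assumption about how that could work, not the code from the commit: the percentage ranges are taken from the message, while the English phase keys and the `progress_pct` function signature are hypothetical.

```python
# Phase ranges (start %, end %) as listed in the commit message.
# The English keys are hypothetical names for Laden/Profil/Pruefen/Banner/Report.
PHASE_RANGES = {
    "load": (0, 30),
    "profile": (35, 40),
    "check": (40, 80),
    "banner": (80, 92),
    "report": (95, 100),
}


def progress_pct(phase, done, total):
    """Interpolate overall progress from per-phase progress (done of total)."""
    start, end = PHASE_RANGES[phase]
    if total <= 0:
        return start  # no work items known yet: report the phase start
    fraction = min(max(done / total, 0.0), 1.0)  # clamp to [0, 1]
    return round(start + (end - start) * fraction)
```

For example, `progress_pct("check", 1, 2)` lands halfway through the 40-80 range, so a frontend polling the job state would show 60%.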
|
|
d72aa10691 |
feat: management summary for GF + batch GT test script
1. Management Summary (agent_doc_check_report.py):
- Plain-language action items for Geschaeftsfuehrer
- Maps technical checks to business actions ("Ihren DSB erwaehnen",
"Beschwerderecht ergaenzen", "Loeschfristen dokumentieren")
- Shows at top of compliance check email before detail report
- Max 10 actions, max 3 per document
2. Batch GT Test (zeroclaw/scripts/batch_gt_test.py):
- Runs all 10 GT websites through compliance-check API
- Prints comparison table with L1 scores, word counts, services
- Saves raw JSON results for analysis
- Usage: python3 batch_gt_test.py --sites 1,6 --backend-url URL
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
||
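The two caps on the management summary in commit d72aa10691 (max 10 actions overall, max 3 per document) could be applied as in the sketch below. The function name `select_actions` and the input shape, a priority-ordered list of `(doc_type, action)` pairs, are assumptions made for illustration.

```python
def select_actions(candidates, max_total=10, max_per_doc=3):
    """Pick actions in priority order, honoring both caps.

    candidates: iterable of (doc_type, action) pairs, most important first.
    """
    per_doc = {}
    selected = []
    for doc_type, action in candidates:
        if len(selected) >= max_total:
            break  # overall cap reached
        if per_doc.get(doc_type, 0) >= max_per_doc:
            continue  # this document already has enough actions
        per_doc[doc_type] = per_doc.get(doc_type, 0) + 1
        selected.append((doc_type, action))
    return selected
```

Skipping rather than stopping when a per-document cap is hit keeps one long document from crowding out actions for the others.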
|
|
7be34552bb |
feat(compliance-check): profile extraction + scenario classification
- New profile_extractor.py: extracts Company Profile fields (name, legal form, address, DPO, USt-IdNr) and Compliance Scope hints (Art. 9 data, third country, profiling) from document texts
- Scenario per document: regenerate (<30%), fix (30-95%), import (>95%)
- Widerruf for B2B: no longer skipped; instead all checks are flagged as INFO with a "not needed for B2B" hint
- Move _build_profile_html to report builder module
- DocCheckResult gets scenario field
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
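The per-document scenario thresholds from commit 7be34552bb translate into a small classifier. This is a sketch under assumptions: the message does not say which score the percentages refer to, so the parameter `score_pct` and the function name `classify_scenario` are hypothetical.

```python
def classify_scenario(score_pct):
    """Map a 0-100 score to the scenario named in the commit message.

    Below 30 -> 'regenerate', above 95 -> 'import', otherwise 'fix'.
    """
    if score_pct < 30:
        return "regenerate"
    if score_pct > 95:
        return "import"
    return "fix"
```

Both boundary values, 30 and 95, fall into "fix" here, matching the "30-95%" range in the message.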
|
|
a3287cd5e6 |
feat: HTML email report with hints + fix duplicate Social Media sections
1. Email report now renders as styled HTML (matching frontend design):
- Progress bars (green=completeness, blue=correctness)
- Hierarchical L1→L2 check display
- Red hint boxes under failed checks explaining what to fix
- Matched text evidence for passed checks
2. Section splitter deduplicates: two "Social Media" headings on the same page are merged into one section instead of creating duplicates.
3. Extracted report builder to agent_doc_check_report.py (175 LOC) to keep routes file under 500 LOC (386 LOC).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
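The section-splitter deduplication in commit a3287cd5e6 can be pictured as merging sections that share a heading. The sketch below assumes sections arrive as `(heading, text)` pairs in page order and that headings are compared case-insensitively; both the data shape and the name `merge_duplicate_sections` are hypothetical.

```python
def merge_duplicate_sections(sections):
    """Merge sections with the same heading, keeping first-seen order.

    sections: list of (heading, text) pairs in page order.
    """
    merged = {}  # dicts preserve insertion order
    for heading, text in sections:
        key = heading.strip().lower()
        if key in merged:
            first_heading, first_text = merged[key]
            merged[key] = (first_heading, first_text + "\n\n" + text)
        else:
            merged[key] = (heading, text)
    return list(merged.values())
```

With two "Social Media" headings on one page, the second section's text is appended to the first, so the report shows a single Social Media section.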