breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	313982c6f1	feat(profile+report): P17: 4 polish items. A) Cookie-policy architecture block: falls back to the DSE (privacy policy) text when cookie was deduped via P15. Now also detects single-doc sites (Safetykon pattern). B) Concrete task list: per-doc cap (3) removed, global cap raised 10→20. Safetykon now shows 7 tasks instead of 4. C) business_type classifier: the B2B service cluster from P14 is used as a boost. With 2+ service indicators (CE certification/compliance/auditing), b2b_score is raised. Safetykon: "B2C consulting" → "B2B (consulting)". D) Vendor extraction: falls back to the DSE text when cookie was deduped and there are no CMP payloads. The LLM then extracts vendors from the DSE text. Safetykon: 0 → 1 vendor (Google Analytics detected in the DSE text). Smoke test Safetykon: all 4 polish items take effect, no regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 12:22:05 +02:00
Benjamin Admin	479ce2225b	feat(profile): P14+P15+P16: B2B heuristic + doc-URL dedup + homepage profile. P14: _detect_no_direct_sales extended by 3 clusters: A) OEM configurator (BMW/Audi/Mercedes/VW/Porsche brand names + authorized-dealer pattern) B) B2B service providers (CE certification, compliance consulting, training, auditing, TISAX, ISO standards, occupational safety, ...) C) NGO/association/public sector (donation account, register of associations, non-profit status, ...). Threshold: pos >= 2 per cluster AND pos > neg. Previously: OEM only. P15: doc-URL dedup in the worker: when several doc types reference THE SAME document (Safetykon pattern: the user enters /datenschutz for dse, cookie AND widerruf), only the primary doc type (priority: dse > impressum > cookie > widerruf > agb > nutzungsbedingungen) receives the text. The others are reported as not separately available and as checked together with document 'X'. This eliminates the 8+8 systematic widerruf/cookie false positives. P16: profile detection now also uses the homepage text: the homepage HTML is fetched with a short timeout (8s), stripped, and merged into profile_input. Previously P14 only took effect when B2B indicators appeared in the mandatory DSE/Impressum text; on Safetykon they appear only in the homepage menu. Bonus: the TDM override submit button is disabled when the reason is shorter than 10 characters, which prevents users from clicking into the bug as happened today. Smoke test Safetykon (B2B compliance service provider): dse checked (no error); impressum checked (no error); cookie "not separately available, checked as part of the DSE"; agb "not applicable, no direct purchase contract"; widerruf "not applicable, no direct purchase contract"; nutzungsbedingungen "not applicable, no direct purchase contract". Before: 16 false positives. Now: 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 11:46:58 +02:00
Benjamin Admin	6c223c7c9b	feat(compliance-check): exec-summary + voll-audit + TDM-respect + cookie-KB-extended + saving-scan-funnel. P1: executive summary at the top of the email report (4 KPIs + 2 CTAs, dark gradient). P3: no_direct_sales flag for OEM configurator sites; AGB/Widerruf are shown as "NICHT ANWENDBAR" (not applicable, grey) instead of "NICHT GEFUNDEN" (not found, red). P5: full-audit unification: all findings (MC + mandatory disclosures + vendor + redundancy) are stored in /data/compliance_audits.db.unified_findings; new /api/compliance/agent/findings/<id> endpoint + FindingsTab in the audit UI with filter + CSV export. P7: crawl hardening: TDM reservation check (robots.txt / ai.txt / header / meta) before every run with a 24h cache; HeadlessChrome UA (company not yet founded; switch via BREAKPILOT_BRANDED_UA env); per-domain rate limit of 1 req/s + max 2 concurrent. P2: cookie knowledge DB extended additively (35 -> 74 cookies): Adobe, Meta, Microsoft, LinkedIn, TikTok, HubSpot, Marketo, Salesforce, Hotjar, FullStory, Mouseflow, Intercom, Drift, Zendesk, Cloudflare, Stripe, OneTrust/Cookiebot/Usercentrics, Matomo, Pinterest, Snapchat, X/Twitter, YouTube, Vimeo, Klaviyo, Mailchimp, Mixpanel, Segment, Amplitude, Optimizely, Datadog; wiring it into cookie_function_classifier yields a compliance_risk label (kritisch/hoch/mittel/gering, i.e. critical/high/medium/low) per vendor. A: k-anonymity helper (benchmark_k_anonymity) in preparation for P6. B: cross-tenant domain assertion in the /findings endpoint (expected_domain query param -> 403 on mismatch). C: saving-scan funnel: /api/compliance/agent/saving-scan/start with validation + 24h rate limit per domain + lead persistence in saving_scan_leads + auto-discovery via _run_compliance_check; 6 tests. D: risk badge in the email vendor row. Legal guardrails (memory feedback_oem_data_legal.md): only our own brief assessments + source pointers, no 1:1 copies of third-party CMP texts. TDM opt-out is respected per § 44b UrhG. NO schema changes; everything lives in sidecar SQLite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:48:34 +02:00
Benjamin Admin	bc21480a2a	fix(compliance-check): always render 8 doc types + 4 BMW GT-gap fixes Always-show-8 (user-requested): - agent_compliance_check_routes.py: _pad_results_with_missing pads the results list to always include all 8 canonical doc_types in canonical order. Missing types get a placeholder DocCheckResult with error= 'Nicht eingereicht' + scenario='missing'. - agent_doc_check_report.py: NICHT EINGEREICHT status label (neutral), friendly grey body block instead of red error. - ChecklistView.tsx: 'Nicht eingereicht' chip (neutral grey, not red 'Fehler'); SCENARIO_LABELS adds missing entry + header chip counter. Impressum-Regression fix (#18): - _fetch_text(url, doc_type): cookie/dse/social_media -> max_documents=1 (CMP capture authoritative, sub-pages dilute). Other types -> =3 (Impressum needs Versicherungsvermittler, Aufsicht, Berufsrecht sub- pages). 15s networkidle bail keeps timing safe. ODR/Verbraucherstreitbeilegung filter (#19): - _apply_profile_filter: when profile.needs_odr=True (B2C), override the check's default B2B-oriented hint with action-oriented B2C guidance pointing at Art. 14 EU-VO 524/2013 + §36 VSBG. Previously the check contradicted itself: 'profile says B2C' + hint 'only relevant for B2C online vendors'. Registergericht regex (#20): - impressum_checks.py: accept colon/dot/dash between keyword and city (BMW writes 'registergericht: münchen hrb 42243'). Add 'sitz und registergericht: X' as separate pattern. Industry detection (#21): - business_profiler.py: 'automotive' keywords broadened (antriebs, motor, leasing, werkstatt, probefahrt, plus brand names BMW/Mercedes/ Audi/VW/Porsche/Opel). 'it_services' keywords narrowed — software/ cloud/hosting are mentioned in every privacy policy and were biasing the result toward IT for any tech-aware company.	2026-05-17 01:03:58 +02:00
Benjamin Admin	e61e9d9e2a	feat(agent): progress_pct + 6 BMW-run improvements. Backend (agent_compliance_check_routes.py): - progress_pct (0-100%) in the job state, distributed across all phases (loading 0-30, profile 35-40, checking 40-80, banner 80-92, report 95-100) - Status texts unified ("Texte laden X/N", "Pruefen X/N") - Company name for the email subject is now derived from the URL (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") instead of the unreliable extracted_profile.companyName (which often matched juris.de) - Email report now contains the banner + TCF vendor list (build_provider_list_html) Backend (agent_doc_check_extras.py, new): - build_scanned_urls_html: checked URLs as a table at the top of the report (makes transparent for management which sources were actually fetched) - Cross-domain notice when there is more than one netloc (BMW: bmw.de / bmwgroup.com / bmwgroup.jobs; discoverability under Art. 12 GDPR) - build_provider_list_html: banner box + TCF vendor table with the columns name, category, purpose, third country, legal basis Backend (business_profiler.py): - §34d GewO insurance-intermediary notices no longer count toward the "finance" industry (BMW was misclassified as B2B/finance because of them) - New industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette) - B2B keywords: generic terms such as "unternehmen", "beratung", "consulting" removed (they matched in every corporate text) - B2C fallback: with consumer signals ("widerruf", "kunde", editorial content) the result leans toward b2c instead of b2b Frontend (ComplianceCheckTab.tsx): - Progress bar with width in % and an XX% display on the right - Reads data.progress_pct from the polling response Consent tester (dsi_discovery.py): - Critical fix for cookie-policy extraction: wait_for_function until body.innerText > 500 chars (BMW SPA rendering needed more time) - _extract_text_robust: 3-strategy extraction (selectors -> body cleanup -> P/LI/TD tags) - _extract_text_from_iframes: reads OneTrust/Sourcepoint/Usercentrics iframe contents (some cookie policies live there) Addresses all findings from the BMW ground-truth comparison.	2026-05-16 17:53:14 +02:00
Benjamin Admin	33bf2b7c5a	feat(service-detector): detect 118 services in legal texts (was 20). New service_detector.py uses service_registry (88 entries) plus 30+ extra text patterns to detect services mentioned in DSI/legal texts. Results on Spiegel: 31/32 services detected (97%, was 5/32 = 16%). Includes metadata: name, category, country, EU adequacy status. - Profiler now uses detect_services_in_text() instead of 20-entry list - Profile extractor adds detected_services with full metadata - Auto-generates scope hint for non-EU services (Drittlandtransfer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 16:00:15 +02:00
Benjamin Admin	5e317d2f0f	fix: text extraction 50k char limit was root cause of all Spiegel FNs. ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars. Spiegel DSI has 107,720 chars (13,705 words); only 47% was extracted. DSB, Art. 77, Betroffenenrechte were all in the truncated portion. Fixes: 1. Raise text limit from 50k to 200k chars in API response + discovery 2. click_button(): add iframe fallback for Sourcepoint/Quantcast 3. dsi_helpers: iterate ALL page.frames for consent buttons 4. Profiler: only check impressum (not full text) for regulated professions, and "rechtsanwalt" must be in first 500 chars (company description) 5. GT: save full Spiegel DSI text (13,705 words) as reference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 15:22:38 +02:00
Benjamin Admin	c702260ec1	fix: 5 regex bugs + text extraction scroll + GT update. Root cause: Spiegel DSI text was truncated (lazy-loading); the rights/DSB/complaints sections at the bottom were never extracted. Fixes: 1. Text extraction: scroll to bottom before innerText (dsi_discovery.py) 2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern 3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces) 4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc. 5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text. Ground Truth 06-spiegel.md fully rewritten with verified data from live website; 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 01:20:55 +02:00
Benjamin Admin	be9cfdc2d4	feat(compliance-check): skip Widerruf for B2B, limit MCs, fix industry. - Skip Widerrufsbelehrung check entirely for B2B/B2G businesses - Limit MC checks to top 20 per doc_type (by severity) to reduce noise (e.g. 75 impressum MCs → 20, avoiding 55 irrelevant FAILs) - Add consulting/manufacturing industry keywords (arbeitssicherheit, brandschutz, werkzeugbau, etc.) - Lower industry detection threshold from 2 to 1 keyword hit Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 17:03:57 +02:00
Benjamin Admin	407a9503e4	fix(profiler): fix B2G false positive + add consulting/manufacturing. - Remove generic B2G keywords (behörde, amt, öffentlich) that match in every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen" - Remove "server" from it_services (too generic, appears in every DSI) - Add consulting, manufacturing, media industries - Add B2B fallback for GmbH/AG without B2C signals - Add 10 ground truth files for unified compliance check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 12:20:44 +02:00
Benjamin Admin	0d0e705117	feat: Unified Compliance-Check — 8 document types in one form New 3-tab structure: Website-Scan, Compliance-Check, Banner-Check. Compliance-Check Tab (replaces Dokumenten-Pruefung + Impressum-Check): - 8 document rows: DSI, Impressum, Social Media, Cookie, AGB, Nutzungsbedingungen, Widerruf, DSB-Kontakt - Each row: URL input + "Text laden" + file upload + manual text - "Text laden" extracts via consent-tester, shows in editable textarea - User verifies/corrects text before checking - Empty fields = "not present" → own finding Business Profiler (business_profiler.py): - Detects B2B/B2C/B2G from all documents together - Recognizes regulated professions, online shops, editorial content - Context-aware: INFO checks become PASS/FAIL based on profile Backend: /compliance-check + /extract-text endpoints Frontend: ComplianceCheckTab.tsx + DocumentRow.tsx API proxies: compliance-check/route.ts + extract-text/route.ts Also: Impressum regex fixes (Telefon, AG, Geschaeftsfuehrung) and INFO severity for context-dependent checks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-11 20:56:10 +02:00
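
The P14 threshold from commit 479ce2225b (at least two positive hits in a cluster, and more positive than negative hits) could look roughly like the sketch below. The function name, the keyword lists, and the choice to count negative signals once across the whole text are assumptions made for illustration; the log does not show the real `_detect_no_direct_sales` implementation.

```python
from typing import Dict, List, Optional


def detect_no_direct_sales_cluster(
    text: str,
    clusters: Dict[str, List[str]],
    negative_keywords: List[str],
) -> Optional[str]:
    """Return the first cluster whose hits satisfy pos >= 2 and pos > neg.

    Hypothetical helper: keyword lists and the global `neg` count are
    assumptions, only the threshold rule comes from the commit message.
    """
    lowered = text.lower()
    # Negative signals (e.g. shop vocabulary) are counted once for the text.
    neg = sum(1 for kw in negative_keywords if kw in lowered)
    for name, keywords in clusters.items():
        pos = sum(1 for kw in keywords if kw in lowered)
        if pos >= 2 and pos > neg:
            return name
    return None


# Illustrative keyword lists, not the repository's actual ones.
EXAMPLE_CLUSTERS = {
    "b2b_service": ["ce-zertifizierung", "auditierung", "tisax"],
    "ngo": ["spendenkonto", "vereinsregister"],
}
EXAMPLE_NEGATIVES = ["warenkorb", "zur kasse"]
```

Requiring both conditions means a single incidental keyword never flips the flag, and a page with strong shop signals stays classified as a direct seller even if it mentions certification services.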

11 Commits
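
The P15 doc-URL dedup from commit 479ce2225b can be sketched as follows, as a minimal illustration and not the worker's actual code. Only the priority order is taken from the commit message; `normalize_url`, `dedup_doc_urls`, and the normalization rules (lowercase host, no fragment, no trailing slash) are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Priority order as stated in the commit message.
DOC_PRIORITY = ["dse", "impressum", "cookie", "widerruf", "agb", "nutzungsbedingungen"]


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}?{parts.query}"


def dedup_doc_urls(urls: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each doc type to None if it owns its URL, else to the owning doc type."""
    def rank(doc_type: str) -> int:
        # Unknown doc types sort after all known ones.
        return DOC_PRIORITY.index(doc_type) if doc_type in DOC_PRIORITY else len(DOC_PRIORITY)

    owner_by_url: Dict[str, str] = {}
    covered_by: Dict[str, Optional[str]] = {}
    # Walk doc types in priority order so the highest-priority claimant wins.
    for doc_type in sorted(urls, key=rank):
        key = normalize_url(urls[doc_type])
        if key in owner_by_url:
            covered_by[doc_type] = owner_by_url[key]
        else:
            owner_by_url[key] = doc_type
            covered_by[doc_type] = None
    return covered_by
```

A doc type mapped to another doc type would receive no text of its own and could be reported as checked together with that primary document, which is the behaviour the commit describes for the Safetykon case.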