breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	6da36d87c2	fix: Robust JSON parsing for LLM responses — handles unquoted keys, fallback extraction LLM returns {fulfilled: true} instead of {"fulfilled": true}. Now fixes unquoted keys, True→true, and falls back to text-based boolean extraction when JSON parsing fails entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 15:18:52 +02:00
Benjamin Admin	e50c4d659e	fix: Disable Qwen thinking mode for RAG checks (/no_think prefix) Qwen 3.5 uses all tokens for thinking, leaving response empty. Using /no_think prefix to get direct JSON output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 15:12:51 +02:00
Benjamin Admin	9f16e6d535	fix: Read Qwen response from 'thinking' field when 'response' is empty Qwen 3.5 with latest Ollama returns structured thinking in separate 'thinking' field, leaving 'response' empty. Now checks both fields. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 15:07:09 +02:00
Benjamin Admin	1ff34227bf	debug: Add logging to RAG check integration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:57:30 +02:00
Benjamin Admin	f4374cfe8d	feat: Semantic Qdrant search — embed query via bge-m3, vector search in local Qdrant Replaces scroll+filter approach with proper semantic search: 1. Embed query via bp-core-embedding-service (bge-m3, 1024 dim) 2. Vector search in Qdrant (bp_compliance_datenschutz + bp_compliance_gesetze) 3. Sort by cosine similarity score 4. No API key needed — local Qdrant on Mac Mini Falls back gracefully: SDK first, then semantic Qdrant, then empty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:46:06 +02:00
Benjamin Admin	7b8440191e	fix: Better error logging + increase LLM timeout to 120s for RAG check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:33:58 +02:00
Benjamin Admin	510f513811	fix: Qdrant search uses chunk_text + section/category filter Payload structure: chunk_text (not text), section (Article 13), category, regulation_id. Scrolls 100 points per collection, filters client-side against regulation keywords. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:28:32 +02:00
Benjamin Admin	b50c4ec940	fix: RAG checker falls back to local Qdrant when Go SDK returns 401 Go SDK points to external Qdrant (qdrant-dev.breakpilot.ai) with expired API key. Fallback: search directly in local Qdrant (bp-core-qdrant:6333) which has all collections: bp_compliance_datenschutz, bp_compliance_gesetze, atomic_controls_dedup. Search strategy: 1. Try Go SDK RAG endpoint (preferred, has embedding-based search) 2. Fallback: Qdrant scroll with text-based regulation filter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 14:23:52 +02:00
Benjamin Admin	090da0f71b	feat: RAG-based document verification against 144K Control Library New module: rag_document_checker.py - Searches RAG (Qdrant) for controls relevant to document type - Filters by regulation (DSGVO Art.13, TDDDG §25, BGB §355 etc.) - LLM (Qwen 3.5:35b) verifies each control against document text - Returns fulfilled/missing with evidence text + severity - Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept Integration in doc-check endpoint: - Regex checklist runs first (fast, deterministic) - RAG checks run after (semantic, catches what regex misses) - Both results combined in single response LLM prompt returns JSON: {fulfilled, evidence, issue, severity} Think-tags stripped, JSON extracted from response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 13:19:15 +02:00
Benjamin Admin	13c5880f51	fix: Restrict sub-section detection to genuinely separate document types Only Cookie and Widerruf sections are checked as separate documents. Social Media, DSFA, Betroffenenrechte, Dienste von Drittanbietern are part of the parent DSI and no longer generate false findings. Added PLAN-rag-document-check.md for Phase 2: - RAG-based checks with document-type-specific Controls - DSFA checklist (Art. 35 + Landes-Listen) - AVV checklist (Art. 28) - Reference detection (sub-doc → parent doc) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 11:02:36 +02:00
Benjamin Admin	0416bb5d04	fix: Checklist expand — use index instead of URL (prevents all opening at once) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 10:56:44 +02:00
Benjamin Admin	539bc824fd	feat: Auto-detect sub-sections within a page and check each separately When a single URL contains multiple document sections (e.g. IHK DSI page with Cookies, Social Media, Dienste von Drittanbietern), the system now: 1. Extracts full page text (main document check as before) 2. Splits text at heading boundaries (short uppercase lines) 3. Classifies each section: Cookie→cookie checklist, Social Media→DSI etc. 4. Runs type-specific checklist per section 5. Returns all results: main doc + sub-sections Section type detection via SECTION_TYPE_MAP patterns: - 'Cookie*' → §25 TDDDG checklist - 'Dienste von Drittanbietern' → DSI checklist - 'Social Media' → DSI checklist (Art. 26 joint controllership) - 'Widerrufsrecht' → §355 BGB checklist - 'Impressum' → §5 TMG checklist Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 10:44:42 +02:00
Benjamin Admin	4c68caac4e	feat: Multi-URL Document Check with full checklist visibility New "Dokumenten-Pruefung" tab in Compliance Agent: - User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf) - Each document loaded via Playwright, accordions expanded, text extracted - Checked against type-specific legal checklist - Optional: Cookie banner check via checkbox Checklisten-UX (solves "100% looks like nothing was checked"): - All checks shown per document: green checkmark + matched text excerpt - Red X for missing fields with legal reference - Builds user trust: "9 Punkte geprueft, alle bestanden" - Expandable per document with completeness bar New checklists: - Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative) - Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out) Backend: - POST /agent/doc-check — async with polling (same pattern as /scan) - DocCheckResult includes checks[] with passed/failed + matched_text - dsi_document_checker returns all_checks in SCORE finding - Email report shows per-document checklist Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC), ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-06 10:08:40 +02:00
Benjamin Admin	254dbab566	fix: Keep every scan in history (no dedup by URL) Each scan is a separate entry so users can track changes over time. Increased max entries from 20 to 50. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 18:31:17 +02:00
Benjamin Admin	ef8e7e599f	feat: IACE +40 DGUV-extended patterns (HP094-HP133) — 133 total Mechanical extended (HP094-HP103): Cutting, impact, friction, high-pressure jet, ejection of fragments, tripping, gear/chain entanglement, clothing winding, pendulating loads, tool kickback Electrical extended (HP104-HP109): Arc flash, capacitor residual charge, static discharge, grounding fault, induced voltage, overcurrent fire Hazardous substances (HP110-HP117): Dust explosion, solvent vapors, cutting fluid irritation, welding fumes, chemical burns, suffocation in confined spaces, biological contamination, asbestos release Radiation (HP118-HP123): Laser eye injury, UV from welding, infrared heat, EMF induction, ionizing radiation, glare Fire/Explosion (HP124-HP130): Electrical overheating, gas/vapor explosion, hydraulic oil fire, metal dust fire, pressure vessel burst, oxygen enrichment, spontaneous combustion Ergonomic extended (HP131-HP133): RSI, whole-body vibration, hand-arm vibration Total pattern library: 133 patterns (44 builtin + 14 press + 7 cobot + 28 operational + 40 DGUV) + ~58 extended rule library Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 18:22:57 +02:00
Benjamin Admin	8fb2061e9b	fix: Eliminate GA false positive + handle short DSI documents Service detection: - Only search script tags + src/href attributes for service patterns - Prevents false positives from DSE text mentioning services (e.g. IHK DSE describes etracker, 'google analytics' in text) - Technical patterns (with regex chars) still checked in full HTML Short documents: - Documents with < 200 words flagged as 'Kurzhinweis' instead of 'MANGELHAFT' — too short for Art. 13 completeness check - Prevents 96-word navigation pages from showing 8 missing fields Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 18:21:37 +02:00
Benjamin Admin	8d6959e8b2	fix: Expand Art. 13 patterns for generic matching across all websites Complaint (Art. 13(2)(d)): + 'recht auf beschwerde', 'art. 77', 'beschwerde...wenden/einlegen', 'zuständige behörde' — IHK uses 'Recht auf Beschwerde gem. Art. 77' Legal basis (Art. 13(1)(c)): + 'gemäß Art.', '§ X IHKG/BDSG/LDSG/BBiG/TDDDG', 'einwilligung gem', 'verarbeitung auf grundlage' — catches statutory references Third country (Art. 13(1)(f)): + 'Übermittlung ausserhalb', 'EWR/EEA', 'Data Privacy Framework' Retention (Art. 13(2)(a)): + 'Dauer der Speicherung', 'Aufbewahrungsdauer/-pflicht/-zeit', 'gesetzliche Aufbewahrung' — common German DSE headings All patterns are generic, not IHK-specific. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 17:45:02 +02:00
Benjamin Admin	85e82d0dfa	feat: IACE 28 operational hazard patterns (HP066-HP093) Fault Clearing (HP066-HP072): Jammed parts releasing, hose bursts, unexpected restart, stored energy, intervention in running machine, material jam, falling parts during fault clearing Maintenance (HP073-HP079): Missing LOTO, falls from platforms, hot parts contact, hazardous substances, electric shock, ergonomic access, uncontrolled hydraulic lowering Setup/Changeover (HP080-HP085): Crushing during tool change, burns from hot tools, heavy tool drops, unintended stroke in setup mode, wrong parameters, test cycle hits personnel Transport/Install/Decommission (HP086-HP090): Machine tipping, crushing during installation, uncontrolled commissioning movement, residual media, sharp edges Cleaning (HP091-HP093): Slipping, chemical exposure, draw-in Lifecycle keywords expanded: werkzeugwechsel, stoerung, fehlersuche, klemm, blockier, stau → trigger fault_clearing phase patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 17:42:38 +02:00
Benjamin Admin	a349111a01	fix: Raise full_text limit 10K→50K + combine all DSI texts for checks Two fixes: 1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars (IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff) 2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts for mandatory content checking. Previously only used first 8K chars from one source, missing Verantwortlicher/DSB that were in DSI documents. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 16:03:56 +02:00
Benjamin Admin	3ac8d0cba8	fix: IACE mitigations page — remove broken 'm.' prefix + accept 'protective' type Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 15:52:10 +02:00
Benjamin Admin	e3ae35891f	fix: 0% completeness bug — SCORE finding was not generated at 100% Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding was created (line: 'if pct < 100'). The backend then defaulted to completeness=0 because it looked for the SCORE finding to extract the %. Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity for fully compliant documents. This was the cause of 8 documents showing '0% MANGELHAFT' despite containing all required information. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 15:34:04 +02:00
Benjamin Admin	72761d6066	debug: Log DSI text lengths to diagnose 0% completeness bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 14:08:04 +02:00
Benjamin Admin	e494cf62bb	fix: Increase page load timeouts — IHK site needs >30s for networkidle - Initial page.goto timeout: 30s → 60s (IHK loads many JS resources) - Per-page navigation timeout: 20s → 45s (heavy JS sites) - Reduced extra wait from 3s+1s back to 2s+0.5s (goto timeout handles slow loads) - Playwright scanner page timeout: 20s → 45s Root cause: IHK website has heavy JavaScript that takes >30s to reach 'networkidle' state, causing DSI discovery to fail immediately. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 13:10:59 +02:00
Benjamin Admin	d547e63663	fix: DSI dedup prefers 'Datenschutzinformation*' titles + better JS content extraction Bug 1 fix: When merging documents with identical word_count, prefer titles starting with 'Datenschutzinformation' over generic section headings like 'Zweck und Rechtsgrundlage'. This restores the main 'Datenschutzinformationen zum Internetangebot' document. Bug 2 fix: After navigating to a document page, wait 3s (was 2s) for JS content loading, then try 10+ content selectors before falling back to body text (with nav/header/footer removed). Handles IHK-style JS navigation where content loads after page.goto() completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 12:26:42 +02:00
Benjamin Admin	b4f90ed113	fix: IACE components page — remove broken 'c.' prefix from refactor Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 12:20:09 +02:00
Benjamin Admin	daa47bb7ab	feat: Scan history — shows last 20 scans with URL, date, findings count - localStorage-based scan history (persists across sessions) - Each completed scan adds entry: URL, timestamp, findings count, docs count - 'Letzte Scans' section below results shows clickable history entries - Click loads URL into form (and shows cached result if same URL) - Max 20 entries, deduplicates by URL (latest scan wins) - History visible in 'Website-Scan' tab Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 11:52:35 +02:00
Benjamin Admin	6c5e086356	fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives Dedup fixes: - Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely - Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic) - Documents with < 50 words filtered (navigation snippets) - Documents with identical word_count merged (same page, different title) - URL-only titles filtered False positive fixes (dsi_document_checker.py): - 'Kontaktdaten des Verantwortlichen' pattern for controller check - 'Zweck und Rechtsgrundlage' combined heading pattern - 'Welche Daten werden verarbeitet' question-style headings - 'Betroffenenrechte' as standalone heading - 'Welche Rechte hat der Betroffene' question pattern - 'Daten werden geloescht' retention pattern - 'Auftragsverarbeiter' as recipient indicator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 11:41:07 +02:00
Benjamin Admin	8e40155459	feat: Scan state persists across navigation — resume polling on return - URL, mode, tab, scan result persisted in localStorage - Active scan_id stored — polling resumes when returning to page - Scan results survive navigation to other SDK modules - 'Scan laeuft noch...' shown when returning to in-progress scan - Cleans up localStorage when scan completes or fails Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 10:47:39 +02:00
Benjamin Admin	b5cf25f6ab	fix: IACE overview null-check for risk_summary (empty projects) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 10:44:16 +02:00
Benjamin Admin	7c7513525e	feat: Document-centric scan results + DSI deduplication DSI Dedup (consent-tester): - Only H1/H2 headings count as documents (not H3/H4 sub-sections) - Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of parent document's full text, not separate documents - Reduces IHK result from 30 to ~11 real documents Backend (agent_scan_routes): - ScanFinding gets doc_title field linking each finding to its document - doc_title set when creating DSI findings for document attribution Frontend (ScanResult.tsx): - 3 sections: Services table, Document cards, General findings - Documents: expandable cards with completeness bar (green/yellow/red) - Findings grouped under their parent document - Each card shows: title, word count, findings count, % completeness - Findings without doc_title go to "Allgemeine Findings" section Email Summary (agent_scan_helpers): - Findings listed under their parent document - General findings in separate section - No more flat mixed list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 09:56:29 +02:00
Benjamin Admin	d816cf8d3a	fix: missing closing brace in GetBuiltinHazardPatterns() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 09:36:23 +02:00
Benjamin Admin	8dd1581fae	feat: IACE SIL/PL calculator + Cobot patterns + library extensions SIL/PL Calculator: Deterministic S×E×P → PL (a-e) → SIL (1-3) mapping Cobot Patterns (HP059-HP065): Human-robot collision, afterrun, misprogramming Press Patterns split into separate file (500-line guardrail) 5 new components (C136-C140), 5 new tags, 18 keyword entries Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 09:29:03 +02:00
Benjamin Admin	ea8353f1a0	fix: Scan progress display — separate progress state, guard ScanResult render - scanProgress state tracks live progress (not mixed into scanData) - ScanResult only renders when scanData.services exists (prevents crash) - Purple progress bar with spinner shows current step during scan - Fixes: TypeError 's.services.filter' when progress data set as scanData Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 08:29:03 +02:00
Benjamin Admin	d80cb9c8e4	feat: IACE Interview Frontend with 3 modes (Interview/Wizard/Form) Data capture for the CE risk assessment with 3 selectable input modes: 1. Interview mode (chat-like): questions are asked one at a time, as in a customer conversation. Answer history is visible. 2. Wizard mode: step by step through 8 sections. 3. Form mode: all sections as an accordion on one page. 20 structured questions in 8 sections: - Machine description (name, type, assemblies) - Lifecycle phases (operation, setup, maintenance) - Intended use - Foreseeable misuse - User qualification - Spatial/temporal limits - Technical data (forces, voltages, temperatures, rotational speeds) - Environmental conditions answersToNarrativeText() converts all answers into the free text that is sent to POST /parse-narrative. Result panel shows: components, hazards, patterns, energy sources. URL: /sdk/iace/[projectId]/interview Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 08:22:59 +02:00
Benjamin Admin	cb607bf228	feat: Async scan with polling — no more timeout issues Fundamental fix: scans now run asynchronously with progress polling. Backend: - POST /scan starts background task, returns scan_id immediately - GET /scan/{scan_id} returns status + progress + result when done - 7 progress steps shown: Website scan, DSI discovery, DSE analysis, SOLL/IST comparison, corrections, report, email - In-memory job store (dict with scan_id → status/result) - No timeout limits on scan duration Frontend: - POST starts scan, receives scan_id - Polls GET every 5 seconds (max 120 attempts = 10 min) - Shows live progress message during scan - Displays result when completed, error when failed Proxy: - POST timeout reduced to 30s (just starts the job) - GET timeout 10s (just status check) - No more 504/connection-dropped errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 07:30:09 +02:00
Benjamin Admin	d7b287889e	fix: IACE parser handler — use MatchOutput.SuggestedHazards instead of MatchedPatterns fields Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 07:18:55 +02:00
Benjamin Admin	d4b7943d54	feat: IACE deterministic narrative parser + library extensions Library Extensions: - 15 new components (C121-C135): knee lever, hydraulic ram, lubrication system, extraction system, vibrating plate, die tooling, transfer system, hoist, chute, oil drip tray, pressure relief valve, die space, flywheel, bin changeover station, inspection scale - 8 new tags: person_under_load, two_hand_control_required, thermal_accumulation, mechanical_transmission, oil_mist_risk, rapid_energy_release, gravity_suspended_load, bypass_risk - 14 new patterns (HP045-HP058): ram drop, die space crushing, oil mist inhalation, hot workpiece burns, suspended load, transfer draw-in, ejection fall, accumulator pressure release, impact noise, flywheel residual energy, guard bypass, two-hand misoperation, oil leakage, ergonomic bin changeover Deterministic Parser (NO LLM): - keyword_dictionary.go: ~100 entries mapping DE/EN keywords to component IDs, energy source IDs, and tags - narrative_parser.go: ParseNarrative() extracts components, energy sources, lifecycle phases, roles, tech specs, and context tags from free-text machine descriptions via keyword matching + regex - Tech spec regex: extracts kN, V, °C, bar, kW, rpm values and derives energy sources + severity tags automatically - iace_handler_parser.go: POST /projects/:id/parse-narrative endpoint chains parser → pattern engine → hazard suggestions Test: Paste Kniehebelpresse description → should detect 10+ components, 15+ hazards, all deterministically without LLM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:29:18 +02:00
Benjamin Admin	47ec792acf	fix: raise scan proxy timeout from 3 to 10 min (50 pages + 20 DSI docs + LLM) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:25:33 +02:00
Benjamin Admin	f3e44cf59f	fix: restore all missing consent-tester service modules banner_detector.py, script_analyzer.py, category_tester.py, authenticated_scanner.py were only on the feature branch — needed for consent-tester to start. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:14:26 +02:00
Benjamin Admin	3fade26d89	fix: restore consent-tester requirements.txt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:06:50 +02:00
Benjamin Admin	797ed667a2	fix: restore consent-tester Dockerfile (was lost from main) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 00:05:19 +02:00
Benjamin Admin	a3f7fb93f4	fix: Scan quality — raise page limit, use full DSI text for checks Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50 Bug 2: DSI documents checked against text_preview (500 chars) — now uses full_text (10,000 chars) for Art. 13 mandatory field checks Bug 3: DSE text not found when Playwright misses DSE page — now falls back to DSI Discovery full_text as second source Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 23:51:03 +02:00
Benjamin Admin	f967480cd9	fix: Add missing service_registry.py to main Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 23:34:00 +02:00
Benjamin Admin	a18ef16378	fix: Add missing service modules required by agent_scan_routes These files existed on the feature branch but were never cherry-picked to main, causing ModuleNotFoundError on import: - dse_parser.py — parses DSE HTML into structured sections - dse_matcher.py — matches detected services against DSE sections - mandatory_content_checker.py — checks Art. 13 DSGVO mandatory fields - legal_basis_validator.py — validates legal basis (lit. a-f) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 23:22:30 +02:00
Benjamin Admin	f960bd052a	fix: Add missing 'import re' to agent_scan_routes.py NameError: name 're' is not defined at line 146 — the import was accidentally removed when extracting helper functions to agent_scan_helpers.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:59:53 +02:00
Benjamin Admin	a846bd8910	fix: Exhaustive crawl — no arbitrary page/document limits Both scanners now search until done, not until a counter runs out: playwright_scanner.py: - Default max_pages raised from 15 to 50 - Added 3-minute timeout as safety net - Recursive link discovery on EVERY visited page (not just DSE pages) - Stops when: all links visited OR max_pages OR timeout dsi_discovery.py: - Default max_documents raised from 30 to 100 - Added 5-minute timeout as safety net - Recursive: on each visited page, searches for MORE DSI links - Processes ALL discovered links exhaustively - Stops when: no more pending links OR max_documents OR timeout The scanners now behave like a real user: they follow every relevant link they find, and on each new page they look for more links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:21:57 +02:00
Benjamin Admin	48146cddaf	feat: DSI document discovery + completeness check in agent scan workflow Agent scan now automatically: 1. Discovers all legal documents via consent-tester /dsi-discovery endpoint 2. Classifies each as DSE/AGB/Widerruf/Cookie/Impressum 3. Checks completeness against type-specific checklists: - DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes, legal basis, recipients, third-country, retention, rights, complaint) - AGB: §305ff BGB (scope, contract formation, liability, jurisdiction) - Widerruf: §355 BGB (right info, 14-day deadline, form, consequences) 4. Adds findings per document to scan results 5. Shows discovered documents with completeness % in email summary 6. Returns discovered_documents list in API response New files: - dsi_document_checker.py (229 LOC) — checklists + classifier - agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 22:10:13 +02:00
Benjamin Admin	4e63a6050d	feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie) New service: dsi_discovery.py — finds ALL legal documents on any website: - Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS - Structure-agnostic: accordions, sidebars, footers, inline links, tabs - Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links - Language-agnostic: 26 EU/EEA languages with document-type keywords Document types discovered: - Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO) - AGB / Terms of Service / Nutzungsbedingungen - Widerrufsbelehrung / Right of Withdrawal (§355 BGB) - Cookie-Richtlinie / Cookie Policy - All cross-domain variants (e.g. help.instagram.com from instagram.com) API: POST /dsi-discovery { url, max_documents } Returns: list of documents with title, url, language, type, word_count, text_preview Features: - Expands all accordions, details, tabs, dropdowns before scanning - Follows cross-domain links (same registrable domain) - Re-expands after navigation back to source page - Handles anchor links (#sections) separately from full pages Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 21:56:55 +02:00
Benjamin Admin	74dddbfa0f	feat: Legally vetted cookie banner translations for 22 EU/EEA languages 22 languages: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HR, HU, IT, LT, LV, NL, PL, PT, RO, SK, SL, SV Each language includes 20 fields: - Banner title, description, accept/reject/save buttons - Privacy notice: "zur Kenntnis genommen" pattern (NOT "zugestimmt") - Terms: "gelesen und stimme zu" pattern (contract = agreement correct) - EWR-only toggle label + info text - 4 category names + descriptions - Vendor/blocked labels, imprint + privacy policy links Legal precision: - DSE = Informationspflicht Art. 13 DSGVO → "acknowledged/zur Kenntnis" - Nutzungsbedingungen = Vertrag → "agree/zustimmen" is correct - No passive consent formulations - No coupling patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 10:56:36 +02:00
Benjamin Admin	b997b4a475	feat: 9 new banner checks (12-20), total 20 compliance checks Check 12: Click count — reject requires more clicks than accept (CNIL 150M EUR) Check 13: Color contrast — reject button invisible (same bg as banner) Check 14: Google Consent Mode — analytics_storage 'granted' as default Check 15: Pre-consent cookies — tracking cookies set before any interaction Check 16: Registration coupling — login button = consent (Art. 7(4) DSGVO) Check 17: Language mismatch — banner vs page language (all 26 EU languages) Check 18: Consent cookie expiry — >13 months violates CNIL guidelines Check 19: Nudging — reject button below fold / requires scrolling Check 20: Emotional language (Stirring) — "volle Funktionalitaet" etc. Language detection covers: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA, HR, HU, IS, IT, LT, LV, MT, NL, NO, PL, PT, RO, SK, SL, SV New file: banner_advanced_checks.py (396 LOC) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-04 08:39:00 +02:00
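
The JSON repair described in commit 6da36d87c2 (together with the think-tag stripping mentioned in 090da0f71b) could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: the function name `parse_llm_json`, the regular expressions, and the `False` to `false` rewrite are hypothetical; only the unquoted-key fix, `True` to `true`, and the text-based boolean fallback come from the commit messages.

```python
import json
import re

# Hypothetical helper patterns; the real module may differ.
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)')
_PY_LITERALS = {"True": "true", "False": "false"}


def parse_llm_json(raw: str) -> dict:
    """Best-effort parse of an LLM reply that should contain one JSON object."""
    # Remove <think>...</think> blocks emitted before the actual answer.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)

    # Take the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidate = match.group(0)
        # Repair 1: quote bare keys, e.g. {fulfilled: true} -> {"fulfilled": true}
        repaired = _BARE_KEY.sub(r'\1"\2"\3', candidate)
        # Repair 2: Python-style booleans -> JSON booleans
        repaired = re.sub(
            r"\b(True|False)\b", lambda m: _PY_LITERALS[m.group(1)], repaired
        )
        # Try the untouched text first so valid JSON is never altered.
        for attempt in (candidate, repaired):
            try:
                parsed = json.loads(attempt)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, dict):
                return parsed

    # Fallback: text-based boolean extraction when no JSON could be parsed.
    m = re.search(r"fulfilled\W{0,5}(true|false)", text, flags=re.IGNORECASE)
    return {"fulfilled": bool(m) and m.group(1).lower() == "true"}
```

Strict parsing is attempted before any repair because the regex-based fixes are heuristic and can corrupt string values that happen to contain `, word:` or the word `True`.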

647 Commits
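
The DSI deduplication rules listed in commits 6c5e086356 and d547e63663 can be sketched as a single filter pass. This is a minimal sketch under assumptions: the function name `dedup_documents`, the dictionary keys `title`, `url`, and `word_count`, and the `source_url` parameter are hypothetical; the noise titles, the 50-word threshold, the merge on identical `word_count`, and the preference for titles starting with 'Datenschutzinformation' are taken from the commit messages.

```python
from urllib.parse import urldefrag

NOISE_TITLES = {"drucken", "nach oben", "datenschutz"}  # too generic to be documents
MIN_WORDS = 50                                          # below this: navigation snippet
PREFERRED_PREFIX = "datenschutzinformation"


def _is_preferred(title: str) -> bool:
    return title.strip().lower().startswith(PREFERRED_PREFIX)


def dedup_documents(docs: list[dict], source_url: str) -> list[dict]:
    """Filter noise and merge duplicates among discovered documents."""
    source_base = urldefrag(source_url)[0]
    by_word_count: dict[int, dict] = {}
    for doc in docs:
        title = doc["title"].strip()
        base, fragment = urldefrag(doc["url"])
        # Anchor links (#cookies, ...) on the source page are not separate documents.
        if fragment and base == source_base:
            continue
        # Noise titles and titles that are just a URL.
        if title.lower() in NOISE_TITLES:
            continue
        if title.lower().startswith(("http://", "https://")):
            continue
        # Very short texts are navigation snippets.
        if doc["word_count"] < MIN_WORDS:
            continue
        # Identical word_count is treated as the same page under another title.
        existing = by_word_count.get(doc["word_count"])
        if existing is None:
            by_word_count[doc["word_count"]] = doc
        elif _is_preferred(title) and not _is_preferred(existing["title"]):
            by_word_count[doc["word_count"]] = doc
    return list(by_word_count.values())
```

Using `word_count` as the merge key follows the commit description; it is a cheap heuristic and would also merge two genuinely different documents that happen to have the same length.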