Commit Graph

640 Commits

Author SHA1 Message Date
Benjamin Admin b50c4ec940 fix: RAG checker falls back to local Qdrant when Go SDK returns 401
Go SDK points to external Qdrant (qdrant-dev.breakpilot.ai) with expired API key.
Fallback: search directly in local Qdrant (bp-core-qdrant:6333) which has
all collections: bp_compliance_datenschutz, bp_compliance_gesetze, atomic_controls_dedup.

Search strategy:
1. Try Go SDK RAG endpoint (preferred, has embedding-based search)
2. Fallback: Qdrant scroll with text-based regulation filter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 14:23:52 +02:00
Benjamin Admin 090da0f71b feat: RAG-based document verification against 144K Control Library
New module: rag_document_checker.py
- Searches RAG (Qdrant) for controls relevant to document type
- Filters by regulation (DSGVO Art.13, TDDDG §25, BGB §355 etc.)
- LLM (Qwen 3.5:35b) verifies each control against document text
- Returns fulfilled/missing with evidence text + severity
- Supports: DSI, Cookie, Impressum, Widerruf, AGB, DSFA, AVV, Loeschkonzept

Integration in doc-check endpoint:
- Regex checklist runs first (fast, deterministic)
- RAG checks run after (semantic, catches what regex misses)
- Both results combined in single response

LLM prompt returns JSON: {fulfilled, evidence, issue, severity}
Think-tags stripped, JSON extracted from response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 13:19:15 +02:00
Benjamin Admin 13c5880f51 fix: Restrict sub-section detection to genuinely separate document types
Only Cookie and Widerruf sections are checked as separate documents.
Social Media, DSFA, Betroffenenrechte, Dienste von Drittanbietern are
part of the parent DSI and no longer generate false findings.

Added PLAN-rag-document-check.md for Phase 2:
- RAG-based checks with document-type-specific Controls
- DSFA checklist (Art. 35 + Landes-Listen)
- AVV checklist (Art. 28)
- Reference detection (sub-doc → parent doc)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 11:02:36 +02:00
Benjamin Admin 0416bb5d04 fix: Checklist expand — use index instead of URL (prevents all opening at once)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 10:56:44 +02:00
Benjamin Admin 539bc824fd feat: Auto-detect sub-sections within a page and check each separately
When a single URL contains multiple document sections (e.g. IHK DSI page
with Cookies, Social Media, Dienste von Drittanbietern), the system now:

1. Extracts full page text (main document check as before)
2. Splits text at heading boundaries (short uppercase lines)
3. Classifies each section: Cookie→cookie checklist, Social Media→DSI etc.
4. Runs type-specific checklist per section
5. Returns all results: main doc + sub-sections

Section type detection via SECTION_TYPE_MAP patterns:
- 'Cookie*' → §25 TDDDG checklist
- 'Dienste von Drittanbietern' → DSI checklist
- 'Social Media' → DSI checklist (Art. 26 joint controllership)
- 'Widerrufsrecht' → §355 BGB checklist
- 'Impressum' → §5 TMG checklist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 10:44:42 +02:00
Benjamin Admin 4c68caac4e feat: Multi-URL Document Check with full checklist visibility
New "Dokumenten-Pruefung" tab in Compliance Agent:
- User adds multiple URLs with document type (DSI, AGB, Impressum, Cookie, Widerruf)
- Each document loaded via Playwright, accordions expanded, text extracted
- Checked against type-specific legal checklist
- Optional: Cookie banner check via checkbox

Checklisten-UX (solves "100% looks like nothing was checked"):
- All checks shown per document: green checkmark + matched text excerpt
- Red X for missing fields with legal reference
- Builds user trust: "9 Punkte geprueft, alle bestanden"
- Expandable per document with completeness bar

New checklists:
- Impressum: §5 TMG (6 fields: name, address, contact, register, VAT, representative)
- Cookie-Richtlinie: §25 TDDDG (5 fields: types, purposes, retention, third-party, opt-out)

Backend:
- POST /agent/doc-check — async with polling (same pattern as /scan)
- DocCheckResult includes checks[] with passed/failed + matched_text
- dsi_document_checker returns all_checks in SCORE finding
- Email report shows per-document checklist

Files: agent_doc_check_routes.py (280 LOC), DocCheckTab.tsx (248 LOC),
ChecklistView.tsx (130 LOC), dsi_document_checker.py (+70 LOC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 10:08:40 +02:00
Benjamin Admin 254dbab566 fix: Keep every scan in history (no dedup by URL)
Each scan is a separate entry so users can track changes over time.
Increased max entries from 20 to 50.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:31:17 +02:00
Benjamin Admin ef8e7e599f feat: IACE +40 DGUV-extended patterns (HP094-HP133) — 133 total
Mechanical extended (HP094-HP103): Cutting, impact, friction, high-pressure
jet, ejection of fragments, tripping, gear/chain entanglement, clothing
winding, pendulating loads, tool kickback

Electrical extended (HP104-HP109): Arc flash, capacitor residual charge,
static discharge, grounding fault, induced voltage, overcurrent fire

Hazardous substances (HP110-HP117): Dust explosion, solvent vapors,
cutting fluid irritation, welding fumes, chemical burns, suffocation
in confined spaces, biological contamination, asbestos release

Radiation (HP118-HP123): Laser eye injury, UV from welding, infrared
heat, EMF induction, ionizing radiation, glare

Fire/Explosion (HP124-HP130): Electrical overheating, gas/vapor explosion,
hydraulic oil fire, metal dust fire, pressure vessel burst, oxygen
enrichment, spontaneous combustion

Ergonomic extended (HP131-HP133): RSI, whole-body vibration, hand-arm vibration

Total pattern library: 133 patterns (44 builtin + 14 press + 7 cobot +
28 operational + 40 DGUV) + ~58 extended rule library

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:22:57 +02:00
Benjamin Admin 8fb2061e9b fix: Eliminate GA false positive + handle short DSI documents
Service detection:
- Only search script tags + src/href attributes for service patterns
- Prevents false positives from DSE text mentioning services
  (e.g. IHK DSE describes etracker, 'google analytics' in text)
- Technical patterns (with regex chars) still checked in full HTML

Short documents:
- Documents with < 200 words flagged as 'Kurzhinweis' instead of
  'MANGELHAFT' — too short for Art. 13 completeness check
- Prevents 96-word navigation pages from showing 8 missing fields

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:21:37 +02:00
Benjamin Admin 8d6959e8b2 fix: Expand Art. 13 patterns for generic matching across all websites
Complaint (Art. 13(2)(d)):
+ 'recht auf beschwerde', 'art. 77', 'beschwerde...wenden/einlegen',
  'zuständige behörde' — IHK uses 'Recht auf Beschwerde gem. Art. 77'

Legal basis (Art. 13(1)(c)):
+ 'gemäß Art.', '§ X IHKG/BDSG/LDSG/BBiG/TDDDG', 'einwilligung gem',
  'verarbeitung auf grundlage' — catches statutory references

Third country (Art. 13(1)(f)):
+ 'Übermittlung ausserhalb', 'EWR/EEA', 'Data Privacy Framework'

Retention (Art. 13(2)(a)):
+ 'Dauer der Speicherung', 'Aufbewahrungsdauer/-pflicht/-zeit',
  'gesetzliche Aufbewahrung' — common German DSE headings

All patterns are generic, not IHK-specific.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 17:45:02 +02:00
Benjamin Admin 85e82d0dfa feat: IACE 28 operational hazard patterns (HP066-HP093)
Fault Clearing (HP066-HP072): Jammed parts releasing, hose bursts,
unexpected restart, stored energy, intervention in running machine,
material jam, falling parts during fault clearing

Maintenance (HP073-HP079): Missing LOTO, falls from platforms,
hot parts contact, hazardous substances, electric shock, ergonomic
access, uncontrolled hydraulic lowering

Setup/Changeover (HP080-HP085): Crushing during tool change, burns
from hot tools, heavy tool drops, unintended stroke in setup mode,
wrong parameters, test cycle hits personnel

Transport/Install/Decommission (HP086-HP090): Machine tipping,
crushing during installation, uncontrolled commissioning movement,
residual media, sharp edges

Cleaning (HP091-HP093): Slipping, chemical exposure, draw-in

Lifecycle keywords expanded: werkzeugwechsel, stoerung, fehlersuche,
klemm, blockier, stau → trigger fault_clearing phase patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 17:42:38 +02:00
Benjamin Admin a349111a01 fix: Raise full_text limit 10K→50K + combine all DSI texts for checks
Two fixes:
1. consent-tester: full_text truncation raised from 10,000 to 50,000 chars
   (IHK Internetangebot has ~50K chars, Beschwerderecht was after 10K cutoff)
2. Backend: dse_text now combines Playwright HTML + ALL DSI discovery texts
   for mandatory content checking. Previously only used first 8K chars from
   one source, missing Verantwortlicher/DSB that were in DSI documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:03:56 +02:00
Benjamin Admin 3ac8d0cba8 fix: IACE mitigations page — remove broken 'm.' prefix + accept 'protective' type
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:52:10 +02:00
Benjamin Admin e3ae35891f fix: 0% completeness bug — SCORE finding was not generated at 100%
Root cause: When all 9 Art. 13 checks passed (100%), no SCORE finding
was created (line: 'if pct < 100'). The backend then defaulted to
completeness=0 because it looked for the SCORE finding to extract the %.

Fix: Always generate SCORE finding, even at 100%. Added 'OK' severity
for fully compliant documents.

This was the cause of 8 documents showing '0% MANGELHAFT' despite
containing all required information.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:34:04 +02:00
Benjamin Admin 72761d6066 debug: Log DSI text lengths to diagnose 0% completeness bug
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 14:08:04 +02:00
Benjamin Admin e494cf62bb fix: Increase page load timeouts — IHK site needs >30s for networkidle
- Initial page.goto timeout: 30s → 60s (IHK loads many JS resources)
- Per-page navigation timeout: 20s → 45s (heavy JS sites)
- Reduced extra wait from 3s+1s back to 2s+0.5s (goto timeout handles slow loads)
- Playwright scanner page timeout: 20s → 45s

Root cause: IHK website has heavy JavaScript that takes >30s to reach
'networkidle' state, causing DSI discovery to fail immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 13:10:59 +02:00
Benjamin Admin d547e63663 fix: DSI dedup prefers 'Datenschutzinformation*' titles + better JS content extraction
Bug 1 fix: When merging documents with identical word_count, prefer
titles starting with 'Datenschutzinformation' over generic section
headings like 'Zweck und Rechtsgrundlage'. This restores the main
'Datenschutzinformationen zum Internetangebot' document.

Bug 2 fix: After navigating to a document page, wait 3s (was 2s) for
JS content loading, then try 10+ content selectors before falling back
to body text (with nav/header/footer removed). Handles IHK-style JS
navigation where content loads after page.goto() completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 12:26:42 +02:00
Benjamin Admin b4f90ed113 fix: IACE components page — remove broken 'c.' prefix from refactor
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 12:20:09 +02:00
Benjamin Admin daa47bb7ab feat: Scan history — shows last 20 scans with URL, date, findings count
- localStorage-based scan history (persists across sessions)
- Each completed scan adds entry: URL, timestamp, findings count, docs count
- 'Letzte Scans' section below results shows clickable history entries
- Click loads URL into form (and shows cached result if same URL)
- Max 20 entries, deduplicates by URL (latest scan wins)
- History visible in 'Website-Scan' tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:52:35 +02:00
Benjamin Admin 6c5e086356 fix: DSI dedup — skip anchor links, filter noise, merge duplicates + fix false positives
Dedup fixes:
- Anchor links (#cookies, #betroffenenrechte) on same page are skipped entirely
- Noise titles filtered: 'drucken', 'nach oben', 'Datenschutz' (too generic)
- Documents with < 50 words filtered (navigation snippets)
- Documents with identical word_count merged (same page, different title)
- URL-only titles filtered

False positive fixes (dsi_document_checker.py):
- 'Kontaktdaten des Verantwortlichen' pattern for controller check
- 'Zweck und Rechtsgrundlage' combined heading pattern
- 'Welche Daten werden verarbeitet' question-style headings
- 'Betroffenenrechte' as standalone heading
- 'Welche Rechte hat der Betroffene' question pattern
- 'Daten werden geloescht' retention pattern
- 'Auftragsverarbeiter' as recipient indicator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:41:07 +02:00
Benjamin Admin 8e40155459 feat: Scan state persists across navigation — resume polling on return
- URL, mode, tab, scan result persisted in localStorage
- Active scan_id stored — polling resumes when returning to page
- Scan results survive navigation to other SDK modules
- 'Scan laeuft noch...' shown when returning to in-progress scan
- Cleans up localStorage when scan completes or fails

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 10:47:39 +02:00
Benjamin Admin b5cf25f6ab fix: IACE overview null-check for risk_summary (empty projects)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 10:44:16 +02:00
Benjamin Admin 7c7513525e feat: Document-centric scan results + DSI deduplication
DSI Dedup (consent-tester):
- Only H1/H2 headings count as documents (not H3/H4 sub-sections)
- Sub-sections (Cookies, Betroffenenrechte, Social Media) are part of
  parent document's full text, not separate documents
- Reduces IHK result from 30 to ~11 real documents

Backend (agent_scan_routes):
- ScanFinding gets doc_title field linking each finding to its document
- doc_title set when creating DSI findings for document attribution

Frontend (ScanResult.tsx):
- 3 sections: Services table, Document cards, General findings
- Documents: expandable cards with completeness bar (green/yellow/red)
- Findings grouped under their parent document
- Each card shows: title, word count, findings count, % completeness
- Findings without doc_title go to "Allgemeine Findings" section

Email Summary (agent_scan_helpers):
- Findings listed under their parent document
- General findings in separate section
- No more flat mixed list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 09:56:29 +02:00
Benjamin Admin d816cf8d3a fix: missing closing brace in GetBuiltinHazardPatterns()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 09:36:23 +02:00
Benjamin Admin 8dd1581fae feat: IACE SIL/PL calculator + Cobot patterns + library extensions
SIL/PL Calculator: Deterministic S×E×P → PL (a-e) → SIL (1-3) mapping
Cobot Patterns (HP059-HP065): Human-robot collision, afterrun, misprogramming
Press Patterns split into separate file (500-line guardrail)
5 new components (C136-C140), 5 new tags, 18 keyword entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 09:29:03 +02:00
Benjamin Admin ea8353f1a0 fix: Scan progress display — separate progress state, guard ScanResult render
- scanProgress state tracks live progress (not mixed into scanData)
- ScanResult only renders when scanData.services exists (prevents crash)
- Purple progress bar with spinner shows current step during scan
- Fixes: TypeError 's.services.filter' when progress data set as scanData

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 08:29:03 +02:00
Benjamin Admin d80cb9c8e4 feat: IACE Interview Frontend — 3 Modi (Interview/Wizard/Formular)
CE-Risikobeurteilung Datenerfassung mit 3 wählbaren Eingabe-Modi:

1. Interview-Modus (Chat-artig): Fragen werden nacheinander gestellt
   wie im Kundengespräch. Antwort-Historie sichtbar.
2. Wizard-Modus: Schritt-für-Schritt durch 8 Sektionen.
3. Formular-Modus: Alle Sektionen als Accordion auf einer Seite.

20 strukturierte Fragen in 8 Abschnitten:
- Maschinenbeschreibung (Name, Typ, Baugruppen)
- Lebensphasen (Betrieb, Einrichten, Wartung)
- Bestimmungsgemäße Verwendung
- Vorhersehbare Fehlanwendung
- Qualifikation der Benutzer
- Räumliche/Zeitliche Grenzen
- Technische Daten (Kräfte, Spannungen, Temperaturen, Drehzahlen)
- Umgebungsbedingungen

answersToNarrativeText() konvertiert alle Antworten in den Freitext
der an POST /parse-narrative gesendet wird.
Ergebnis-Panel zeigt: Komponenten, Gefahren, Patterns, Energiequellen.

URL: /sdk/iace/[projectId]/interview

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 08:22:59 +02:00
Benjamin Admin cb607bf228 feat: Async scan with polling — no more timeout issues
Fundamental fix: scans now run asynchronously with progress polling.

Backend:
- POST /scan starts background task, returns scan_id immediately
- GET /scan/{scan_id} returns status + progress + result when done
- 7 progress steps shown: Website scan, DSI discovery, DSE analysis,
  SOLL/IST comparison, corrections, report, email
- In-memory job store (dict with scan_id → status/result)
- No timeout limits on scan duration

Frontend:
- POST starts scan, receives scan_id
- Polls GET every 5 seconds (max 120 attempts = 10 min)
- Shows live progress message during scan
- Displays result when completed, error when failed

Proxy:
- POST timeout reduced to 30s (just starts the job)
- GET timeout 10s (just status check)
- No more 504/connection-dropped errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 07:30:09 +02:00
Benjamin Admin d7b287889e fix: IACE parser handler — use MatchOutput.SuggestedHazards instead of MatchedPatterns fields
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 07:18:55 +02:00
Benjamin Admin d4b7943d54 feat: IACE deterministic narrative parser + library extensions
Library Extensions:
- 15 new components (C121-C135): knee lever, hydraulic ram, lubrication
  system, extraction system, vibrating plate, die tooling, transfer system,
  hoist, chute, oil drip tray, pressure relief valve, die space, flywheel,
  bin changeover station, inspection scale
- 8 new tags: person_under_load, two_hand_control_required,
  thermal_accumulation, mechanical_transmission, oil_mist_risk,
  rapid_energy_release, gravity_suspended_load, bypass_risk
- 14 new patterns (HP045-HP058): ram drop, die space crushing, oil mist
  inhalation, hot workpiece burns, suspended load, transfer draw-in,
  ejection fall, accumulator pressure release, impact noise, flywheel
  residual energy, guard bypass, two-hand misoperation, oil leakage,
  ergonomic bin changeover

Deterministic Parser (NO LLM):
- keyword_dictionary.go: ~100 entries mapping DE/EN keywords to
  component IDs, energy source IDs, and tags
- narrative_parser.go: ParseNarrative() extracts components, energy
  sources, lifecycle phases, roles, tech specs, and context tags from
  free-text machine descriptions via keyword matching + regex
- Tech spec regex: extracts kN, V, °C, bar, kW, rpm values and
  derives energy sources + severity tags automatically
- iace_handler_parser.go: POST /projects/:id/parse-narrative endpoint
  chains parser → pattern engine → hazard suggestions

Test: Paste Kniehebelpresse description → should detect 10+ components,
15+ hazards, all deterministically without LLM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:29:18 +02:00
Benjamin Admin 47ec792acf fix: raise scan proxy timeout from 3 to 10 min (50 pages + 20 DSI docs + LLM)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:25:33 +02:00
Benjamin Admin f3e44cf59f fix: restore all missing consent-tester service modules
banner_detector.py, script_analyzer.py, category_tester.py, authenticated_scanner.py
were only on the feature branch — needed for consent-tester to start.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:14:26 +02:00
Benjamin Admin 3fade26d89 fix: restore consent-tester requirements.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:06:50 +02:00
Benjamin Admin 797ed667a2 fix: restore consent-tester Dockerfile (was lost from main)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 00:05:19 +02:00
Benjamin Admin a3f7fb93f4 fix: Scan quality — raise page limit, use full DSI text for checks
Bug 1: max_pages was hardcoded to 15 in backend call — raised to 50
Bug 2: DSI documents checked against text_preview (500 chars) — now uses
       full_text (10,000 chars) for Art. 13 mandatory field checks
Bug 3: DSE text not found when Playwright misses DSE page — now falls
       back to DSI Discovery full_text as second source
Bug 4: Backend timeout 120s too short for 50 pages — raised to 300s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:51:03 +02:00
Benjamin Admin f967480cd9 fix: Add missing service_registry.py to main
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:34:00 +02:00
Benjamin Admin a18ef16378 fix: Add missing service modules required by agent_scan_routes
These files existed on the feature branch but were never cherry-picked
to main, causing ModuleNotFoundError on import:
- dse_parser.py — parses DSE HTML into structured sections
- dse_matcher.py — matches detected services against DSE sections
- mandatory_content_checker.py — checks Art. 13 DSGVO mandatory fields
- legal_basis_validator.py — validates legal basis (lit. a-f)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 23:22:30 +02:00
Benjamin Admin f960bd052a fix: Add missing 'import re' to agent_scan_routes.py
NameError: name 're' is not defined at line 146 — the import was
accidentally removed when extracting helper functions to agent_scan_helpers.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:59:53 +02:00
Benjamin Admin a846bd8910 fix: Exhaustive crawl — no arbitrary page/document limits
Both scanners now search until done, not until a counter runs out:

playwright_scanner.py:
- Default max_pages raised from 15 to 50
- Added 3-minute timeout as safety net
- Recursive link discovery on EVERY visited page (not just DSE pages)
- Stops when: all links visited OR max_pages OR timeout

dsi_discovery.py:
- Default max_documents raised from 30 to 100
- Added 5-minute timeout as safety net
- Recursive: on each visited page, searches for MORE DSI links
- Processes ALL discovered links exhaustively
- Stops when: no more pending links OR max_documents OR timeout

The scanners now behave like a real user: they follow every relevant
link they find, and on each new page they look for more links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:21:57 +02:00
Benjamin Admin 48146cddaf feat: DSI document discovery + completeness check in agent scan workflow
Agent scan now automatically:
1. Discovers all legal documents via consent-tester /dsi-discovery endpoint
2. Classifies each as DSE/AGB/Widerruf/Cookie/Impressum
3. Checks completeness against type-specific checklists:
   - DSE: 9 Art. 13 DSGVO mandatory fields (controller, DPO, purposes,
     legal basis, recipients, third-country, retention, rights, complaint)
   - AGB: §305ff BGB (scope, contract formation, liability, jurisdiction)
   - Widerruf: §355 BGB (right info, 14-day deadline, form, consequences)
4. Adds findings per document to scan results
5. Shows discovered documents with completeness % in email summary
6. Returns discovered_documents list in API response

New files:
- dsi_document_checker.py (229 LOC) — checklists + classifier
- agent_scan_helpers.py (109 LOC) — extracted summary builder + corrections

Refactor: agent_scan_routes.py 537→448 LOC (under 500 budget)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 22:10:13 +02:00
Benjamin Admin 4e63a6050d feat: Generic legal document discovery (DSI, AGB, Widerruf, Cookie-Richtlinie)
New service: dsi_discovery.py — finds ALL legal documents on any website:
- Technology-agnostic: HTML, SPA, WordPress, Typo3, custom CMS
- Structure-agnostic: accordions, sidebars, footers, inline links, tabs
- Format-agnostic: HTML pages, anchor sections, PDFs, cross-domain links
- Language-agnostic: 26 EU/EEA languages with document-type keywords

Document types discovered:
- Datenschutzinformationen / Privacy Policies (Art. 13/14 DSGVO)
- AGB / Terms of Service / Nutzungsbedingungen
- Widerrufsbelehrung / Right of Withdrawal (§355 BGB)
- Cookie-Richtlinie / Cookie Policy
- All cross-domain variants (e.g. help.instagram.com from instagram.com)

API: POST /dsi-discovery { url, max_documents }
Returns: list of documents with title, url, language, type, word_count, text_preview

Features:
- Expands all accordions, details, tabs, dropdowns before scanning
- Follows cross-domain links (same registrable domain)
- Re-expands after navigation back to source page
- Handles anchor links (#sections) separately from full pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 21:56:55 +02:00
Benjamin Admin 74dddbfa0f feat: Legally vetted cookie banner translations for 22 EU/EEA languages
22 languages: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HR, HU, IT,
LT, LV, NL, PL, PT, RO, SK, SL, SV

Each language includes 20 fields:
- Banner title, description, accept/reject/save buttons
- Privacy notice: "zur Kenntnis genommen" pattern (NOT "zugestimmt")
- Terms: "gelesen und stimme zu" pattern (contract = agreement correct)
- EWR-only toggle label + info text
- 4 category names + descriptions
- Vendor/blocked labels, imprint + privacy policy links

Legal precision:
- DSE = Informationspflicht Art. 13 DSGVO → "acknowledged/zur Kenntnis"
- Nutzungsbedingungen = Vertrag → "agree/zustimmen" is correct
- No passive consent formulations
- No coupling patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 10:56:36 +02:00
Benjamin Admin b997b4a475 feat: 9 new banner checks (12-20), total 20 compliance checks
Check 12: Click count — reject requires more clicks than accept (CNIL 150M EUR)
Check 13: Color contrast — reject button invisible (same bg as banner)
Check 14: Google Consent Mode — analytics_storage 'granted' as default
Check 15: Pre-consent cookies — tracking cookies set before any interaction
Check 16: Registration coupling — login button = consent (Art. 7(4) DSGVO)
Check 17: Language mismatch — banner vs page language (all 26 EU languages)
Check 18: Consent cookie expiry — >13 months violates CNIL guidelines
Check 19: Nudging — reject button below fold / requires scrolling
Check 20: Emotional language (Stirring) — "volle Funktionalitaet" etc.

Language detection covers: BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, GA,
HR, HU, IS, IT, LT, LV, MT, NL, NO, PL, PT, RO, SK, SL, SV

New file: banner_advanced_checks.py (396 LOC)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:39:00 +02:00
Benjamin Admin 5d138f265b feat: 3 new banner legal checks (11 total) + extract banner_text_checker
New checks (from EUIPO reference case):
- Check 9: Third-party DSE link — detects when consent dialog links to
  external domain's privacy policy instead of own DSE (Art. 13 DSGVO)
- Check 10: Dark-pattern language — detects "muessen/erforderlich" for
  non-essential cookies suggesting false technical necessity (EDPB Rn. 70)
- Check 11: Non-modal dismiss = consent — detects when clicking outside
  dialog closes it (possibly treating as consent, Planet49 violation)

Refactor: extracted _check_banner_text (375 LOC) from consent_scanner.py
into services/banner_text_checker.py to keep both files under 500 LOC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 08:02:46 +02:00
Benjamin Admin 2fb417c784 feat: Add IAM system knowledge + CMP FAQ to Compliance Advisor soul
New FAQ sections the advisor can answer:
- "Was ist WSO2 Identity Server?" — explains systemic GDPR template problem
- "Welche IAM-Systeme haben aehnliche Probleme?" — WSO2, Keycloak, Azure AD B2C,
  Auth0, Cognito, ForgeRock comparison table
- "Was ist das Koppelungsverbot?" — Art. 7(4) DSGVO with practical examples
- CMP product knowledge — all 9 modules, EWR-Only feature explanation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 07:37:17 +02:00
Benjamin Admin a1272390ff fix: Cookie banner closeable + sidebar accessible while banner is open
- X button to close banner (SDK admin context only)
- Overlay leaves sidebar area accessible (ml-16/ml-64)
- Click overlay backdrop to dismiss
- Preview page: close banner on API error (don't trap user)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 09:08:05 +02:00
Benjamin Admin 6af35dbf5f fix: Route all banner API calls through Next.js proxy (SSL cert fix)
Browser blocks direct calls to backend-compliance:8093 due to
self-signed SSL certificate. All banner API calls now go through
Next.js API proxy at /api/sdk/v1/banner/* which runs server-side.

- New catch-all proxy: /api/sdk/v1/banner/[[...path]]/route.ts
  Maps to backend-compliance:8002/api/compliance/banner/*
- Preview page: uses /api/sdk/v1/banner/ instead of https://macmini:8093
- CMP Dashboard: uses proxy for banner stats + compliance proxy for DSR/einwilligungen
- Fixes: banner not closeable due to API errors, consent not saving

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:53:38 +02:00
Benjamin Admin 4834e8ad5c feat: CMP Dashboard — aggregated consent, DSR, and compliance overview
- New route /sdk/cmp with full CMP dashboard
- 4 KPI cards: total consents, active consents, open DSR requests, configured sites
- Cookie category acceptance bars (necessary/statistics/marketing/functional)
- DSR breakdown: by status, by type (Art. 15-21), avg processing time, overdue count
- 9-point compliance checklist (banner, DSE, impressum, Art.7 proof, DSR, loeschfristen,
  vendor AVV, email templates, EWR-only mode) — each links to relevant module
- 8 module cards with icons linking to all CMP sub-modules
- Real API integration: /banner/admin/stats, /einwilligungen/consents/stats, /dsr/stats
- Dashboard link added as first entry in CMP sidebar section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:44:16 +02:00
Benjamin Admin 89ff62e534 feat: CMP sidebar section + cookie banner live preview page
CMP Section in Sidebar:
- New "CMP" group with purple accent, above other module sections
- Links: Cookie-Banner, Live-Vorschau, Consent-Records, Consent-Verwaltung,
  Vendor-Compliance, DSR Portal, Loeschfristen, E-Mail-Templates

Live Preview (/sdk/cookie-banner/preview):
- Simulated "MusterShop GmbH" website with full cookie banner
- Real API calls to POST /banner/consent (saves to DB)
- EWR-Only toggle functional in preview
- API Debug panel shows fingerprint, consent status, blocked vendors
- Response JSON viewer for API debugging
- Links to verify in Consent-Verwaltung, Consent-Records, DSR Portal
- "Consent zuruecksetzen" button to re-test
- Footer "Cookie-Einstellungen" link to reopen banner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:05:31 +02:00
Benjamin Admin 340fd27a1a cleanup: Remove duplicate cookie-banner route, redirect to /sdk/cookie-banner
- Deleted 6 unused components from /sdk/einwilligungen/cookie-banner/_components/
- Replaced page.tsx with Next.js redirect() to /sdk/cookie-banner
- Updated EinwilligungenNavTabs link to /sdk/cookie-banner
- Updated catalog page link to /sdk/cookie-banner
- Single source of truth: /sdk/cookie-banner (Step in "Rechtliche Texte")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:48:06 +02:00