docs: update README and add help-chat, deduplication docs

README.md:
- Add DAST, pentesting, code graph, AI chat, MCP, help chat to features table
- Add Gitea to tracker list, multi-language LLM triage note
- Update architecture diagram with all 5 workspace crates
- Add new API endpoints (graph, DAST, chat, help, pentest)
- Update dashboard pages table (remove Settings, add 6 new pages)
- Update project structure with new directories
- Add Keycloak, Chromium to external services

New docs:
- docs/features/help-chat.md -- Help chat assistant usage, API, config
- docs/features/deduplication.md -- Finding dedup across SAST, DAST, PR, issues

Updated:
- docs/features/overview.md -- Add help chat section, update tracker list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 09:49:11 +02:00
parent 263a4e654a
commit 4d7efea683
4 changed files with 203 additions and 46 deletions
@@ -28,9 +28,9 @@

 ## About

-Compliance Scanner is an autonomous agent that continuously monitors git repositories for security vulnerabilities, GDPR/OAuth compliance patterns, and dependency risks. It creates issues in external trackers (GitHub/GitLab/Jira) with evidence and remediation suggestions, reviews pull requests, and exposes a Dioxus-based dashboard for visualization.
+Compliance Scanner is an autonomous agent that continuously monitors git repositories for security vulnerabilities, GDPR/OAuth compliance patterns, and dependency risks. It creates issues in external trackers (GitHub/GitLab/Jira/Gitea) with evidence and remediation suggestions, reviews pull requests with multi-pass LLM analysis, runs autonomous penetration tests, and exposes a Dioxus-based dashboard for visualization.

-> **How it works:** The agent runs as a lazy daemon -- it only scans when new commits are detected, triggered by cron schedules or webhooks. LLM-powered triage filters out false positives and generates actionable remediation.
+> **How it works:** The agent runs as a lazy daemon -- it only scans when new commits are detected, triggered by cron schedules or webhooks. Language-aware, LLM-powered triage filters out false positives and generates actionable remediation.

 ## Features

@@ -41,31 +41,38 @@ Compliance Scanner is an autonomous agent that continuously monitors git reposit
 | **CVE Monitoring** | OSV.dev batch queries, NVD CVSS enrichment, SearXNG context |
 | **GDPR Patterns** | Detect PII logging, missing consent, hardcoded retention, missing deletion |
 | **OAuth Patterns** | Detect implicit grant, missing PKCE, token in localStorage, token in URLs |
-| **LLM Triage** | Confidence scoring via LiteLLM to filter false positives |
-| **Issue Creation** | Auto-create issues in GitHub, GitLab, or Jira with code evidence |
-| **PR Reviews** | Post security review comments on pull requests |
-| **Dashboard** | Fullstack Dioxus UI with findings, SBOM, issues, and statistics |
-| **Webhooks** | GitHub (HMAC-SHA256) and GitLab webhook receivers for push/PR events |
+| **LLM Triage** | Multi-language-aware confidence scoring (Rust, Python, Go, Java, Ruby, PHP, C++) |
+| **Issue Creation** | Auto-create issues in GitHub, GitLab, Jira, or Gitea with dedup via fingerprints |
+| **PR Reviews** | Multi-pass security review (logic, security, convention, complexity) with dedup |
+| **DAST Scanning** | Black-box security testing with endpoint discovery and parameter fuzzing |
+| **AI Pentesting** | Autonomous LLM-orchestrated penetration testing with encrypted reports |
+| **Code Graph** | Interactive code knowledge graph with impact analysis |
+| **AI Chat (RAG)** | Natural language Q&A grounded in repository source code |
+| **Help Assistant** | Documentation-grounded help chat accessible from every dashboard page |
+| **MCP Server** | Expose live security data to Claude, Cursor, and other AI tools |
+| **Dashboard** | Fullstack Dioxus UI with findings, SBOM, issues, DAST, pentest, and graph |
+| **Webhooks** | GitHub, GitLab, and Gitea webhook receivers for push/PR events |
+| **Finding Dedup** | SHA-256 fingerprint dedup for SAST, CWE-based dedup for DAST findings |

 ## Architecture

 ```
-┌─────────────────────────────────────────────────────────────┐
-│                    Cargo Workspace                           │
-├──────────────┬──────────────────┬───────────────────────────┤
-│ compliance-  │ compliance-      │ compliance-               │
-│ core         │ agent            │ dashboard                 │
-│ (lib)        │ (bin)            │ (bin, Dioxus 0.7.3)       │
-│              │                  │                           │
-│ Models       │ Scan Pipeline    │ Fullstack Web UI          │
-│ Traits       │ LLM Client      │ Server Functions           │
-│ Config       │ Issue Trackers   │ Charts + Tables           │
-│ Errors       │ Scheduler        │ Settings Page             │
-│              │ REST API         │                           │
-│              │ Webhooks         │                           │
-└──────────────┴──────────────────┴───────────────────────────┘
-                        │
-                   MongoDB (shared)
+┌──────────────────────────────────────────────────────────────────────────┐
+│                          Cargo Workspace                                 │
+├──────────────┬──────────────────┬──────────────┬──────────┬─────────────┤
+│ compliance-  │ compliance-      │ compliance-  │ complian-│ compliance- │
+│ core (lib)   │ agent (bin)      │ dashboard    │ ce-graph │ mcp (bin)   │
+│              │                  │ (bin)        │ (lib)    │             │
+│ Models       │ Scan Pipeline    │ Dioxus 0.7   │ Tree-    │ MCP Server  │
+│ Traits       │ LLM Client       │ Fullstack UI │ sitter   │ Live data   │
+│ Config       │ Issue Trackers   │ Help Chat    │ Graph    │ for AI      │
+│ Errors       │ Pentest Engine   │ Server Fns   │ Embedds  │ tools       │
+│              │ DAST Tools       │              │ RAG      │             │
+│              │ REST API         │              │          │             │
+│              │ Webhooks         │              │          │             │
+└──────────────┴──────────────────┴──────────────┴──────────┴─────────────┘
+                                 │
+                            MongoDB (shared)
 ```

 ## Scan Pipeline (7 Stages)
@@ -84,11 +91,16 @@ Compliance Scanner is an autonomous agent that continuously monitors git reposit
 |-------|-----------|
 | Shared Library | `compliance-core` -- models, traits, config |
 | Agent | Axum REST API, git2, tokio-cron-scheduler, Semgrep, Syft |
-| Dashboard | Dioxus 0.7.3 fullstack, Tailwind CSS |
+| Dashboard | Dioxus 0.7.3 fullstack, Tailwind CSS 4 |
+| Code Graph | `compliance-graph` -- tree-sitter parsing, embeddings, RAG |
+| MCP Server | `compliance-mcp` -- Model Context Protocol for AI tools |
+| DAST | `compliance-dast` -- dynamic application security testing |
 | Database | MongoDB with typed collections |
-| LLM | LiteLLM (OpenAI-compatible API) |
-| Issue Trackers | GitHub (octocrab), GitLab (REST v4), Jira (REST v3) |
+| LLM | LiteLLM (OpenAI-compatible API for chat, triage, embeddings) |
+| Issue Trackers | GitHub (octocrab), GitLab (REST v4), Jira (REST v3), Gitea |
 | CVE Sources | OSV.dev, NVD, SearXNG |
+| Auth | Keycloak (OAuth2/PKCE, SSO) |
+| Browser Automation | Chromium (headless, for pentesting and PDF generation) |

 ## Getting Started

@@ -151,20 +163,35 @@ The agent exposes a REST API on port 3001:
 | `GET` | `/api/v1/sbom` | List dependencies |
 | `GET` | `/api/v1/issues` | List cross-tracker issues |
 | `GET` | `/api/v1/scan-runs` | Scan execution history |
+| `GET` | `/api/v1/graph/:repo_id` | Code knowledge graph |
+| `POST` | `/api/v1/graph/:repo_id/build` | Trigger graph build |
+| `GET` | `/api/v1/dast/targets` | List DAST targets |
+| `POST` | `/api/v1/dast/targets` | Add DAST target |
+| `GET` | `/api/v1/dast/findings` | List DAST findings |
+| `POST` | `/api/v1/chat/:repo_id` | RAG-powered code chat |
+| `POST` | `/api/v1/help/chat` | Documentation-grounded help chat |
+| `POST` | `/api/v1/pentest/sessions` | Create pentest session |
+| `POST` | `/api/v1/pentest/sessions/:id/export` | Export encrypted pentest report |
 | `POST` | `/webhook/github` | GitHub webhook (HMAC-SHA256) |
 | `POST` | `/webhook/gitlab` | GitLab webhook (token verify) |
+| `POST` | `/webhook/gitea` | Gitea webhook |

 ## Dashboard Pages

 | Page | Description |
 |------|-------------|
-| **Overview** | Stat cards, severity distribution chart |
-| **Repositories** | Add/manage tracked repos, trigger scans |
-| **Findings** | Filterable table by severity, type, status |
+| **Overview** | Stat cards, severity distribution, AI chat cards, MCP status |
+| **Repositories** | Add/manage tracked repos, trigger scans, webhook config |
+| **Findings** | Filterable table by severity, type, status, scanner |
 | **Finding Detail** | Code evidence, remediation, suggested fix, linked issue |
-| **SBOM** | Dependency inventory with vulnerability badges |
-| **Issues** | Cross-tracker view (GitHub + GitLab + Jira) |
-| **Settings** | Configure LiteLLM, tracker tokens, SearXNG URL |
+| **SBOM** | Dependency inventory with vulnerability badges, license summary |
+| **Issues** | Cross-tracker view (GitHub + GitLab + Jira + Gitea) |
+| **Code Graph** | Interactive architecture visualization, impact analysis |
+| **AI Chat** | RAG-powered Q&A about repository code |
+| **DAST** | Dynamic scanning targets, findings, and scan history |
+| **Pentest** | AI-driven pentest sessions, attack chain visualization |
+| **MCP Servers** | Model Context Protocol server management |
+| **Help Chat** | Floating assistant (available on every page) for product Q&A |

 ## Project Structure

@@ -173,19 +200,24 @@ compliance-scanner/
 ├── compliance-core/        Shared library (models, traits, config, errors)
 ├── compliance-agent/       Agent daemon (pipeline, LLM, trackers, API, webhooks)
 │   └── src/
-│       ├── pipeline/       7-stage scan pipeline
-│       ├── llm/            LiteLLM client, triage, descriptions, fixes, PR review
-│       ├── trackers/       GitHub, GitLab, Jira integrations
-│       ├── api/            REST API (Axum)
-│       └── webhooks/       GitHub + GitLab webhook receivers
+│       ├── pipeline/       7-stage scan pipeline, dedup, PR reviews, code review
+│       ├── llm/            LiteLLM client, triage, descriptions, fixes, review prompts
+│       ├── trackers/       GitHub, GitLab, Jira, Gitea integrations
+│       ├── pentest/        AI-driven pentest orchestrator, tools, reports
+│       ├── rag/            RAG pipeline, chunking, embedding
+│       ├── api/            REST API (Axum), help chat
+│       └── webhooks/       GitHub, GitLab, Gitea webhook receivers
 ├── compliance-dashboard/   Dioxus fullstack dashboard
 │   └── src/
-│       ├── components/     Reusable UI components
-│       ├── infrastructure/ Server functions, DB, config
-│       └── pages/          Full page views
+│       ├── components/     Reusable UI (sidebar, help chat, attack chain, etc.)
+│       ├── infrastructure/ Server functions, DB, config, auth
+│       └── pages/          Full page views (overview, DAST, pentest, graph, etc.)
+├── compliance-graph/       Code knowledge graph (tree-sitter, embeddings, RAG)
+├── compliance-dast/        Dynamic application security testing
+├── compliance-mcp/         Model Context Protocol server
+├── docs/                   VitePress documentation site
 ├── assets/                 Static assets (CSS, icons)
-├── styles/                 Tailwind input stylesheet
-└── bin/                    Dashboard binary entrypoint
+└── styles/                 Tailwind input stylesheet
 ```

 ## External Services
@@ -193,10 +225,12 @@ compliance-scanner/
 | Service | Purpose | Default URL |
 |---------|---------|-------------|
 | MongoDB | Persistence | `mongodb://localhost:27017` |
-| LiteLLM | LLM proxy for triage and generation | `http://localhost:4000` |
+| LiteLLM | LLM proxy (chat, triage, embeddings) | `http://localhost:4000` |
 | SearXNG | CVE context search | `http://localhost:8888` |
+| Keycloak | Authentication (OAuth2/PKCE, SSO) | `http://localhost:8080` |
 | Semgrep | SAST scanning | CLI tool |
 | Syft | SBOM generation | CLI tool |
+| Chromium | Headless browser (pentesting, PDF) | Managed via Docker |

 ---

@@ -0,0 +1,61 @@
+# Finding Deduplication
+
+The Compliance Scanner automatically deduplicates findings across all scanning surfaces to reduce noise and avoid duplicate issues.
+
+## SAST Finding Dedup
+
+Static analysis findings are deduplicated using SHA-256 fingerprints computed from:
+
+- Repository ID
+- Scanner rule ID (e.g., Semgrep check ID)
+- File path
+- Line number
+
+Before inserting a new finding, the pipeline checks whether a finding with the same fingerprint already exists. If one does, the new finding is skipped.
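
The fingerprint-and-skip check described above could be sketched roughly as follows. This is an illustration in Python rather than the project's Rust code: the `|` separator, the field order, the function names, and the in-memory `seen` set (standing in for the database lookup) are all assumptions.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """SHA-256 over the four fields listed above.

    The "|" separator and the field order are assumptions for illustration.
    """
    material = "|".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def insert_if_new(seen: set, finding: dict) -> bool:
    """Return True if the finding was stored, False if it was skipped.

    `seen` is a hypothetical stand-in for a fingerprint lookup in the database.
    """
    fp = sast_fingerprint(
        finding["repo_id"], finding["rule_id"], finding["file_path"], finding["line"]
    )
    if fp in seen:
        return False  # same fingerprint already stored: skip
    seen.add(fp)
    return True
```

Because the line number is part of the hashed material, a finding that shifts to a different line produces a different fingerprint under this scheme.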
+
+## DAST / Pentest Finding Dedup
+
+Dynamic testing findings go through two-phase deduplication:
+
+### Phase 1: Exact Dedup
+
+Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from duplicate findings is combined into a single finding, keeping the highest severity.
+
+**Title canonicalization** handles common variations:
+- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
+- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
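
A minimal sketch of title canonicalization along these lines, in Python for illustration only. The regular expressions, the lowercasing step, the removal of a dangling preposition, and the function name are assumptions; only the two synonym mappings come from the text above.

```python
import re

# Only these two mappings are named in the docs; a real table may be larger.
SYNONYMS = {
    "hsts": "strict-transport-security",
    "csp": "content-security-policy",
}

URL_RE = re.compile(r"https?://\S+")
# Rough domain matcher: dot-separated labels ending in an alphabetic TLD.
# It will also match file names such as "config.yaml", so treat it as a sketch.
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")
# Connective left dangling once the host is removed, e.g. "... header for".
TRAILING_PREP_RE = re.compile(r"\s+(?:for|on|at|in)\s*$")


def canonicalize_title(title: str) -> str:
    t = title.lower()                      # case-insensitive comparison (assumption)
    t = URL_RE.sub("", t)                  # strip full URLs first
    t = DOMAIN_RE.sub("", t)               # then bare domain names
    t = re.sub(r"\s+", " ", t).strip()     # collapse whitespace
    t = TRAILING_PREP_RE.sub("", t)        # drop the dangling "for" / "on"
    return " ".join(SYNONYMS.get(word, word) for word in t.split(" "))
```

With this sketch, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header on https://example.com/login" reduce to the same canonical string, so they would compare as equal in the exact-dedup phase.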
+
+### Phase 2: CWE-Based Dedup
+
+After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both `security_header_missing` and `tls_misconfiguration`).
+
+The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
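
The grouping and primary-selection rule above can be sketched as follows (Python, for illustration). The dictionary field names, the severity labels and their ranking, and the choice to pass findings without a CWE through untouched are assumptions, not the scanner's actual data model.

```python
from collections import defaultdict

# Hypothetical severity scale; higher rank wins.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def merge_by_cwe(findings: list) -> list:
    groups = defaultdict(list)
    passthrough = []
    for f in findings:
        if f.get("cwe"):
            groups[(f["cwe"], f["endpoint"])].append(f)
        else:
            passthrough.append(f)  # no CWE: nothing to group on (assumption)

    merged = []
    for group in groups.values():
        # Highest severity, then most evidence, then longest description.
        primary = max(
            group,
            key=lambda f: (
                SEVERITY_RANK[f["severity"]],
                len(f["evidence"]),
                len(f["description"]),
            ),
        )
        result = dict(primary)
        result["evidence"] = list(primary["evidence"])
        for other in group:
            if other is not primary:
                result["evidence"].extend(other["evidence"])  # preserve evidence
        merged.append(result)
    return merged + passthrough
```

Comparing a tuple as the `max` key applies the three criteria in order, so evidence count is only consulted when severities tie, and description length only when both tie.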
+
+### When Dedup Applies
+
+- **At insertion time**: During a pentest session, before each finding is stored in MongoDB
+- **At report export**: When generating a pentest report, all session findings are deduplicated before rendering
+
+## PR Review Comment Dedup
+
+PR review comments are deduplicated to prevent posting the same finding multiple times:
+
+- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
+- Within a single review run, duplicate findings are skipped
+- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
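
One way the fingerprint could be embedded and read back is sketched below in Python. The marker format `<!-- fingerprint:... -->`, the use of SHA-256 for this fingerprint, the separator, and all function names are assumptions; the docs only state which fields are included and that the value is carried in an HTML comment.

```python
import hashlib
import re

# Hypothetical marker format; invisible when the comment is rendered as Markdown.
MARKER_RE = re.compile(r"<!-- fingerprint:([0-9a-f]{64}) -->")


def comment_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    material = "|".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def render_comment(body: str, fingerprint: str) -> str:
    """Append the fingerprint as an HTML comment."""
    return f"{body}\n\n<!-- fingerprint:{fingerprint} -->"


def extract_fingerprints(review_body: str) -> set:
    """Collect fingerprints already present in a previously posted review."""
    return set(MARKER_RE.findall(review_body))
```

A later run could call `extract_fingerprints` on existing review bodies and skip any finding whose fingerprint is already in the returned set.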
+
+## Issue Tracker Dedup
+
+Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:
+
+1. Searches for an existing issue matching the finding's fingerprint
+2. Falls back to searching by issue title
+3. Skips creation if a match is found
+
+## Code Review Dedup
+
+Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:
+
+- Findings within 3 lines of each other on the same file with similar normalized titles are considered duplicates
+- The finding with the highest severity is kept
+- CWE information is merged from duplicates
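
A rough Python sketch of the proximity rule above. It simplifies "similar normalized titles" to exact equality after normalization and compares findings pairwise rather than building keys; both choices, along with the field names and severity scale, are assumptions for illustration.

```python
import re

# Hypothetical severity scale; higher rank wins.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def is_duplicate(a: dict, b: dict, max_distance: int = 3) -> bool:
    return (
        a["file"] == b["file"]
        and abs(a["line"] - b["line"]) <= max_distance
        and normalize_title(a["title"]) == normalize_title(b["title"])
    )


def dedup_review_findings(findings: list) -> list:
    kept = []
    for f in findings:
        for i, k in enumerate(kept):
            if is_duplicate(f, k):
                f_wins = SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[k["severity"]]
                winner, loser = (f, k) if f_wins else (k, f)
                merged = dict(winner)
                # Keep CWE information even if only the dropped finding had it.
                merged["cwe"] = winner.get("cwe") or loser.get("cwe")
                kept[i] = merged
                break
        else:
            kept.append(dict(f))
    return kept
```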
@@ -0,0 +1,60 @@
+# Help Chat Assistant
+
+The Help Chat is a floating assistant available on every page of the dashboard. It answers questions about the Compliance Scanner using the project documentation as its knowledge base.
+
+## How It Works
+
+1. Click the **?** button in the bottom-right corner of any page
+2. Type your question and press Enter
+3. The assistant responds with answers grounded in the project documentation
+
+The chat supports multi-turn conversations -- you can ask follow-up questions and the assistant will remember the context of your conversation.
+
+## What You Can Ask
+
+- **Getting started**: "How do I add a repository?" / "How do I trigger a scan?"
+- **Features**: "What is SBOM?" / "How does the code knowledge graph work?"
+- **Configuration**: "How do I set up webhooks?" / "What environment variables are needed?"
+- **Scanning**: "What does the scan pipeline do?" / "How does LLM triage work?"
+- **DAST & Pentesting**: "How do I run a pentest?" / "What DAST tools are available?"
+- **Integrations**: "How do I connect to GitHub?" / "What is MCP?"
+
+## Technical Details
+
+The help chat loads all project documentation (README, guides, feature docs, reference) at startup and caches it in memory. When you ask a question, it sends your message along with the full documentation context to the LLM via LiteLLM, which generates a grounded response.
+
+### API Endpoint
+
+```
+POST /api/v1/help/chat
+Content-Type: application/json
+
+{
+  "message": "How do I add a repository?",
+  "history": [
+    { "role": "user", "content": "previous question" },
+    { "role": "assistant", "content": "previous answer" }
+  ]
+}
+```
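
As a hedged illustration, a client could build the request above with Python's standard library as shown below. The base URL assumes the agent's REST API is reachable on port 3001 as described in the README; the helper name is hypothetical, and sending an empty `history` for a first message is an assumption. The response format is not documented here, so the sketch does not parse it.

```python
import json
import urllib.request


def build_help_chat_request(message, history=None, base_url="http://localhost:3001"):
    """Build (but do not send) a POST request for the help chat endpoint."""
    payload = {"message": message, "history": history or []}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/help/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running agent:
#   req = build_help_chat_request("How do I add a repository?")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

For a follow-up question, pass the earlier turns as `history` using the same `role`/`content` objects shown in the request body above.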
+
+### Configuration
+
+The help chat uses the same LiteLLM configuration as other LLM features:
+
+| Environment Variable | Description | Default |
+|---------------------|-------------|---------|
+| `LITELLM_URL` | LiteLLM API base URL | `http://localhost:4000` |
+| `LITELLM_MODEL` | Model for chat responses | `gpt-4o` |
+| `LITELLM_API_KEY` | API key (optional) | -- |
+
+### Documentation Sources
+
+The assistant indexes the following documentation at startup:
+
+- `README.md` -- Project overview and quick start
+- `docs/guide/` -- Getting started, repositories, findings, SBOM, scanning, issues, webhooks
+- `docs/features/` -- AI Chat, DAST, Code Graph, MCP Server, Pentesting, Help Chat
+- `docs/reference/` -- Glossary, tools reference
+
+If documentation files are not found at startup (e.g., in a minimal Docker deployment), the assistant falls back to general knowledge about the project.
@@ -1,8 +1,6 @@
 # Dashboard Overview

-The Overview page is the landing page of Certifai. It gives you a high-level view of your security posture across all tracked repositories.
-
-![Dashboard overview with stats cards, severity distribution, AI chat, and MCP servers](/screenshots/dashboard-overview.png)
+The Overview page is the landing page of the Compliance Scanner. It gives you a high-level view of your security posture across all tracked repositories.

 ## Stats Cards

@@ -34,6 +32,10 @@ The overview includes quick-access cards for the AI Chat feature. Each card repr

 If you have MCP servers registered, they appear on the overview page with their status and connection details. This lets you quickly check that your MCP integrations are running. See [MCP Integration](/features/mcp-server) for details.

+## Help Chat Assistant
+
+A floating help chat button is available in the bottom-right corner of every page. Click it to ask questions about the Compliance Scanner -- how to configure repositories, understand findings, set up webhooks, or use any feature. The assistant is grounded in the project documentation and uses LiteLLM for responses.
+
 ## Recent Scan Runs

 The bottom section lists the most recent scan runs across all repositories, showing: