docs: update README and add help-chat, deduplication docs
All checks were successful
CI / Check (pull_request) Successful in 13m17s
CI / Detect Changes (pull_request) Has been skipped
CI / Deploy Agent (pull_request) Has been skipped
CI / Deploy Dashboard (pull_request) Has been skipped
CI / Deploy Docs (pull_request) Has been skipped
CI / Deploy MCP (pull_request) Has been skipped
All checks were successful
CI / Check (pull_request) Successful in 13m17s
CI / Detect Changes (pull_request) Has been skipped
CI / Deploy Agent (pull_request) Has been skipped
CI / Deploy Dashboard (pull_request) Has been skipped
CI / Deploy Docs (pull_request) Has been skipped
CI / Deploy MCP (pull_request) Has been skipped
README.md: - Add DAST, pentesting, code graph, AI chat, MCP, help chat to features table - Add Gitea to tracker list, multi-language LLM triage note - Update architecture diagram with all 5 workspace crates - Add new API endpoints (graph, DAST, chat, help, pentest) - Update dashboard pages table (remove Settings, add 6 new pages) - Update project structure with new directories - Add Keycloak, Chromium to external services New docs: - docs/features/help-chat.md — Help chat assistant usage, API, config - docs/features/deduplication.md — Finding dedup across SAST, DAST, PR, issues Updated: - docs/features/overview.md — Add help chat section, update tracker list Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
120
README.md
120
README.md
@@ -28,9 +28,9 @@
|
||||
|
||||
## About
|
||||
|
||||
Compliance Scanner is an autonomous agent that continuously monitors git repositories for security vulnerabilities, GDPR/OAuth compliance patterns, and dependency risks. It creates issues in external trackers (GitHub/GitLab/Jira) with evidence and remediation suggestions, reviews pull requests, and exposes a Dioxus-based dashboard for visualization.
|
||||
Compliance Scanner is an autonomous agent that continuously monitors git repositories for security vulnerabilities, GDPR/OAuth compliance patterns, and dependency risks. It creates issues in external trackers (GitHub/GitLab/Jira/Gitea) with evidence and remediation suggestions, reviews pull requests with multi-pass LLM analysis, runs autonomous penetration tests, and exposes a Dioxus-based dashboard for visualization.
|
||||
|
||||
> **How it works:** The agent runs as a lazy daemon -- it only scans when new commits are detected, triggered by cron schedules or webhooks. LLM-powered triage filters out false positives and generates actionable remediation.
|
||||
> **How it works:** The agent runs as a lazy daemon -- it only scans when new commits are detected, triggered by cron schedules or webhooks. LLM-powered triage filters out false positives and generates actionable remediation with multi-language awareness.
|
||||
|
||||
## Features
|
||||
|
||||
@@ -41,31 +41,38 @@ Compliance Scanner is an autonomous agent that continuously monitors git reposit
|
||||
| **CVE Monitoring** | OSV.dev batch queries, NVD CVSS enrichment, SearXNG context |
|
||||
| **GDPR Patterns** | Detect PII logging, missing consent, hardcoded retention, missing deletion |
|
||||
| **OAuth Patterns** | Detect implicit grant, missing PKCE, token in localStorage, token in URLs |
|
||||
| **LLM Triage** | Confidence scoring via LiteLLM to filter false positives |
|
||||
| **Issue Creation** | Auto-create issues in GitHub, GitLab, or Jira with code evidence |
|
||||
| **PR Reviews** | Post security review comments on pull requests |
|
||||
| **Dashboard** | Fullstack Dioxus UI with findings, SBOM, issues, and statistics |
|
||||
| **Webhooks** | GitHub (HMAC-SHA256) and GitLab webhook receivers for push/PR events |
|
||||
| **LLM Triage** | Multi-language-aware confidence scoring (Rust, Python, Go, Java, Ruby, PHP, C++) |
|
||||
| **Issue Creation** | Auto-create issues in GitHub, GitLab, Jira, or Gitea with dedup via fingerprints |
|
||||
| **PR Reviews** | Multi-pass security review (logic, security, convention, complexity) with dedup |
|
||||
| **DAST Scanning** | Black-box security testing with endpoint discovery and parameter fuzzing |
|
||||
| **AI Pentesting** | Autonomous LLM-orchestrated penetration testing with encrypted reports |
|
||||
| **Code Graph** | Interactive code knowledge graph with impact analysis |
|
||||
| **AI Chat (RAG)** | Natural language Q&A grounded in repository source code |
|
||||
| **Help Assistant** | Documentation-grounded help chat accessible from every dashboard page |
|
||||
| **MCP Server** | Expose live security data to Claude, Cursor, and other AI tools |
|
||||
| **Dashboard** | Fullstack Dioxus UI with findings, SBOM, issues, DAST, pentest, and graph |
|
||||
| **Webhooks** | GitHub, GitLab, and Gitea webhook receivers for push/PR events |
|
||||
| **Finding Dedup** | SHA-256 fingerprint dedup for SAST, CWE-based dedup for DAST findings |
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Cargo Workspace │
|
||||
├──────────────┬──────────────────┬───────────────────────────┤
|
||||
│ compliance- │ compliance- │ compliance- │
|
||||
│ core │ agent │ dashboard │
|
||||
│ (lib) │ (bin) │ (bin, Dioxus 0.7.3) │
|
||||
│ │ │ │
|
||||
│ Models │ Scan Pipeline │ Fullstack Web UI │
|
||||
│ Traits │ LLM Client │ Server Functions │
|
||||
│ Config │ Issue Trackers │ Charts + Tables │
|
||||
│ Errors │ Scheduler │ Settings Page │
|
||||
│ │ REST API │ │
|
||||
│ │ Webhooks │ │
|
||||
└──────────────┴──────────────────┴───────────────────────────┘
|
||||
│
|
||||
MongoDB (shared)
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ Cargo Workspace │
|
||||
├──────────────┬──────────────────┬──────────────┬──────────┬─────────────┤
|
||||
│ compliance- │ compliance- │ compliance- │ complian-│ compliance- │
|
||||
│ core (lib) │ agent (bin) │ dashboard │ ce-graph │ mcp (bin) │
|
||||
│ │ │ (bin) │ (lib) │ │
|
||||
│ Models │ Scan Pipeline │ Dioxus 0.7 │ Tree- │ MCP Server │
|
||||
│ Traits │ LLM Client │ Fullstack UI │ sitter │ Live data │
|
||||
│ Config │ Issue Trackers │ Help Chat │ Graph │ for AI │
|
||||
│ Errors │ Pentest Engine │ Server Fns │ Embedds │ tools │
|
||||
│ │ DAST Tools │ │ RAG │ │
|
||||
│ │ REST API │ │ │ │
|
||||
│ │ Webhooks │ │ │ │
|
||||
└──────────────┴──────────────────┴──────────────┴──────────┴─────────────┘
|
||||
│
|
||||
MongoDB (shared)
|
||||
```
|
||||
|
||||
## Scan Pipeline (7 Stages)
|
||||
@@ -84,11 +91,16 @@ Compliance Scanner is an autonomous agent that continuously monitors git reposit
|
||||
|-------|-----------|
|
||||
| Shared Library | `compliance-core` -- models, traits, config |
|
||||
| Agent | Axum REST API, git2, tokio-cron-scheduler, Semgrep, Syft |
|
||||
| Dashboard | Dioxus 0.7.3 fullstack, Tailwind CSS |
|
||||
| Dashboard | Dioxus 0.7.3 fullstack, Tailwind CSS 4 |
|
||||
| Code Graph | `compliance-graph` -- tree-sitter parsing, embeddings, RAG |
|
||||
| MCP Server | `compliance-mcp` -- Model Context Protocol for AI tools |
|
||||
| DAST | `compliance-dast` -- dynamic application security testing |
|
||||
| Database | MongoDB with typed collections |
|
||||
| LLM | LiteLLM (OpenAI-compatible API) |
|
||||
| Issue Trackers | GitHub (octocrab), GitLab (REST v4), Jira (REST v3) |
|
||||
| LLM | LiteLLM (OpenAI-compatible API for chat, triage, embeddings) |
|
||||
| Issue Trackers | GitHub (octocrab), GitLab (REST v4), Jira (REST v3), Gitea |
|
||||
| CVE Sources | OSV.dev, NVD, SearXNG |
|
||||
| Auth | Keycloak (OAuth2/PKCE, SSO) |
|
||||
| Browser Automation | Chromium (headless, for pentesting and PDF generation) |
|
||||
|
||||
## Getting Started
|
||||
|
||||
@@ -151,20 +163,35 @@ The agent exposes a REST API on port 3001:
|
||||
| `GET` | `/api/v1/sbom` | List dependencies |
|
||||
| `GET` | `/api/v1/issues` | List cross-tracker issues |
|
||||
| `GET` | `/api/v1/scan-runs` | Scan execution history |
|
||||
| `GET` | `/api/v1/graph/:repo_id` | Code knowledge graph |
|
||||
| `POST` | `/api/v1/graph/:repo_id/build` | Trigger graph build |
|
||||
| `GET` | `/api/v1/dast/targets` | List DAST targets |
|
||||
| `POST` | `/api/v1/dast/targets` | Add DAST target |
|
||||
| `GET` | `/api/v1/dast/findings` | List DAST findings |
|
||||
| `POST` | `/api/v1/chat/:repo_id` | RAG-powered code chat |
|
||||
| `POST` | `/api/v1/help/chat` | Documentation-grounded help chat |
|
||||
| `POST` | `/api/v1/pentest/sessions` | Create pentest session |
|
||||
| `POST` | `/api/v1/pentest/sessions/:id/export` | Export encrypted pentest report |
|
||||
| `POST` | `/webhook/github` | GitHub webhook (HMAC-SHA256) |
|
||||
| `POST` | `/webhook/gitlab` | GitLab webhook (token verify) |
|
||||
| `POST` | `/webhook/gitea` | Gitea webhook |
|
||||
|
||||
## Dashboard Pages
|
||||
|
||||
| Page | Description |
|
||||
|------|-------------|
|
||||
| **Overview** | Stat cards, severity distribution chart |
|
||||
| **Repositories** | Add/manage tracked repos, trigger scans |
|
||||
| **Findings** | Filterable table by severity, type, status |
|
||||
| **Overview** | Stat cards, severity distribution, AI chat cards, MCP status |
|
||||
| **Repositories** | Add/manage tracked repos, trigger scans, webhook config |
|
||||
| **Findings** | Filterable table by severity, type, status, scanner |
|
||||
| **Finding Detail** | Code evidence, remediation, suggested fix, linked issue |
|
||||
| **SBOM** | Dependency inventory with vulnerability badges |
|
||||
| **Issues** | Cross-tracker view (GitHub + GitLab + Jira) |
|
||||
| **Settings** | Configure LiteLLM, tracker tokens, SearXNG URL |
|
||||
| **SBOM** | Dependency inventory with vulnerability badges, license summary |
|
||||
| **Issues** | Cross-tracker view (GitHub + GitLab + Jira + Gitea) |
|
||||
| **Code Graph** | Interactive architecture visualization, impact analysis |
|
||||
| **AI Chat** | RAG-powered Q&A about repository code |
|
||||
| **DAST** | Dynamic scanning targets, findings, and scan history |
|
||||
| **Pentest** | AI-driven pentest sessions, attack chain visualization |
|
||||
| **MCP Servers** | Model Context Protocol server management |
|
||||
| **Help Chat** | Floating assistant (available on every page) for product Q&A |
|
||||
|
||||
## Project Structure
|
||||
|
||||
@@ -173,19 +200,24 @@ compliance-scanner/
|
||||
├── compliance-core/ Shared library (models, traits, config, errors)
|
||||
├── compliance-agent/ Agent daemon (pipeline, LLM, trackers, API, webhooks)
|
||||
│ └── src/
|
||||
│ ├── pipeline/ 7-stage scan pipeline
|
||||
│ ├── llm/ LiteLLM client, triage, descriptions, fixes, PR review
|
||||
│ ├── trackers/ GitHub, GitLab, Jira integrations
|
||||
│ ├── api/ REST API (Axum)
|
||||
│ └── webhooks/ GitHub + GitLab webhook receivers
|
||||
│ ├── pipeline/ 7-stage scan pipeline, dedup, PR reviews, code review
|
||||
│ ├── llm/ LiteLLM client, triage, descriptions, fixes, review prompts
|
||||
│ ├── trackers/ GitHub, GitLab, Jira, Gitea integrations
|
||||
│ ├── pentest/ AI-driven pentest orchestrator, tools, reports
|
||||
│ ├── rag/ RAG pipeline, chunking, embedding
|
||||
│ ├── api/ REST API (Axum), help chat
|
||||
│ └── webhooks/ GitHub, GitLab, Gitea webhook receivers
|
||||
├── compliance-dashboard/ Dioxus fullstack dashboard
|
||||
│ └── src/
|
||||
│ ├── components/ Reusable UI components
|
||||
│ ├── infrastructure/ Server functions, DB, config
|
||||
│ └── pages/ Full page views
|
||||
│ ├── components/ Reusable UI (sidebar, help chat, attack chain, etc.)
|
||||
│ ├── infrastructure/ Server functions, DB, config, auth
|
||||
│ └── pages/ Full page views (overview, DAST, pentest, graph, etc.)
|
||||
├── compliance-graph/ Code knowledge graph (tree-sitter, embeddings, RAG)
|
||||
├── compliance-dast/ Dynamic application security testing
|
||||
├── compliance-mcp/ Model Context Protocol server
|
||||
├── docs/ VitePress documentation site
|
||||
├── assets/ Static assets (CSS, icons)
|
||||
├── styles/ Tailwind input stylesheet
|
||||
└── bin/ Dashboard binary entrypoint
|
||||
└── styles/ Tailwind input stylesheet
|
||||
```
|
||||
|
||||
## External Services
|
||||
@@ -193,10 +225,12 @@ compliance-scanner/
|
||||
| Service | Purpose | Default URL |
|
||||
|---------|---------|-------------|
|
||||
| MongoDB | Persistence | `mongodb://localhost:27017` |
|
||||
| LiteLLM | LLM proxy for triage and generation | `http://localhost:4000` |
|
||||
| LiteLLM | LLM proxy (chat, triage, embeddings) | `http://localhost:4000` |
|
||||
| SearXNG | CVE context search | `http://localhost:8888` |
|
||||
| Keycloak | Authentication (OAuth2/PKCE, SSO) | `http://localhost:8080` |
|
||||
| Semgrep | SAST scanning | CLI tool |
|
||||
| Syft | SBOM generation | CLI tool |
|
||||
| Chromium | Headless browser (pentesting, PDF) | Managed via Docker |
|
||||
|
||||
---
|
||||
|
||||
|
||||
61
docs/features/deduplication.md
Normal file
61
docs/features/deduplication.md
Normal file
@@ -0,0 +1,61 @@
|
||||
# Finding Deduplication
|
||||
|
||||
The Compliance Scanner automatically deduplicates findings across all scanning surfaces to prevent noise and duplicate issues.
|
||||
|
||||
## SAST Finding Dedup
|
||||
|
||||
Static analysis findings are deduplicated using SHA-256 fingerprints computed from:
|
||||
|
||||
- Repository ID
|
||||
- Scanner rule ID (e.g., Semgrep check ID)
|
||||
- File path
|
||||
- Line number
|
||||
|
||||
Before inserting a new finding, the pipeline checks if a finding with the same fingerprint already exists. If it does, the finding is skipped.
|
||||
|
||||
## DAST / Pentest Finding Dedup
|
||||
|
||||
Dynamic testing findings go through two-phase deduplication:
|
||||
|
||||
### Phase 1: Exact Dedup
|
||||
|
||||
Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from duplicate findings is combined into a single finding, keeping the highest severity.
|
||||
|
||||
**Title canonicalization** handles common variations:
|
||||
- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
|
||||
- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
|
||||
|
||||
### Phase 2: CWE-Based Dedup
|
||||
|
||||
After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both `security_header_missing` and `tls_misconfiguration`).
|
||||
|
||||
The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
|
||||
|
||||
### When Dedup Applies
|
||||
|
||||
- **At insertion time**: During a pentest session, before each finding is stored in MongoDB
|
||||
- **At report export**: When generating a pentest report, all session findings are deduplicated before rendering
|
||||
|
||||
## PR Review Comment Dedup
|
||||
|
||||
PR review comments are deduplicated to prevent posting the same finding multiple times:
|
||||
|
||||
- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
|
||||
- Within a single review run, duplicate findings are skipped
|
||||
- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
|
||||
|
||||
## Issue Tracker Dedup
|
||||
|
||||
Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:
|
||||
|
||||
1. Searches for an existing issue matching the finding's fingerprint
|
||||
2. Falls back to searching by issue title
|
||||
3. Skips creation if a match is found
|
||||
|
||||
## Code Review Dedup
|
||||
|
||||
Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:
|
||||
|
||||
- Findings within 3 lines of each other on the same file with similar normalized titles are considered duplicates
|
||||
- The finding with the highest severity is kept
|
||||
- CWE information is merged from duplicates
|
||||
60
docs/features/help-chat.md
Normal file
60
docs/features/help-chat.md
Normal file
@@ -0,0 +1,60 @@
|
||||
# Help Chat Assistant
|
||||
|
||||
The Help Chat is a floating assistant available on every page of the dashboard. It answers questions about the Compliance Scanner using the project documentation as its knowledge base.
|
||||
|
||||
## How It Works
|
||||
|
||||
1. Click the **?** button in the bottom-right corner of any page
|
||||
2. Type your question and press Enter
|
||||
3. The assistant responds with answers grounded in the project documentation
|
||||
|
||||
The chat supports multi-turn conversations -- you can ask follow-up questions and the assistant will remember the context of your conversation.
|
||||
|
||||
## What You Can Ask
|
||||
|
||||
- **Getting started**: "How do I add a repository?" / "How do I trigger a scan?"
|
||||
- **Features**: "What is SBOM?" / "How does the code knowledge graph work?"
|
||||
- **Configuration**: "How do I set up webhooks?" / "What environment variables are needed?"
|
||||
- **Scanning**: "What does the scan pipeline do?" / "How does LLM triage work?"
|
||||
- **DAST & Pentesting**: "How do I run a pentest?" / "What DAST tools are available?"
|
||||
- **Integrations**: "How do I connect to GitHub?" / "What is MCP?"
|
||||
|
||||
## Technical Details
|
||||
|
||||
The help chat loads all project documentation (README, guides, feature docs, reference) at startup and caches them in memory. When you ask a question, it sends your message along with the full documentation context to the LLM via LiteLLM, which generates a grounded response.
|
||||
|
||||
### API Endpoint
|
||||
|
||||
```
|
||||
POST /api/v1/help/chat
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"message": "How do I add a repository?",
|
||||
"history": [
|
||||
{ "role": "user", "content": "previous question" },
|
||||
{ "role": "assistant", "content": "previous answer" }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Configuration
|
||||
|
||||
The help chat uses the same LiteLLM configuration as other LLM features:
|
||||
|
||||
| Environment Variable | Description | Default |
|
||||
|---------------------|-------------|---------|
|
||||
| `LITELLM_URL` | LiteLLM API base URL | `http://localhost:4000` |
|
||||
| `LITELLM_MODEL` | Model for chat responses | `gpt-4o` |
|
||||
| `LITELLM_API_KEY` | API key (optional) | -- |
|
||||
|
||||
### Documentation Sources
|
||||
|
||||
The assistant indexes the following documentation at startup:
|
||||
|
||||
- `README.md` -- Project overview and quick start
|
||||
- `docs/guide/` -- Getting started, repositories, findings, SBOM, scanning, issues, webhooks
|
||||
- `docs/features/` -- AI Chat, DAST, Code Graph, MCP Server, Pentesting, Help Chat
|
||||
- `docs/reference/` -- Glossary, tools reference
|
||||
|
||||
If documentation files are not found at startup (e.g., in a minimal Docker deployment), the assistant falls back to general knowledge about the project.
|
||||
@@ -1,8 +1,6 @@
|
||||
# Dashboard Overview
|
||||
|
||||
The Overview page is the landing page of Certifai. It gives you a high-level view of your security posture across all tracked repositories.
|
||||
|
||||

|
||||
The Overview page is the landing page of the Compliance Scanner. It gives you a high-level view of your security posture across all tracked repositories.
|
||||
|
||||
## Stats Cards
|
||||
|
||||
@@ -34,6 +32,10 @@ The overview includes quick-access cards for the AI Chat feature. Each card repr
|
||||
|
||||
If you have MCP servers registered, they appear on the overview page with their status and connection details. This lets you quickly check that your MCP integrations are running. See [MCP Integration](/features/mcp-server) for details.
|
||||
|
||||
## Help Chat Assistant
|
||||
|
||||
A floating help chat button is available in the bottom-right corner of every page. Click it to ask questions about the Compliance Scanner -- how to configure repositories, understand findings, set up webhooks, or use any feature. The assistant is grounded in the project documentation and uses LiteLLM for responses.
|
||||
|
||||
## Recent Scan Runs
|
||||
|
||||
The bottom section lists the most recent scan runs across all repositories, showing:
|
||||
|
||||
Reference in New Issue
Block a user