docs: update README and add help-chat, deduplication docs

README.md:
- Add DAST, pentesting, code graph, AI chat, MCP, help chat to features table
- Add Gitea to tracker list, multi-language LLM triage note
- Update architecture diagram with all 5 workspace crates
- Add new API endpoints (graph, DAST, chat, help, pentest)
- Update dashboard pages table (remove Settings, add 6 new pages)
- Update project structure with new directories
- Add Keycloak, Chromium to external services

New docs:
- docs/features/help-chat.md: Help chat assistant usage, API, config
- docs/features/deduplication.md: Finding dedup across SAST, DAST, PR, issues

Updated:
- docs/features/overview.md: Add help chat section, update tracker list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 09:49:11 +02:00
parent 263a4e654a
commit 4d7efea683
4 changed files with 203 additions and 46 deletions
@@ -0,0 +1,61 @@
+# Finding Deduplication
+
+The Compliance Scanner automatically deduplicates findings across all scanning surfaces to prevent noise and duplicate issues.
+
+## SAST Finding Dedup
+
+Static analysis findings are deduplicated using SHA-256 fingerprints computed from:
+
+- Repository ID
+- Scanner rule ID (e.g., Semgrep check ID)
+- File path
+- Line number
+
+Before inserting a new finding, the pipeline checks whether a finding with the same fingerprint already exists. If one does, the new finding is skipped.
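
The fingerprint check can be illustrated with a minimal Python sketch. The function names, the dictionary keys, and the field separator are assumptions made for illustration; only the four inputs and the use of SHA-256 come from the description above.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """Hash the four identifying fields into a hex SHA-256 digest."""
    # Assumption: fields are joined with a unit separator so that
    # ("a", "bc") and ("ab", "c") cannot collide.
    material = "\x1f".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def insert_if_new(finding: dict, seen: set) -> bool:
    """Return True if the finding was stored, False if it was skipped."""
    fp = sast_fingerprint(
        finding["repo_id"], finding["rule_id"], finding["file_path"], finding["line"]
    )
    if fp in seen:
        return False  # same fingerprint already stored: skip
    seen.add(fp)
    return True
```

Here `seen` stands in for whatever lookup the real pipeline performs against its database.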
+
+## DAST / Pentest Finding Dedup
+
+Dynamic testing findings go through two-phase deduplication:
+
+### Phase 1: Exact Dedup
+
+Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from duplicate findings is combined into a single finding, keeping the highest severity.
+
+**Title canonicalization** handles common variations:
+- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
+- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
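
A rough Python sketch of Phase 1 follows. The regular expressions, the lowercasing step, the severity names, and the dictionary keys are all assumptions; the real canonicalization rules may differ.

```python
import re

# Assumption: a small synonym table; the source names only these two pairs.
SYNONYMS = {"hsts": "strict-transport-security", "csp": "content-security-policy"}
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)
TRAILING_PREP_RE = re.compile(r"\s+(?:for|on|at|in)$")


def canonicalize_title(title: str) -> str:
    """Strip URLs and domains, lowercase, and resolve known synonyms."""
    t = URL_RE.sub("", title)
    t = DOMAIN_RE.sub("", t)
    t = " ".join(t.lower().split())
    t = TRAILING_PREP_RE.sub("", t)  # drop a dangling "for" / "on"
    return " ".join(SYNONYMS.get(word, word) for word in t.split())


def exact_dedup(findings: list) -> list:
    """Merge findings that share canonical title, endpoint, and HTTP method."""
    merged = {}
    for f in findings:
        key = (canonicalize_title(f["title"]), f["endpoint"], f["method"].upper())
        if key not in merged:
            merged[key] = {**f, "evidence": list(f["evidence"])}
            continue
        kept = merged[key]
        kept["evidence"].extend(f["evidence"])  # combine evidence
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = f["severity"]  # keep the highest severity
    return list(merged.values())
```

With these rules, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header on https://example.com/login" reduce to the same canonical title.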
+
+### Phase 2: CWE-Based Dedup
+
+After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both `security_header_missing` and `tls_misconfiguration`).
+
+The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
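
The selection order can be expressed as a tuple sort key. The sketch below is illustrative only: the field names and severity names are assumptions, and leaving findings without a CWE untouched is a guess, since the text does not say how they are handled.

```python
from collections import defaultdict

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def cwe_dedup(findings: list) -> list:
    """Merge findings that share a CWE and an endpoint."""
    groups = defaultdict(list)
    result = []
    for f in findings:
        if f.get("cwe"):
            groups[(f["cwe"], f["endpoint"])].append(f)
        else:
            result.append(f)  # assumption: no CWE means no CWE-based merge

    for group in groups.values():
        # Highest severity first, then most evidence, then longest description.
        primary = max(
            group,
            key=lambda f: (
                SEVERITY_RANK[f["severity"]],
                len(f["evidence"]),
                len(f["description"]),
            ),
        )
        # Evidence from every merged finding is preserved on the primary.
        evidence = [item for f in group for item in f["evidence"]]
        result.append({**primary, "evidence": evidence})
    return result
```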
+
+### When Dedup Applies
+
+- **At insertion time**: During a pentest session, before each finding is stored in MongoDB
+- **At report export**: When generating a pentest report, all session findings are deduplicated before rendering
+
+## PR Review Comment Dedup
+
+PR review comments are deduplicated to prevent posting the same finding multiple times:
+
+- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
+- Within a single review run, duplicate findings are skipped
+- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
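
One way to embed and recover such a fingerprint is sketched below. The marker format `<!-- fingerprint:... -->`, the use of SHA-256 for PR comments, and all function names are hypothetical; the text above only says that the fingerprint is placed in an HTML comment.

```python
import hashlib
import re

# Hypothetical marker format; the real one is not documented here.
MARKER_RE = re.compile(r"<!-- fingerprint:([0-9a-f]{64}) -->")


def comment_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    """Hash the five identifying fields (SHA-256 is an assumption)."""
    material = "\x1f".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def render_comment(body: str, fingerprint: str) -> str:
    """Append the fingerprint as an invisible HTML comment."""
    return f"{body}\n<!-- fingerprint:{fingerprint} -->"


def extract_fingerprints(review_text: str) -> set:
    """Collect every fingerprint already present in posted review text."""
    return set(MARKER_RE.findall(review_text))
```

A later run could call `extract_fingerprints` on existing review bodies and skip any finding whose fingerprint is already present.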
+
+## Issue Tracker Dedup
+
+Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:
+
+1. Searches for an existing issue matching the finding's fingerprint
+2. Falls back to searching by issue title
+3. Skips creation if a match is found
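
The lookup order above can be sketched against an in-memory list of issues. The `issues` list stands in for a tracker search API, and matching the fingerprint inside the issue body is an assumption about where it is stored.

```python
from typing import Optional


def find_existing_issue(issues: list, fingerprint: str, title: str) -> Optional[dict]:
    """Search by fingerprint first, then fall back to an exact title match."""
    for issue in issues:
        if fingerprint in issue["body"]:
            return issue
    for issue in issues:
        if issue["title"] == title:
            return issue
    return None


def create_issue_if_new(issues: list, fingerprint: str, title: str, body: str) -> bool:
    """Return True if an issue was created, False if a match was found."""
    if find_existing_issue(issues, fingerprint, title) is not None:
        return False  # skip creation
    issues.append({"title": title, "body": f"{body}\n\nFingerprint: {fingerprint}"})
    return True
```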
+
+## Code Review Dedup
+
+Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:
+
+- Findings within 3 lines of each other in the same file with similar normalized titles are considered duplicates
+- The finding with the highest severity is kept
+- CWE information is merged from duplicates
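
The sketch below approximates these rules with a pairwise comparison rather than real proximity-aware keys. Treating "similar" as "equal after normalization", treating "within 3 lines" as a distance of at most 3, and the field names are all simplifying assumptions.

```python
import re

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def is_duplicate(a: dict, b: dict, max_distance: int = 3) -> bool:
    return (
        a["file"] == b["file"]
        and abs(a["line"] - b["line"]) <= max_distance
        and normalize_title(a["title"]) == normalize_title(b["title"])
    )


def dedup_review_findings(findings: list) -> list:
    kept = []
    for f in findings:
        match = next((k for k in kept if is_duplicate(k, f)), None)
        if match is None:
            kept.append({**f, "cwes": set(f.get("cwes", ()))})
            continue
        match["cwes"] |= set(f.get("cwes", ()))  # merge CWE information
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[match["severity"]]:
            merged_cwes = match["cwes"]
            match.update(f)  # the higher-severity finding wins
            match["cwes"] = merged_cwes
    return kept
```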
@@ -0,0 +1,60 @@
+# Help Chat Assistant
+
+The Help Chat is a floating assistant available on every page of the dashboard. It answers questions about the Compliance Scanner using the project documentation as its knowledge base.
+
+## How It Works
+
+1. Click the **?** button in the bottom-right corner of any page
+2. Type your question and press Enter
+3. The assistant responds with answers grounded in the project documentation
+
+The chat supports multi-turn conversations -- you can ask follow-up questions, and the assistant keeps the context of the conversation.
+
+## What You Can Ask
+
+- **Getting started**: "How do I add a repository?" / "How do I trigger a scan?"
+- **Features**: "What is SBOM?" / "How does the code knowledge graph work?"
+- **Configuration**: "How do I set up webhooks?" / "What environment variables are needed?"
+- **Scanning**: "What does the scan pipeline do?" / "How does LLM triage work?"
+- **DAST & Pentesting**: "How do I run a pentest?" / "What DAST tools are available?"
+- **Integrations**: "How do I connect to GitHub?" / "What is MCP?"
+
+## Technical Details
+
+The help chat loads all project documentation (README, guides, feature docs, reference) at startup and caches it in memory. When you ask a question, it sends your message together with the full documentation context through LiteLLM to the configured LLM, which generates a response grounded in that documentation.
+
+### API Endpoint
+
+```
+POST /api/v1/help/chat
+Content-Type: application/json
+
+{
+  "message": "How do I add a repository?",
+  "history": [
+    { "role": "user", "content": "previous question" },
+    { "role": "assistant", "content": "previous answer" }
+  ]
+}
+```
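
A request of this shape could be built from Python with the standard library, as in the sketch below. The base URL in the usage note is a placeholder, and the response schema is not documented here, so the sketch returns the raw response text instead of assuming field names.

```python
import json
import urllib.request


def build_help_chat_request(base_url: str, message: str, history: list) -> urllib.request.Request:
    """Build a POST request carrying the message and prior turns."""
    payload = {"message": message, "history": history}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/help/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `send(build_help_chat_request("http://localhost:3000", "How do I add a repository?", []))` would post a first question with an empty history, assuming the API is reachable at that placeholder address.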
+
+### Configuration
+
+The help chat uses the same LiteLLM configuration as other LLM features:
+
+| Environment Variable | Description | Default |
+|---------------------|-------------|---------|
+| `LITELLM_URL` | LiteLLM API base URL | `http://localhost:4000` |
+| `LITELLM_MODEL` | Model for chat responses | `gpt-4o` |
+| `LITELLM_API_KEY` | API key (optional) | -- |
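
As an illustration of how these variables and defaults resolve, here is a small Python sketch. The `LiteLLMConfig` class and `load_config` function are hypothetical names, not part of the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LiteLLMConfig:
    url: str
    model: str
    api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> LiteLLMConfig:
    """Read the three variables, applying the documented defaults."""
    return LiteLLMConfig(
        url=env.get("LITELLM_URL", "http://localhost:4000"),
        model=env.get("LITELLM_MODEL", "gpt-4o"),
        # The API key is optional; an empty value is treated as unset.
        api_key=env.get("LITELLM_API_KEY") or None,
    )
```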
+
+### Documentation Sources
+
+The assistant loads the following documentation at startup:
+
+- `README.md` -- Project overview and quick start
+- `docs/guide/` -- Getting started, repositories, findings, SBOM, scanning, issues, webhooks
+- `docs/features/` -- AI Chat, DAST, Code Graph, MCP Server, Pentesting, Help Chat
+- `docs/reference/` -- Glossary, tools reference
+
+If documentation files are not found at startup (e.g., in a minimal Docker deployment), the assistant falls back to general knowledge about the project.
@@ -1,8 +1,6 @@
 # Dashboard Overview

-The Overview page is the landing page of Certifai. It gives you a high-level view of your security posture across all tracked repositories.
-
-![Dashboard overview with stats cards, severity distribution, AI chat, and MCP servers](/screenshots/dashboard-overview.png)
+The Overview page is the landing page of the Compliance Scanner. It gives you a high-level view of your security posture across all tracked repositories.

 ## Stats Cards

@@ -34,6 +32,10 @@ The overview includes quick-access cards for the AI Chat feature. Each card repr

 If you have MCP servers registered, they appear on the overview page with their status and connection details. This lets you quickly check that your MCP integrations are running. See [MCP Integration](/features/mcp-server) for details.

+## Help Chat Assistant
+
+A floating help chat button is available in the bottom-right corner of every page. Click it to ask questions about the Compliance Scanner -- how to configure repositories, understand findings, set up webhooks, or use any feature. The assistant is grounded in the project documentation and uses LiteLLM for responses.
+
 ## Recent Scan Runs

 The bottom section lists the most recent scan runs across all repositories, showing: