Finding Deduplication
The Compliance Scanner automatically deduplicates findings across all scanning surfaces to prevent noise and duplicate issues.
SAST Finding Dedup
Static analysis findings are deduplicated using SHA-256 fingerprints computed from:
- Repository ID
- Scanner rule ID (e.g., Semgrep check ID)
- File path
- Line number
Before inserting a new finding, the pipeline checks if a finding with the same fingerprint already exists. If it does, the finding is skipped.
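The fingerprint check above can be sketched as follows. This is a minimal illustration, not the scanner's actual code: the function and class names, the field separator, and the in-memory store (standing in for the database lookup) are all assumptions.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """SHA-256 over the four identifying fields listed above."""
    # Assumption: fields are joined with a unit separator so that
    # ("ab", "c") and ("a", "bc") cannot produce the same payload.
    payload = "\x1f".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FindingStore:
    """Hypothetical in-memory stand-in for the findings collection."""

    def __init__(self) -> None:
        self._by_fingerprint: dict = {}

    def insert_if_new(self, finding: dict) -> bool:
        """Insert the finding unless its fingerprint is already stored."""
        fp = sast_fingerprint(
            finding["repo_id"],
            finding["rule_id"],
            finding["file_path"],
            finding["line"],
        )
        if fp in self._by_fingerprint:
            return False  # duplicate: skip
        self._by_fingerprint[fp] = finding
        return True
```

Because the line number is part of the fingerprint, the same rule firing on a different line of the same file counts as a new finding in this sketch.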
DAST / Pentest Finding Dedup
Dynamic testing findings go through two-phase deduplication:
Phase 1: Exact Dedup
Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from duplicate findings is combined into a single finding, keeping the highest severity.
Title canonicalization handles common variations:
- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
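As a rough sketch of Phase 1, the code below canonicalizes titles and merges findings that share a key. The regular expressions, the severity scale, the trailing-preposition cleanup, and the dictionary field names are illustrative assumptions; only the two synonym mappings come from the description above.

```python
import re

# Assumed severity scale; the scanner's real levels may differ.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
SYNONYMS = {"hsts": "strict-transport-security", "csp": "content-security-policy"}

URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")
TRAILING_PREP_RE = re.compile(r"\s+(?:for|on|at|in)$")


def canonicalize_title(title: str) -> str:
    """Lowercase, strip URLs and domain names, resolve known synonyms."""
    t = title.lower()
    t = URL_RE.sub(" ", t)
    t = DOMAIN_RE.sub(" ", t)
    t = " ".join(t.split())
    # Drop a preposition left dangling after the domain was removed.
    t = TRAILING_PREP_RE.sub("", t)
    return " ".join(SYNONYMS.get(word, word) for word in t.split())


def exact_dedup(findings: list) -> list:
    """Merge findings sharing canonical title, endpoint, and HTTP method."""
    merged: dict = {}
    for f in findings:
        key = (canonicalize_title(f["title"]), f["endpoint"], f["method"].upper())
        if key not in merged:
            merged[key] = {**f, "evidence": list(f["evidence"])}
            continue
        kept = merged[key]
        # Combine evidence without repeating identical entries.
        kept["evidence"].extend(e for e in f["evidence"] if e not in kept["evidence"])
        # Keep the highest severity seen for this key.
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = f["severity"]
    return list(merged.values())
```

With these rules, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header on https://example.com/login" reduce to the same canonical title. The domain pattern here is deliberately crude and would also strip file names such as config.yaml, so a real implementation likely needs something more careful.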
Phase 2: CWE-Based Dedup
After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both security_header_missing and tls_misconfiguration).
The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
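The selection order described above maps naturally onto a tuple sort key. The sketch below assumes dictionary-shaped findings, a five-level severity scale, and that findings without a CWE are left unmerged; none of these details are confirmed by the scanner's source, and the CWE value in the usage note is purely illustrative.

```python
# Assumed severity scale; the scanner's real levels may differ.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def primary_sort_key(finding: dict) -> tuple:
    """Highest severity first, then most evidence, then longest description."""
    return (
        SEVERITY_RANK[finding["severity"]],
        len(finding["evidence"]),
        len(finding["description"]),
    )


def cwe_dedup(findings: list) -> list:
    """Merge findings that share the same CWE and endpoint."""
    groups: dict = {}
    unmerged = []
    for f in findings:
        if f.get("cwe") is None:
            # Assumption: a finding without a CWE cannot be grouped.
            unmerged.append(f)
        else:
            groups.setdefault((f["cwe"], f["endpoint"]), []).append(f)

    result = []
    for group in groups.values():
        primary = max(group, key=primary_sort_key)
        evidence = list(primary["evidence"])
        for other in group:
            if other is primary:
                continue
            # Preserve evidence from the findings merged into the primary.
            evidence.extend(e for e in other["evidence"] if e not in evidence)
        result.append({**primary, "evidence": evidence})
    return result + unmerged
```

For example, a security_header_missing finding and a tls_misconfiguration finding that both carry the same CWE on the same endpoint collapse into one entry, and the one with more evidence becomes the primary when severities tie.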
When Dedup Applies
- At insertion time: During a pentest session, before each finding is stored in MongoDB
- At report export: When generating a pentest report, all session findings are deduplicated before rendering
PR Review Comment Dedup
PR review comments are deduplicated to prevent posting the same finding multiple times:
- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
- Within a single review run, duplicate findings are skipped
- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
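One way to picture the mechanism above is the sketch below. The hash function (SHA-256), the marker format `finding-fingerprint:`, and the function names are hypothetical; the source only states which fields feed the fingerprint and that it is embedded as an HTML comment.

```python
import hashlib
import re

# Hypothetical marker format; the real scanner may use a different one.
MARKER_RE = re.compile(r"<!-- finding-fingerprint:([0-9a-f]{64}) -->")


def pr_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    """Hash the five fields listed above (SHA-256 is an assumption here)."""
    payload = "\x1f".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_review_body(repo: str, pr_number: int, findings: list) -> str:
    """Render one line per unique finding, each followed by a hidden marker."""
    seen = set()
    lines = []
    for f in findings:
        fp = pr_fingerprint(repo, pr_number, f["file_path"], f["line"], f["title"])
        if fp in seen:
            continue  # duplicate within this review run
        seen.add(fp)
        lines.append(f"- `{f['file_path']}:{f['line']}` {f['title']}")
        lines.append(f"<!-- finding-fingerprint:{fp} -->")
    return "\n".join(lines)


def fingerprints_in(body: str) -> set:
    """Recover fingerprints from a previously posted review body."""
    return set(MARKER_RE.findall(body))
```

HTML comments are not rendered in the review, so the markers stay invisible to readers while remaining recoverable with `fingerprints_in` when a later run inspects existing reviews.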
Issue Tracker Dedup
Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:
- Searches for an existing issue matching the finding's fingerprint
- Falls back to searching by issue title
- Skips creation if a match is found
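The lookup order above might look like the following sketch. `InMemoryTracker` is a hypothetical stand-in for a real tracker client, and storing the fingerprint in the issue body is an assumption about where it is kept so that it can be searched later.

```python
from typing import Optional


class InMemoryTracker:
    """Hypothetical tracker with a substring search over title and body."""

    def __init__(self) -> None:
        self.issues: list = []

    def search(self, query: str) -> list:
        return [i for i in self.issues if query in i["title"] or query in i["body"]]

    def create(self, title: str, body: str) -> dict:
        issue = {"id": len(self.issues) + 1, "title": title, "body": body}
        self.issues.append(issue)
        return issue


def create_issue_if_missing(
    tracker: InMemoryTracker, title: str, body: str, fingerprint: str
) -> Optional[dict]:
    """Return the created issue, or None when a matching issue already exists."""
    # 1. Prefer the fingerprint: it survives title edits.
    if tracker.search(fingerprint):
        return None
    # 2. Fall back to an exact title match.
    if any(issue["title"] == title for issue in tracker.search(title)):
        return None
    # 3. No match: create the issue and record the fingerprint in its body.
    return tracker.create(title, f"{body}\n\nFingerprint: {fingerprint}")
```

Checking the fingerprint first means a renamed issue is still recognized, while the title fallback covers issues that carry no fingerprint.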
Code Review Dedup
Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:
- Findings in the same file that are within 3 lines of each other and have similar normalized titles are considered duplicates
- The finding with the highest severity is kept
- CWE information is merged from duplicates
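The rules above can be approximated with a pairwise comparison, sketched below. The similarity measure (`difflib.SequenceMatcher` with a 0.8 threshold), the title normalization, the severity scale, and the single-valued `cwe` field are all assumptions; the scanner's actual proximity-aware keys may be built differently.

```python
import re
from difflib import SequenceMatcher

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
LINE_WINDOW = 3          # from the description above
TITLE_SIMILARITY = 0.8   # assumed threshold for "similar" titles


def normalize_title(title: str) -> str:
    """Lowercase and drop punctuation so cosmetic differences are ignored."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def is_duplicate(a: dict, b: dict) -> bool:
    if a["file"] != b["file"] or abs(a["line"] - b["line"]) > LINE_WINDOW:
        return False
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= TITLE_SIMILARITY


def dedup_review_findings(findings: list) -> list:
    """Collapse near-duplicates across review passes."""
    kept: list = []
    for f in findings:
        for idx, existing in enumerate(kept):
            if not is_duplicate(existing, f):
                continue
            # Keep the higher-severity finding; on ties keep the earlier one.
            if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[existing["severity"]]:
                winner, loser = f, existing
            else:
                winner, loser = existing, f
            # Merge CWE information: fill it in from the dropped finding.
            kept[idx] = {**winner, "cwe": winner.get("cwe") or loser.get("cwe")}
            break
        else:
            kept.append(dict(f))
    return kept
```

A pairwise check avoids the boundary problem of bucketing by line number, where findings on lines 3 and 4 could land in different buckets despite being adjacent; the cost is quadratic comparisons, which is usually acceptable for the handful of findings in one review.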