compliance-scanner-agent/docs/features/deduplication.md
Sharang Parnerkar 4d7efea683
docs: update README and add help-chat, deduplication docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 09:49:11 +02:00

Finding Deduplication

The Compliance Scanner automatically deduplicates findings across all scanning surfaces to reduce noise and avoid filing duplicate issues.

SAST Finding Dedup

Static analysis findings are deduplicated using SHA-256 fingerprints computed from:

  • Repository ID
  • Scanner rule ID (e.g., Semgrep check ID)
  • File path
  • Line number

Before inserting a new finding, the pipeline checks whether a finding with the same fingerprint already exists. If one does, the new finding is skipped.
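
The fingerprint check can be sketched as follows. This is a minimal illustration, not the scanner's actual code: the field order, the `|` separator, the function names, and the in-memory `store` dict (standing in for the database lookup) are all assumptions.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """Hash the four identifying fields into a hex SHA-256 digest."""
    # Assumption: fields are joined with "|" in this order; the real encoding may differ.
    key = "|".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def insert_if_new(store: dict, finding: dict) -> bool:
    """Insert the finding unless one with the same fingerprint is already stored."""
    fp = sast_fingerprint(
        finding["repo_id"], finding["rule_id"], finding["file_path"], finding["line"]
    )
    if fp in store:
        return False  # duplicate: skip
    store[fp] = finding
    return True
```

Because the line number is part of the key in this sketch, the same rule firing on a different line of the same file produces a different fingerprint.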

DAST / Pentest Finding Dedup

Dynamic testing findings go through two-phase deduplication:

Phase 1: Exact Dedup

Findings with the same canonicalized title, endpoint, and HTTP method are merged. Their evidence is combined into a single finding, which keeps the highest severity among the duplicates.

Title canonicalization handles common variations:

  • Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
  • Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
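
A rough sketch of Phase 1, under several assumptions: the regular expressions, the lowercasing, the trailing-preposition cleanup, the severity scale, and the finding field names (`title`, `endpoint`, `method`, `severity`, `evidence`) are illustrative choices rather than the scanner's real ones. Only the two synonym pairs come from the description above.

```python
import re

# Synonym pairs taken from the examples above; the real table is likely larger.
SYNONYMS = {"hsts": "strict-transport-security", "csp": "content-security-policy"}
# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")
TRAILING_PREP_RE = re.compile(r"\s+(?:for|on|at|in)$")


def canonicalize_title(title: str) -> str:
    """Strip URLs and domain names, then resolve known synonyms."""
    t = title.lower()
    t = URL_RE.sub("", t)
    t = DOMAIN_RE.sub("", t)
    t = " ".join(t.split())
    # Drop a preposition left dangling once the domain is gone.
    t = TRAILING_PREP_RE.sub("", t)
    return " ".join(SYNONYMS.get(word, word) for word in t.split())


def exact_dedup(findings: list) -> list:
    """Merge findings that share canonical title, endpoint, and HTTP method."""
    merged = {}
    for f in findings:
        key = (canonicalize_title(f["title"]), f["endpoint"], f["method"].upper())
        if key not in merged:
            merged[key] = {**f, "evidence": list(f["evidence"])}
            continue
        kept = merged[key]
        kept["evidence"].extend(f["evidence"])  # combine evidence
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = f["severity"]  # keep the highest severity
    return list(merged.values())
```

With these rules, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header on https://example.com/login" reduce to the same canonical title, so they merge when their endpoint and method also match.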

Phase 2: CWE-Based Dedup

After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both security_header_missing and tls_misconfiguration).

The primary finding is the one with the highest severity; ties are broken by the most evidence, then by the longest description. Evidence from merged findings is preserved.
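
The Phase 2 grouping and primary selection might look like the sketch below. The field names, the severity scale, and the decision to leave findings without a CWE unmerged are assumptions made for illustration.

```python
# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def primary_sort_key(finding: dict) -> tuple:
    """Order by severity, then amount of evidence, then description length."""
    return (
        SEVERITY_RANK[finding["severity"]],
        len(finding["evidence"]),
        len(finding["description"]),
    )


def cwe_dedup(findings: list) -> list:
    """Merge findings that share a CWE and an endpoint."""
    groups = {}
    passthrough = []
    for f in findings:
        if f.get("cwe") is None:
            # Assumption: findings without a CWE are not merged in this phase.
            passthrough.append(f)
        else:
            groups.setdefault((f["cwe"], f["endpoint"]), []).append(f)

    result = []
    for group in groups.values():
        primary = max(group, key=primary_sort_key)
        # Preserve evidence from every finding in the group.
        evidence = [item for f in group for item in f["evidence"]]
        result.append({**primary, "evidence": evidence})
    return result + passthrough
```

Comparing tuples gives the tie-breaking order directly: evidence count only matters when severities are equal, and description length only when both are equal.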

When Dedup Applies

  • At insertion time: During a pentest session, before each finding is stored in MongoDB
  • At report export: When generating a pentest report, all session findings are deduplicated before rendering

PR Review Comment Dedup

PR review comments are deduplicated to prevent posting the same finding multiple times:

  • Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
  • Within a single review run, duplicate findings are skipped
  • The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
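
As an illustration of the fingerprint-in-a-comment idea, here is a small sketch. The hash algorithm, the `<!-- fingerprint:... -->` marker format, and all function and field names are hypothetical; the source does not specify them. The `already_posted` parameter shows how a later run could read the markers back, which the text above describes only as a future use.

```python
import hashlib
import re

# Hypothetical marker format for the embedded fingerprint.
MARKER_RE = re.compile(r"<!-- fingerprint:([0-9a-f]{64}) -->")


def comment_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    """Hash the fields that identify a review comment (SHA-256 is an assumption)."""
    key = "|".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def render_comment(body: str, fingerprint: str) -> str:
    """Append the fingerprint as an invisible HTML comment."""
    return f"{body}\n\n<!-- fingerprint:{fingerprint} -->"


def extract_fingerprints(text: str) -> set:
    """Recover fingerprints from previously rendered review text."""
    return set(MARKER_RE.findall(text))


def new_comments(findings: list, repo: str, pr_number: int, already_posted: str = "") -> list:
    """Render comments, skipping duplicates within the run and any already posted."""
    seen = extract_fingerprints(already_posted)
    comments = []
    for f in findings:
        fp = comment_fingerprint(repo, pr_number, f["file"], f["line"], f["title"])
        if fp in seen:
            continue
        seen.add(fp)
        comments.append(render_comment(f["title"], fp))
    return comments
```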

Issue Tracker Dedup

Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:

  1. Searches for an existing issue matching the finding's fingerprint
  2. If no match is found, falls back to searching by issue title
  3. Skips creation if either search finds a match
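
The three steps can be sketched against a toy tracker. `InMemoryTracker` and `create_issue_if_absent` are hypothetical names, the substring search stands in for each tracker's real search API, and storing the fingerprint in the issue body is an assumption.

```python
from typing import Optional


class InMemoryTracker:
    """Toy stand-in for a GitHub, GitLab, Jira, or Gitea client."""

    def __init__(self) -> None:
        self.issues = []

    def search(self, query: str) -> list:
        # Simplified search: substring match on title or body.
        return [i for i in self.issues if query in i["title"] or query in i["body"]]

    def create(self, title: str, body: str) -> dict:
        issue = {"title": title, "body": body}
        self.issues.append(issue)
        return issue


def create_issue_if_absent(tracker: InMemoryTracker, title: str, fingerprint: str) -> Optional[dict]:
    """Create an issue only if neither the fingerprint nor the title is found."""
    # Step 1: look for an existing issue carrying the fingerprint.
    if tracker.search(fingerprint):
        return None
    # Step 2: fall back to an exact title match.
    if any(issue["title"] == title for issue in tracker.search(title)):
        return None
    # Step 3: no match, so create the issue and record the fingerprint in it.
    return tracker.create(title, f"{title}\n\nFingerprint: {fingerprint}")
```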

Code Review Dedup

Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:

  • Findings within 3 lines of each other in the same file with similar normalized titles are considered duplicates
  • The finding with the highest severity is kept
  • CWE information is merged from duplicates
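
One way to express these rules is the sketch below. It treats "similar" as equality after normalization and compares findings pairwise instead of building keys; both are simplifications, and the field names and severity scale are assumptions.

```python
import re

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
LINE_WINDOW = 3  # from the rule above: within 3 lines


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def is_duplicate(a: dict, b: dict) -> bool:
    """Same file, nearby lines, and matching normalized titles."""
    return (
        a["file"] == b["file"]
        and abs(a["line"] - b["line"]) <= LINE_WINDOW
        and normalize_title(a["title"]) == normalize_title(b["title"])
    )


def dedup_reviews(findings: list) -> list:
    """Collapse duplicates across review passes, keeping the most severe."""
    kept = []
    for f in findings:
        idx = next((i for i, k in enumerate(kept) if is_duplicate(k, f)), None)
        if idx is None:
            kept.append(dict(f))
            continue
        current = kept[idx]
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[current["severity"]]:
            winner, loser = f, current
        else:
            winner, loser = current, f
        # Carry over the CWE from the duplicate if the winner lacks one.
        kept[idx] = {**winner, "cwe": winner.get("cwe") or loser.get("cwe")}
    return kept
```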