compliance-scanner-agent/docs/features/deduplication.md

# Finding Deduplication

The Compliance Scanner automatically deduplicates findings across all scanning surfaces to reduce noise and avoid creating duplicate issues.

## SAST Finding Dedup

Static analysis findings are deduplicated using SHA-256 fingerprints computed from:

- Repository ID
- Scanner rule ID (e.g., Semgrep check ID)
- File path
- Line number

Before inserting a new finding, the pipeline checks whether a finding with the same fingerprint already exists. If one does, the new finding is skipped.
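
As a rough illustration of the fingerprint check, here is a minimal Python sketch. The field order, the separator, the dictionary keys, and the in-memory `store` are assumptions made for the example; they are not taken from the scanner's code.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """Hash the four identifying fields into a hex SHA-256 digest."""
    # Assumption: fields are joined with a unit-separator character so that
    # ("ab", "c") and ("a", "bc") cannot produce the same input string.
    material = "\x1f".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def insert_if_new(store: dict, finding: dict) -> bool:
    """Insert the finding unless its fingerprint is already present.

    `store` stands in for the findings collection (hypothetical).
    Returns True when the finding was inserted, False when it was skipped.
    """
    fp = sast_fingerprint(
        finding["repo_id"], finding["rule_id"], finding["file_path"], finding["line"]
    )
    if fp in store:
        return False
    store[fp] = finding
    return True
```

Because the line number is part of the hash in this sketch, the same rule firing on a different line of the same file yields a different fingerprint and is stored as a separate finding.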

## DAST / Pentest Finding Dedup

Dynamic testing findings go through two-phase deduplication:

### Phase 1: Exact Dedup

Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from the duplicates is combined into a single finding, which keeps the highest severity among them.

**Title canonicalization** handles common variations:
- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
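
The exact-dedup phase could look roughly like the Python sketch below. The regular expressions, the severity names, the dictionary keys, and the trailing-preposition cleanup are assumptions for illustration; only the two synonym entries come from the description above.

```python
import re

# Only these two synonym entries are documented; a real table is likely larger.
SYNONYMS = {"hsts": "strict-transport-security", "csp": "content-security-policy"}
# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

URL_RE = re.compile(r"https?://\S+")
# Simplistic domain matcher; it would also match file names such as "config.json".
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")
TRAILING_PREP_RE = re.compile(r"\s+(?:for|on|at|in)$")


def canonicalize_title(title: str) -> str:
    """Lowercase, strip URLs and domains, then resolve known synonyms."""
    t = title.lower()
    t = URL_RE.sub(" ", t)
    t = DOMAIN_RE.sub(" ", t)
    t = " ".join(t.split())            # collapse whitespace left by the removals
    t = TRAILING_PREP_RE.sub("", t)    # drop a dangling "for" / "on" / ...
    return " ".join(SYNONYMS.get(word, word) for word in t.split())


def exact_dedup(findings: list) -> list:
    """Merge findings that share canonical title, endpoint, and HTTP method."""
    merged = {}
    for f in findings:
        key = (canonicalize_title(f["title"]), f["endpoint"], f["method"].upper())
        if key not in merged:
            merged[key] = {**f, "evidence": list(f["evidence"])}
            continue
        kept = merged[key]
        kept["evidence"].extend(f["evidence"])  # combine evidence
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = f["severity"]    # keep the highest severity
    return list(merged.values())
```

With these assumptions, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header on https://example.com/login" canonicalize to the same title, so they merge when the endpoint and method also match.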

### Phase 2: CWE-Based Dedup

After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both `security_header_missing` and `tls_misconfiguration`).

The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
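
A minimal Python sketch of this phase follows. It assumes findings are dictionaries with `cwe`, `endpoint`, `severity`, `evidence`, and `description` keys, and that findings without a CWE are passed through unmerged; neither assumption is confirmed by the scanner's code.

```python
from collections import defaultdict

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def primary_sort_key(finding: dict) -> tuple:
    """Rank by severity, then amount of evidence, then description length."""
    return (
        SEVERITY_RANK[finding["severity"]],
        len(finding["evidence"]),
        len(finding["description"]),
    )


def cwe_dedup(findings: list) -> list:
    """Merge findings that share a CWE and an endpoint."""
    groups = defaultdict(list)
    passthrough = []
    for f in findings:
        if f.get("cwe"):
            groups[(f["cwe"], f["endpoint"])].append(f)
        else:
            passthrough.append(f)  # assumption: no CWE means no CWE-based merge

    result = []
    for group in groups.values():
        primary = max(group, key=primary_sort_key)
        evidence = list(primary["evidence"])
        for f in group:
            if f is not primary:
                evidence.extend(f["evidence"])  # preserve merged evidence
        result.append({**primary, "evidence": evidence})
    return result + passthrough
```

Using a tuple as the sort key applies the tie-breakers in order: evidence count only matters when severities are equal, and description length only when both are equal.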

### When Dedup Applies

- **At insertion time**: During a pentest session, before each finding is stored in MongoDB
- **At report export**: When generating a pentest report, all session findings are deduplicated before rendering

## PR Review Comment Dedup

PR review comments are deduplicated to prevent posting the same finding multiple times:

- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
- Within a single review run, duplicate findings are skipped
- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
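
The sketch below shows one way a fingerprint could be embedded in and recovered from a comment body. The marker format `<!-- compliance-scanner:fingerprint=... -->`, the use of SHA-256 here, and all function names are hypothetical; the scanner's actual marker may differ.

```python
import hashlib
import re

# Hypothetical marker format; invisible when the Markdown body is rendered.
MARKER_RE = re.compile(r"<!-- compliance-scanner:fingerprint=([0-9a-f]{64}) -->")


def comment_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    """Hash the fields that identify one review comment (hash choice is assumed)."""
    material = "\x1f".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def render_comment(body: str, fingerprint: str) -> str:
    """Append the fingerprint as an HTML comment after the visible text."""
    return f"{body}\n\n<!-- compliance-scanner:fingerprint={fingerprint} -->"


def extract_fingerprints(text: str) -> set:
    """Collect every fingerprint marker found in an existing review body."""
    return set(MARKER_RE.findall(text))


def dedup_within_run(repo: str, pr_number: int, findings: list) -> list:
    """Skip findings whose fingerprint was already seen in this run."""
    seen = set()
    unique = []
    for f in findings:
        fp = comment_fingerprint(repo, pr_number, f["file"], f["line"], f["title"])
        if fp in seen:
            continue
        seen.add(fp)
        unique.append(f)
    return unique
```

A later run could seed `seen` with `extract_fingerprints(...)` applied to previously posted review bodies, which is what makes cross-run dedup possible.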

## Issue Tracker Dedup

Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:

1. Searches for an existing issue matching the finding's fingerprint
2. Falls back to searching by issue title
3. Skips creation if a match is found
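
The three steps can be sketched against a tracker-agnostic interface. The `Tracker` protocol, the `InMemoryTracker` stand-in, and the `Fingerprint:` line in the issue body are all hypothetical; real GitHub, GitLab, Jira, and Gitea clients expose different search APIs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Issue:
    title: str
    body: str


class Tracker(Protocol):
    """Hypothetical minimal interface over an issue tracker client."""

    def search(self, query: str) -> list[Issue]: ...
    def create(self, title: str, body: str) -> Issue: ...


class InMemoryTracker:
    """Test double that searches titles and bodies by substring."""

    def __init__(self) -> None:
        self.issues: list[Issue] = []

    def search(self, query: str) -> list[Issue]:
        return [i for i in self.issues if query in i.title or query in i.body]

    def create(self, title: str, body: str) -> Issue:
        issue = Issue(title, body)
        self.issues.append(issue)
        return issue


def create_issue_if_new(tracker: Tracker, title: str, body: str, fingerprint: str) -> Optional[Issue]:
    """Return the created issue, or None when an existing one matches."""
    # Step 1: look for the fingerprint in existing issues.
    if tracker.search(fingerprint):
        return None
    # Step 2: fall back to an exact title match.
    if any(issue.title == title for issue in tracker.search(title)):
        return None
    # Step 3 is the early returns above; otherwise create the issue and
    # store the fingerprint in the body so the next run can find it.
    return tracker.create(title, f"{body}\n\nFingerprint: {fingerprint}")
```

The title fallback in this sketch covers issues whose fingerprint is not present in the body, for example because someone edited the issue text.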

## Code Review Dedup

Multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:

- Findings within 3 lines of each other in the same file with similar normalized titles are considered duplicates
- The finding with the highest severity is kept
- CWE information is merged from duplicates
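
The rules above can be approximated with a pairwise comparison, as in the Python sketch below. The scanner is described as using proximity-aware keys, so this is a swapped-in technique rather than its actual method; the 0.8 similarity threshold, the `difflib` comparison, and the field names are also assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
LINE_WINDOW = 3          # "within 3 lines", from the rule above
TITLE_SIMILARITY = 0.8   # assumed threshold for "similar" titles


def normalize_title(title: str) -> str:
    """Lowercase and replace punctuation runs with single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def is_duplicate(a: dict, b: dict) -> bool:
    """Same file, nearby lines, and similar normalized titles."""
    if a["file"] != b["file"] or abs(a["line"] - b["line"]) > LINE_WINDOW:
        return False
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= TITLE_SIMILARITY


def dedup_review_findings(findings: list) -> list:
    """Collapse duplicates across passes, keeping the highest severity."""
    kept = []
    for f in findings:
        match = next((k for k in kept if is_duplicate(k, f)), None)
        if match is None:
            kept.append({**f, "cwes": set(f.get("cwes", ()))})
            continue
        cwes = match["cwes"] | set(f.get("cwes", ()))  # merge CWE information
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[match["severity"]]:
            match.update(f)  # the higher-severity finding replaces the kept one
        match["cwes"] = cwes
    return kept
```

The pairwise scan is quadratic in the number of findings, which is acceptable for a single review but is one reason a key-based approach may be preferred in practice.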