# Finding Deduplication
The Compliance Scanner automatically deduplicates findings across all scanning surfaces to prevent noise and duplicate issues.
## SAST Finding Dedup
Static analysis findings are deduplicated using SHA-256 fingerprints computed from:

- Repository ID
- Scanner rule ID (e.g., Semgrep check ID)
- File path
- Line number

Before inserting a new finding, the pipeline checks whether a finding with the same fingerprint already exists. If one does, the new finding is skipped.
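
The fingerprint check described above can be sketched as follows. This is a minimal illustration, not the scanner's actual code: the function names, the field order, the `|` separator, and the in-memory `seen` set (standing in for the database lookup) are all assumptions.

```python
import hashlib


def sast_fingerprint(repo_id: str, rule_id: str, file_path: str, line: int) -> str:
    """Return a SHA-256 hex digest over the four identifying fields."""
    # Assumed encoding: join the fields with a separator so that
    # ("ab", "c") and ("a", "bc") do not hash to the same value.
    payload = "|".join([repo_id, rule_id, file_path, str(line)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def insert_if_new(finding: dict, seen: set[str]) -> bool:
    """Record the finding and return True, or return False for a duplicate."""
    fp = sast_fingerprint(
        finding["repo_id"], finding["rule_id"], finding["file_path"], finding["line"]
    )
    if fp in seen:
        return False  # same fingerprint already stored: skip
    seen.add(fp)
    return True
```

Because the line number is one of the inputs, a finding whose line shifts after an unrelated edit would produce a different fingerprint under this sketch.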
## DAST / Pentest Finding Dedup
Dynamic testing findings go through two-phase deduplication:
### Phase 1: Exact Dedup
Findings with the same canonicalized title, endpoint, and HTTP method are merged. Evidence from the duplicates is combined into a single finding, which keeps the highest severity among them.

**Title canonicalization** handles common variations:

- Domain names and URLs are stripped from titles (e.g., "Missing HSTS header for example.com" becomes "Missing HSTS header")
- Known synonyms are resolved (e.g., "HSTS" maps to "strict-transport-security", "CSP" maps to "content-security-policy")
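
A rough sketch of Phase 1, assuming findings are plain dictionaries. Only the two synonym mappings come from the text above; the regular expressions, the lowercasing, the trailing-preposition cleanup, the severity scale, and the field names are illustrative assumptions.

```python
import re

# Only these two mappings are named in the text; a real table would be longer.
SYNONYMS = {"hsts": "strict-transport-security", "csp": "content-security-policy"}
URL_RE = re.compile(r"https?://\S+")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")
# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def canonicalize_title(title: str) -> str:
    """Lowercase, strip URLs and domains, and resolve known synonyms."""
    text = URL_RE.sub(" ", title.lower())
    text = DOMAIN_RE.sub(" ", text)
    words = [SYNONYMS.get(w, w) for w in text.split()]
    # Drop a dangling preposition left behind by a stripped domain ("... for").
    while words and words[-1] in {"for", "on", "at"}:
        words.pop()
    return " ".join(words)


def exact_dedup(findings: list[dict]) -> list[dict]:
    """Merge findings that share canonical title, endpoint, and HTTP method."""
    merged: dict[tuple[str, str, str], dict] = {}
    for f in findings:
        key = (canonicalize_title(f["title"]), f["endpoint"], f["method"].upper())
        if key not in merged:
            merged[key] = {**f, "evidence": list(f["evidence"])}
            continue
        kept = merged[key]
        kept["evidence"].extend(f["evidence"])  # combine evidence
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[kept["severity"]]:
            kept["severity"] = f["severity"]  # keep the highest severity
    return list(merged.values())
```

With this sketch, "Missing HSTS header for example.com" and "Missing Strict-Transport-Security header" both canonicalize to `missing strict-transport-security header`, so two such findings on the same endpoint and method collapse into one.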
### Phase 2: CWE-Based Dedup
After exact dedup, findings with the same CWE and endpoint are merged. This catches cases where different tools report the same underlying issue with different titles or vulnerability types (e.g., a missing HSTS header reported as both `security_header_missing` and `tls_misconfiguration`).
The primary finding is selected by highest severity, then most evidence, then longest description. Evidence from merged findings is preserved.
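
The selection order described above (severity, then evidence count, then description length) maps naturally onto a tuple sort key. The sketch below assumes dictionary-shaped findings and a five-level severity scale; the field names and the choice to leave findings without a CWE unmerged are assumptions, not documented behavior.

```python
# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


def primary_key(finding: dict) -> tuple[int, int, int]:
    """Rank by severity, then amount of evidence, then description length."""
    return (
        SEVERITY_RANK[finding["severity"]],
        len(finding["evidence"]),
        len(finding["description"]),
    )


def cwe_dedup(findings: list[dict]) -> list[dict]:
    """Merge findings that share a CWE and an endpoint."""
    groups: dict[tuple, list[dict]] = {}
    for index, f in enumerate(findings):
        cwe = f.get("cwe")
        # Assumption: a finding without a CWE is never merged in this phase,
        # so it gets a key that is unique to its position.
        key = (cwe, f["endpoint"]) if cwe else ("no-cwe", index)
        groups.setdefault(key, []).append(f)

    result = []
    for group in groups.values():
        primary = max(group, key=primary_key)
        # Preserve evidence from every merged finding, in input order.
        evidence = [item for f in group for item in f["evidence"]]
        result.append({**primary, "evidence": evidence})
    return result
```

Running this after the exact-match phase means the CWE grouping only has to handle findings whose titles genuinely differ.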
### When Dedup Applies
- **At insertion time**: During a pentest session, before each finding is stored in MongoDB
- **At report export**: When generating a pentest report, all session findings are deduplicated before rendering
## PR Review Comment Dedup
PR review comments are deduplicated to prevent posting the same finding multiple times:

- Each comment includes a fingerprint computed from the repository, PR number, file path, line, and finding title
- Within a single review run, duplicate findings are skipped
- The fingerprint is embedded as an HTML comment in the review body for future cross-run dedup
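
One way to picture the embedded fingerprint is shown below. The text does not specify the hash function or the marker format for PR comments, so the use of SHA-256, the `finding-fingerprint` marker, and every function name here are assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical marker format; HTML comments are not rendered in Markdown views.
MARKER_RE = re.compile(r"<!-- finding-fingerprint:([0-9a-f]{64}) -->")


def comment_fingerprint(repo: str, pr_number: int, file_path: str, line: int, title: str) -> str:
    """Hash the five identifying fields (SHA-256 is an assumption here)."""
    payload = "|".join([repo, str(pr_number), file_path, str(line), title])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def render_comment(body: str, fingerprint: str) -> str:
    """Append the fingerprint as an invisible HTML comment."""
    return f"{body}\n\n<!-- finding-fingerprint:{fingerprint} -->"


def extract_fingerprints(text: str) -> set[str]:
    """Recover fingerprints from previously posted review text."""
    return set(MARKER_RE.findall(text))


def comments_to_post(repo: str, pr_number: int, findings: list[dict], already_posted: set[str]) -> list[str]:
    """Skip findings already seen in this run or in earlier review text."""
    seen = set(already_posted)
    comments = []
    for f in findings:
        fp = comment_fingerprint(repo, pr_number, f["file"], f["line"], f["title"])
        if fp in seen:
            continue
        seen.add(fp)
        comments.append(render_comment(f["message"], fp))
    return comments
```

Passing an empty `already_posted` set gives the within-run behavior; feeding it the fingerprints extracted from existing review bodies would be one possible route to cross-run dedup.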
## Issue Tracker Dedup
Before creating an issue in GitHub, GitLab, Jira, or Gitea, the scanner:

1. Searches for an existing issue matching the finding's fingerprint
2. Falls back to searching by issue title if no fingerprint match is found
3. Skips creation if either search finds a match
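
The three steps above can be sketched against a generic tracker interface. The `IssueTracker` protocol, its `search` and `create` methods, and the `InMemoryTracker` stand-in are hypothetical; they do not correspond to any real GitHub, GitLab, Jira, or Gitea client API.

```python
from typing import Optional, Protocol


class IssueTracker(Protocol):
    """Hypothetical minimal interface shared by all tracker backends."""

    def search(self, query: str) -> list[str]: ...
    def create(self, title: str, body: str) -> str: ...


def ensure_issue(tracker: IssueTracker, title: str, body: str, fingerprint: str) -> Optional[str]:
    """Create an issue unless a match exists; return the new ID or None."""
    if tracker.search(fingerprint):  # step 1: match by fingerprint
        return None
    if tracker.search(title):  # step 2: fall back to the title
        return None
    # Assumption: the fingerprint is stored in the body so step 1 can find it later.
    return tracker.create(title, f"{body}\n\nFingerprint: {fingerprint}")


class InMemoryTracker:
    """Toy tracker used only to exercise the flow."""

    def __init__(self) -> None:
        self.issues: dict[str, tuple[str, str]] = {}

    def search(self, query: str) -> list[str]:
        return [i for i, (t, b) in self.issues.items() if query in t or query in b]

    def create(self, title: str, body: str) -> str:
        issue_id = f"ISSUE-{len(self.issues) + 1}"
        self.issues[issue_id] = (title, body)
        return issue_id
```

The title fallback covers issues that exist without an embedded fingerprint, for example ones filed by hand.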
## Code Review Dedup
Findings from multi-pass LLM code reviews (logic, security, convention, complexity) are deduplicated across passes using proximity-aware keys:

- Findings within 3 lines of each other in the same file with similar normalized titles are considered duplicates
- The finding with the highest severity is kept
- CWE information is merged from duplicates
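
A simplified sketch of the proximity rule follows. It makes several assumptions: "similar" titles are approximated by equality after normalization, "within 3 lines" is read as a line difference of at most 3, and findings are compared pairwise rather than through precomputed keys. Field names and the severity scale are illustrative.

```python
import re

# Assumed severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
LINE_WINDOW = 3


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def is_duplicate(a: dict, b: dict) -> bool:
    """Same file, nearby lines, and matching normalized titles."""
    return (
        a["file"] == b["file"]
        and abs(a["line"] - b["line"]) <= LINE_WINDOW
        and normalize_title(a["title"]) == normalize_title(b["title"])
    )


def dedup_review_findings(findings: list[dict]) -> list[dict]:
    """Collapse near-duplicates across review passes."""
    kept: list[dict] = []
    for f in findings:
        match = next((k for k in kept if is_duplicate(k, f)), None)
        if match is None:
            kept.append({**f, "cwes": set(f.get("cwes", ()))})
            continue
        cwes = match["cwes"] | set(f.get("cwes", ()))  # merge CWE information
        if SEVERITY_RANK[f["severity"]] > SEVERITY_RANK[match["severity"]]:
            match.update(f)  # the higher-severity finding replaces the kept one
        match["cwes"] = cwes
    return kept
```

A fuzzy string comparison could replace the equality check in `is_duplicate` if titles from different passes vary more than punctuation and case.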