feat: deduplicate code review findings across LLM passes #48

Merged
sharang merged 2 commits from feat/dedup-code-review into main 2026-03-29 20:38:53 +00:00
5 changed files with 486 additions and 5 deletions

View File

@@ -95,8 +95,8 @@ pub async fn export_session_report(
Err(_) => Vec::new(),
};
// Fetch DAST findings for this session
let findings: Vec<DastFinding> = match agent
// Fetch DAST findings for this session, then deduplicate
Review

[medium] Inconsistent error handling pattern in export_session_report

The code uses match expressions for database operations but handles errors by returning empty vectors. This creates inconsistency with other parts of the codebase that likely use more robust error propagation patterns.

Suggested fix: Use consistent error handling approach throughout the codebase, possibly propagating errors instead of silently returning empty collections

*Scanner: code-review/convention*

Review

[medium] Deeply nested control flow in export_session_report

The export_session_report function has deeply nested control flow with multiple match statements and conditional blocks that make it difficult to follow the main logic path. The deduplication process adds additional nesting.

Suggested fix: Consider extracting the DAST findings fetching and deduplication logic into a separate function to flatten the control flow and improve readability.

*Scanner: code-review/complexity*

let raw_findings: Vec<DastFinding> = match agent
.db
.dast_findings()
.find(doc! { "session_id": &id })
@@ -106,6 +106,14 @@ pub async fn export_session_report(
Ok(cursor) => collect_cursor_async(cursor).await,
Err(_) => Vec::new(),
};
let raw_count = raw_findings.len();
let findings = crate::pipeline::dedup::dedup_dast_findings(raw_findings);
if findings.len() < raw_count {
tracing::info!(
Review

[medium] Potential Information Disclosure Through Tracing Logs

The code logs information about deduplicated findings including raw counts and final counts. If sensitive data is included in these logs, it could expose internal system details or potentially sensitive findings to unauthorized parties.

Suggested fix: Ensure that logging does not include sensitive information from findings. Consider using structured logging with configurable log levels to prevent accidental exposure of sensitive data.

*Scanner: code-review/security | CWE: CWE-200*

"Deduped DAST findings for session {id}: {raw_count} → {}",
findings.len()
);
}
// Fetch SAST findings, SBOM, and code context for the linked repository
let repo_id = session

View File

@@ -321,9 +321,38 @@ impl PentestOrchestrator {
total_findings += findings_count;
let mut finding_ids: Vec<String> = Vec::new();
for mut finding in result.findings {
// Dedup findings within this tool result before inserting
let deduped_findings =
crate::pipeline::dedup::dedup_dast_findings(
result.findings,
);
for mut finding in deduped_findings {
finding.scan_run_id = session_id.clone();
finding.session_id = Some(session_id.clone());
// Check for existing duplicate in this session
let fp = crate::pipeline::dedup::compute_dast_fingerprint(
&finding,
);
let existing = self
.db
.dast_findings()
.find_one(doc! {
Review

[medium] Insecure Deduplication Logic in DAST Findings

The deduplication logic for DAST findings relies on a combination of title, endpoint, and method fields which may not be sufficient to uniquely identify findings. This could lead to false negatives where actual duplicates are not detected due to minor variations in these fields.

Suggested fix: Enhance the deduplication criteria by incorporating additional unique identifiers such as vulnerability signature hashes or content-based fingerprints that are more robust than just title, endpoint, and method.

*Scanner: code-review/security | CWE: CWE-400*

"session_id": &session_id,
"title": &finding.title,
"endpoint": &finding.endpoint,
"method": &finding.method,
})
Review

[high] Potential race condition in duplicate checking

In compliance-agent/src/pentest/orchestrator.rs, the duplicate check for DAST findings is performed before insertion but not atomically with the insertion. This could lead to race conditions where two concurrent processes might both pass the duplicate check and insert the same finding.

Suggested fix: Use a database-level unique constraint or atomic upsert operation instead of separate find and insert operations to prevent race conditions.

*Scanner: code-review/logic*

Review

[medium] Potential performance issue in PentestOrchestrator with redundant database queries

The code performs a database query for each finding to check for duplicates, which could be inefficient when processing many findings. Consider batching these checks or using a more efficient deduplication strategy.

Suggested fix: Consider optimizing duplicate detection by either batching database queries or pre-loading existing findings into memory for faster lookup

*Scanner: code-review/convention*

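One way to act on the batching suggestion above would be to fetch the session's existing fingerprints once and filter in memory. A minimal sketch with stand-in types (`filter_new` and the `(fingerprint, title)` pairs are hypothetical, not part of this PR):

```rust
use std::collections::HashSet;

// Hypothetical sketch, not part of this PR: pre-load the fingerprints already
// stored for the session, then drop duplicate candidates in memory instead of
// issuing one find_one query per finding.
fn filter_new(
    existing_fps: Vec<String>,
    candidates: Vec<(String, String)>, // (fingerprint, title)
) -> Vec<String> {
    let mut seen: HashSet<String> = existing_fps.into_iter().collect();
    candidates
        .into_iter()
        // insert() returns false when the fingerprint was already present
        .filter_map(|(fp, title)| seen.insert(fp).then_some(title))
        .collect()
}
```

As a side effect, the same `seen` set also deduplicates the incoming batch against itself, which the query-per-finding approach only catches after each insert lands.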
.await;
if matches!(existing, Ok(Some(_))) {
Review

[medium] Complex boolean expression in duplicate checking logic

The condition if finding.severity > existing.severity || (finding.severity == existing.severity && finding.description.len() > existing.description.len()) in orchestrator.rs is complex and hard to reason about. It combines multiple comparisons that could be simplified into a helper function.

Suggested fix: Extract the severity comparison logic into a separate function like should_replace_finding to improve readability and maintainability.

*Scanner: code-review/complexity*

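The extraction suggested above could look like this (a sketch only; `Sev` and `Candidate` are stand-ins for the real severity and finding types):

```rust
// Hypothetical stand-ins for the PR's severity enum and finding struct.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Sev {
    Low,
    Medium,
    High,
}

struct Candidate {
    severity: Sev,
    description: String,
}

// Replace the existing finding when the new one is strictly more severe, or
// equally severe but described in more detail.
fn should_replace_finding(new: &Candidate, existing: &Candidate) -> bool {
    new.severity > existing.severity
        || (new.severity == existing.severity
            && new.description.len() > existing.description.len())
}
```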
tracing::debug!(
"Skipping duplicate DAST finding: {} (fp={:.12})",
finding.title,
fp,
);
continue;
}
let insert_result =
self.db.dast_findings().insert_one(&finding).await;
if let Ok(res) = &insert_result {

View File

@@ -66,8 +66,10 @@ impl CodeReviewScanner {
}
}
let deduped = dedup_cross_pass(all_findings);
ScanOutput {
findings: all_findings,
findings: deduped,
sbom_entries: Vec::new(),
}
}
Review

[medium] Inconsistent error handling in dedup_cross_pass function

The dedup_cross_pass function uses unwrap_or which can lead to unexpected behavior when file paths are None. This pattern should be handled more explicitly to avoid potential panics or silent data loss.

Suggested fix: Replace f.file_path.as_deref().unwrap_or("") with explicit error handling or a default value that makes sense for the deduplication logic, such as using a placeholder string like "unknown_file".

*Scanner: code-review/convention*

@@ -184,3 +186,51 @@ struct ReviewIssue {
#[serde(default)]
suggestion: Option<String>,
}
Review

[low] Missing type annotation for dedup_cross_pass parameter

The function dedup_cross_pass takes a parameter findings without explicit type annotation, which could lead to confusion about expected input types and make refactoring harder.

Suggested fix: Add explicit type annotation to parameter: findings: Vec<Finding>

*Scanner: code-review/convention*

Review

[high] Complex deduplication logic in cross-pass code review

The dedup_cross_pass function contains complex logic for building deduplication keys and comparing findings. The combination of string formatting, word normalization, and multi-criteria comparison makes this function hard to understand and maintain.

Suggested fix: Break down the deduplication logic into smaller helper functions: one for generating the dedup key, another for comparing findings by severity/description length, and a third for merging CWE information.

*Scanner: code-review/complexity*

/// Deduplicate findings across review passes.
///
/// Multiple passes often flag the same issue (e.g. SQL injection reported by
/// logic, security, and convention passes). We group by file + nearby line +
/// normalized title keywords and keep the highest-severity finding.
Review

[medium] Weak Deduplication Key Construction in Cross-Pass Code Review

The deduplication key construction in dedup_cross_pass function uses a simple line bucketing approach (line_number / 4) which might incorrectly group unrelated findings or fail to detect true duplicates when line numbers vary slightly. Additionally, the normalization of titles may not handle all edge cases properly.

Suggested fix: Improve the deduplication key generation by using more sophisticated techniques like fuzzy matching or semantic analysis to better identify truly duplicate findings regardless of minor differences in formatting or location.

*Scanner: code-review/security | CWE: CWE-400*

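One cheap step toward the fuzzy matching suggested above is word-set (Jaccard) similarity over normalized titles. A sketch (hypothetical helper, not in this PR):

```rust
use std::collections::HashSet;

// Hypothetical helper: Jaccard similarity of normalized title word sets.
// A threshold (e.g. >= 0.8) could then decide whether two findings match
// even when their titles differ slightly in wording or order.
fn title_similarity(a: &str, b: &str) -> f64 {
    let words = |t: &str| -> HashSet<String> {
        t.to_lowercase()
            .split(|c: char| !c.is_alphanumeric())
            .filter(|w| w.len() > 2)
            .map(str::to_owned)
            .collect()
    };
    let (wa, wb) = (words(a), words(b));
    if wa.is_empty() && wb.is_empty() {
        return 1.0;
    }
    let inter = wa.intersection(&wb).count() as f64;
    let union = wa.union(&wb).count() as f64;
    inter / union
}
```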
fn dedup_cross_pass(findings: Vec<Finding>) -> Vec<Finding> {
use std::collections::HashMap;
Review

[medium] Off-by-one error in line bucket calculation

In compliance-agent/src/pipeline/code_review.rs, the line bucket calculation uses f.line_number.unwrap_or(0) / 4 which will put line 0, 1, 2, 3 into bucket 0, but line 4 will be in bucket 1. This means that lines 3 and 4 are considered different buckets when they should probably be grouped together for deduplication purposes.

Suggested fix: Change the line bucket calculation to (f.line_number.unwrap_or(0) + 3) / 4 to properly group lines within 3 of each other.

*Scanner: code-review/logic*

// Build a dedup key: (file, line bucket, normalized title words)
fn dedup_key(f: &Finding) -> String {
let file = f.file_path.as_deref().unwrap_or("");
// Bucket lines into windows of 4 (lines in the same window dedup together)
let line_bucket = f.line_number.unwrap_or(0) / 4;
// Normalize: lowercase, keep only alphanumeric, sort words for order-independence
let title_lower = f.title.to_lowercase();
let mut words: Vec<&str> = title_lower
.split(|c: char| !c.is_alphanumeric())
.filter(|w| w.len() > 2)
.collect();
words.sort();
format!("{file}:{line_bucket}:{}", words.join(","))
}
let mut groups: HashMap<String, Finding> = HashMap::new();
for finding in findings {
let key = dedup_key(&finding);
Review

[high] Incorrect deduplication logic in cross-pass deduplication

In compliance-agent/src/pipeline/code_review.rs, the deduplication function compares descriptions by length rather than content. This means that a finding with a longer description will overwrite a finding with a shorter one even if the shorter one has higher severity, which is incorrect behavior.

Suggested fix: Fix the comparison logic to prioritize severity first, then consider description length only as a tiebreaker, or better yet, compare actual description content for meaningful deduplication.

*Scanner: code-review/logic*

groups
.entry(key)
.and_modify(|existing| {
// Keep the higher severity; on tie, keep the one with more detail
if finding.severity > existing.severity
|| (finding.severity == existing.severity
&& finding.description.len() > existing.description.len())
{
*existing = finding.clone();
}
// Merge CWE if the existing one is missing it
if existing.cwe.is_none() {
existing.cwe = finding.cwe.clone();
}
})
.or_insert(finding);
}
groups.into_values().collect()
}
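The `dedup_key` scheme above can be exercised in isolation. This standalone replica uses a minimal stand-in struct (`F` is hypothetical, not the PR's `Finding` type) to show that reworded titles on nearby lines collapse to one key:

```rust
// Minimal stand-in for the PR's Finding type.
struct F {
    file: &'static str,
    line: u32,
    title: &'static str,
}

// Same keying scheme as dedup_cross_pass: file + line window + sorted,
// normalized title words (words of length <= 2 are dropped).
fn dedup_key(f: &F) -> String {
    let line_bucket = f.line / 4;
    let title_lower = f.title.to_lowercase();
    let mut words: Vec<&str> = title_lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| w.len() > 2)
        .collect();
    words.sort();
    format!("{}:{line_bucket}:{}", f.file, words.join(","))
}
```

Note the window boundaries: lines 8 and 9 share a bucket, but lines 8 and 12 do not, which is the boundary effect one of the reviews flags.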

View File

@@ -1,5 +1,7 @@
use sha2::{Digest, Sha256};
use compliance_core::models::dast::DastFinding;
pub fn compute_fingerprint(parts: &[&str]) -> String {
let mut hasher = Sha256::new();
for part in parts {
@@ -9,9 +11,209 @@ pub fn compute_fingerprint(parts: &[&str]) -> String {
hex::encode(hasher.finalize())
}
/// Compute a dedup fingerprint for a DAST finding.
Review

[medium] Insecure Hash Usage for Deduplication

While SHA256 is cryptographically secure, using it for deduplication purposes without proper input sanitization can lead to issues. The function computes fingerprints based on user-provided data (titles, endpoints, methods) which could potentially be crafted to cause hash collisions or other issues if not properly validated.

Suggested fix: Ensure that the inputs to compute_fingerprint are properly sanitized and validated before hashing. Consider adding input length limits to prevent potential DoS attacks through excessively long inputs.

*Scanner: code-review/security | CWE: CWE-327*

///
/// The key is derived from the *canonicalized* title (lowercased, domain names
/// stripped, known synonyms resolved), endpoint, and HTTP method. This lets us
/// detect both exact duplicates (same tool reporting twice across passes) and
/// semantic duplicates (e.g., `security_header_missing` "Missing HSTS header"
/// vs `tls_misconfiguration` "Missing strict-transport-security header").
pub fn compute_dast_fingerprint(f: &DastFinding) -> String {
let canon = canonicalize_dast_title(&f.title);
let endpoint = f.endpoint.to_lowercase().trim_end_matches('/').to_string();
let method = f.method.to_uppercase();
let param = f.parameter.as_deref().unwrap_or("");
compute_fingerprint(&[&canon, &endpoint, &method, param])
}
/// Canonicalize a DAST finding title for dedup purposes.
///
/// 1. Lowercase
/// 2. Strip domain names / URLs (e.g. "for comp-dev.meghsakha.com")
/// 3. Resolve known header synonyms (hsts ↔ strict-transport-security, etc.)
/// 4. Strip extra whitespace
fn canonicalize_dast_title(title: &str) -> String {
let mut s = title.to_lowercase();
// Strip "for <domain>" or "on <domain>" suffixes
Review

[high] Incorrect string truncation in canonicalize_dast_title

The canonicalize_dast_title function uses s.truncate(idx) which removes everything from index idx onwards, but this can leave trailing spaces or break word boundaries. For example, 'test for example.com' becomes 'test ' instead of 'test'.

Suggested fix: Replace s.truncate(idx) with s.truncate(idx.saturating_sub(1)) or better yet, use a more robust approach that properly handles whitespace around the truncated portion.

*Scanner: code-review/logic*

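If trailing whitespace at the cut point is a concern, a small helper can trim after truncating (hypothetical, not in the PR; note that the final `split_whitespace` collapse in `canonicalize_dast_title` would also absorb a stray trailing space):

```rust
// Hypothetical helper: truncate, then drop any whitespace left at the cut.
fn truncate_trimmed(s: &mut String, idx: usize) {
    s.truncate(idx);
    let trimmed_len = s.trim_end().len();
    s.truncate(trimmed_len);
}
```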
Review

[medium] Insecure String Manipulation in Canonicalization

The canonicalize_dast_title function uses manual string manipulation and pattern matching that could be vulnerable to injection or unexpected behavior. Specifically, the domain stripping logic relies on simple substring checks which may not properly handle all edge cases, potentially leading to incorrect deduplication or information leakage.

Suggested fix: Use more robust parsing techniques or regular expressions for domain/URL detection instead of simple substring matching. Consider using a dedicated URL parsing library to properly identify and strip domains.

*Scanner: code-review/security | CWE: CWE-1104*

// Pattern: "for <word.word...>" or "on <method> <url>"
if let Some(idx) = s.find(" for ") {
// Check if what follows looks like a domain or URL
let rest = &s[idx + 5..];
if rest.contains('.') || rest.starts_with("http") {
s.truncate(idx);
}
}
if let Some(idx) = s.find(" on ") {
let rest = &s[idx + 4..];
if rest.contains("http") || rest.contains('/') {
s.truncate(idx);
}
}
// Resolve known header synonyms
Review

[high] Potential index out of bounds in canonicalize_dast_title

In canonicalize_dast_title, when checking for synonyms, the code accesses bytes at position pos + short.len() without ensuring that pos + short.len() is within bounds of the string. This could cause a panic if short is near the end of the string.

Suggested fix: Add bounds checking before accessing s.as_bytes()[pos + short.len()] to prevent potential panics.

*Scanner: code-review/logic*

let synonyms: &[(&str, &str)] = &[
Review

[high] Potential panic in canonicalize_dast_title due to unchecked indexing

In canonicalize_dast_title, there's a potential panic when accessing s.as_bytes()[pos - 1] without ensuring pos > 0. If pos is 0, this would access memory before the string, causing undefined behavior.

Suggested fix: Add a bounds check before accessing s.as_bytes()[pos - 1] to ensure pos > 0 before dereferencing.

*Scanner: code-review/convention*

("hsts", "strict-transport-security"),
("csp", "content-security-policy"),
("cors", "cross-origin-resource-sharing"),
Review

[medium] Complex boolean expression in canonicalize_dast_title

The canonicalize_dast_title function contains complex boolean logic for checking word boundaries when replacing synonyms. The condition before_ok && after_ok involves multiple checks that are hard to follow and maintain.

Suggested fix: Extract the boundary checking logic into a separate helper function to improve readability and maintainability.

*Scanner: code-review/complexity*

("xfo", "x-frame-options"),
];
for &(short, canonical) in synonyms {
// Only replace whole words — check boundaries
if let Some(pos) = s.find(short) {
let before_ok = pos == 0 || !s.as_bytes()[pos - 1].is_ascii_alphanumeric();
let after_ok = pos + short.len() >= s.len()
|| !s.as_bytes()[pos + short.len()].is_ascii_alphanumeric();
if before_ok && after_ok {
s = format!("{}{}{}", &s[..pos], canonical, &s[pos + short.len()..]);
}
}
}
// Collapse whitespace
s.split_whitespace().collect::<Vec<_>>().join(" ")
}
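The boundary check flagged in the reviews above could be factored into one helper, sketched here with a stand-in name (`replace_word` is hypothetical, not part of the PR):

```rust
// Hypothetical extraction of the whole-word synonym replacement: only
// substitute when the match is not flanked by ASCII alphanumerics.
fn replace_word(s: &str, from: &str, to: &str) -> String {
    if let Some(pos) = s.find(from) {
        let bytes = s.as_bytes();
        let before_ok = pos == 0 || !bytes[pos - 1].is_ascii_alphanumeric();
        let end = pos + from.len();
        let after_ok = end >= s.len() || !bytes[end].is_ascii_alphanumeric();
        if before_ok && after_ok {
            return format!("{}{}{}", &s[..pos], to, &s[end..]);
        }
    }
    s.to_string()
}
```

Because `pos == 0` and `end >= s.len()` are checked first, the short-circuiting `||` keeps both index accesses in bounds, which addresses the two panic findings above.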
/// Deduplicate a list of DAST findings, merging evidence from duplicates.
///
/// Two-phase approach:
/// 1. **Exact dedup** — group by canonicalized `(title, endpoint, method, parameter)`.
/// Merge evidence arrays, keep the highest severity, preserve exploitable flag.
/// 2. **CWE-based dedup** — within the same `(cwe, endpoint, method)` group, merge
/// findings whose canonicalized titles resolve to the same subject (e.g., HSTS
/// reported as both `security_header_missing` and `tls_misconfiguration`).
pub fn dedup_dast_findings(findings: Vec<DastFinding>) -> Vec<DastFinding> {
use std::collections::HashMap;
if findings.len() <= 1 {
return findings;
}
// Phase 1: exact fingerprint dedup
let mut seen: HashMap<String, usize> = HashMap::new();
let mut deduped: Vec<DastFinding> = Vec::new();
for finding in findings {
let fp = compute_dast_fingerprint(&finding);
if let Some(&idx) = seen.get(&fp) {
// Merge into existing
merge_dast_finding(&mut deduped[idx], &finding);
} else {
seen.insert(fp, deduped.len());
deduped.push(finding);
}
}
let before = deduped.len();
// Phase 2: CWE-based related dedup
// Group by (cwe, endpoint_normalized, method) — only when CWE is present
let mut cwe_groups: HashMap<String, Vec<usize>> = HashMap::new();
for (i, f) in deduped.iter().enumerate() {
if let Some(ref cwe) = f.cwe {
Review

[high] Large function with multiple responsibilities

The dedup_dast_findings function is quite large (over 100 lines) and handles multiple distinct responsibilities: exact deduplication, CWE-based deduplication, merging logic, and result processing. This makes it hard to test and maintain.

Suggested fix: Break down this function into smaller, focused functions such as 'perform_exact_dedup', 'perform_cwe_dedup', and 'merge_findings' to improve modularity and testability.

*Scanner: code-review/complexity*

let key = format!(
"{}|{}|{}",
cwe,
f.endpoint.to_lowercase().trim_end_matches('/'),
f.method.to_uppercase(),
);
Review

[medium] Deeply nested control flow in dedup_dast_findings

The dedup_dast_findings function has deeply nested control flow with multiple nested if statements and match expressions, making it difficult to follow the execution path and increasing the risk of logic errors.

Suggested fix: Refactor the nested conditions into early returns or separate helper functions to flatten the control flow and improve readability.

*Scanner: code-review/complexity*

cwe_groups.entry(key).or_default().push(i);
}
}
// For each CWE group with multiple findings, keep the one with highest severity
// and most evidence, merge the rest into it
let mut merge_map: HashMap<usize, Vec<usize>> = HashMap::new();
let mut remove_indices: Vec<usize> = Vec::new();
for indices in cwe_groups.values() {
if indices.len() <= 1 {
continue;
}
// Find the "primary" finding: highest severity, then most evidence, then longest description
let Some(&primary_idx) = indices.iter().max_by(|&&a, &&b| {
deduped[a]
.severity
.cmp(&deduped[b].severity)
.then_with(|| deduped[a].evidence.len().cmp(&deduped[b].evidence.len()))
.then_with(|| {
Review

[medium] Inconsistent error handling in dedup_dast_findings

The dedup_dast_findings function uses unwrap_or in one place but has explicit if let Some(...) checks elsewhere. This inconsistency can lead to confusion about how errors are handled and might mask potential issues in the deduplication logic.

Suggested fix: Use consistent error handling patterns throughout the function. Prefer explicit match expressions or ? operators for better clarity and error propagation.

*Scanner: code-review/convention*

deduped[a]
.description
.len()
.cmp(&deduped[b].description.len())
})
}) else {
continue;
};
for &idx in indices {
if idx != primary_idx {
remove_indices.push(idx);
merge_map.entry(primary_idx).or_default().push(idx);
}
}
}
if !remove_indices.is_empty() {
remove_indices.sort_unstable();
remove_indices.dedup();
// Merge evidence
for (&primary, secondaries) in &merge_map {
let extra_evidence: Vec<_> = secondaries
.iter()
.flat_map(|&i| deduped[i].evidence.clone())
.collect();
let any_exploitable = secondaries.iter().any(|&i| deduped[i].exploitable);
deduped[primary].evidence.extend(extra_evidence);
if any_exploitable {
deduped[primary].exploitable = true;
}
}
// Remove merged findings (iterate in reverse to preserve indices)
for &idx in remove_indices.iter().rev() {
deduped.remove(idx);
}
}
let after = deduped.len();
if before != after {
tracing::debug!(
"DAST CWE-based dedup: {before} → {after} findings ({} merged)",
before - after
);
}
deduped
}
/// Merge a duplicate DAST finding into a primary one.
fn merge_dast_finding(primary: &mut DastFinding, duplicate: &DastFinding) {
primary.evidence.extend(duplicate.evidence.clone());
if duplicate.severity > primary.severity {
primary.severity = duplicate.severity.clone();
}
if duplicate.exploitable {
primary.exploitable = true;
}
// Keep the longer/better description
if duplicate.description.len() > primary.description.len() {
primary.description.clone_from(&duplicate.description);
}
// Keep remediation if primary doesn't have one
if primary.remediation.is_none() && duplicate.remediation.is_some() {
primary.remediation.clone_from(&duplicate.remediation);
}
}
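A couple of the convention findings below question the `clone_from` calls in `merge_dast_finding`. As a minimal sketch (with a bare `Option<String>` standing in for the real remediation field), `clone_from` is behaviorally equivalent to `*dst = src.clone()` and simply reuses the destination's allocation when possible:

```rust
// Sketch of the clone_from pattern used in merge_dast_finding. The types
// are simplified stand-ins; the real fields live on DastFinding.
fn merge_remediation(primary: &mut Option<String>, duplicate: &Option<String>) {
    // Keep remediation only if the primary doesn't already have one,
    // mirroring the guard in merge_dast_finding.
    if primary.is_none() && duplicate.is_some() {
        primary.clone_from(duplicate);
    }
}

fn main() {
    let mut primary = None;
    merge_remediation(&mut primary, &Some("enable HSTS".to_string()));
    assert_eq!(primary.as_deref(), Some("enable HSTS"));

    // An existing remediation on the primary is left untouched.
    let mut primary = Some("existing".to_string());
    merge_remediation(&mut primary, &Some("other".to_string()));
    assert_eq!(primary.as_deref(), Some("existing"));
    println!("ok");
}
```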
#[cfg(test)]
mod tests {
use super::*;
use compliance_core::models::dast::DastVulnType;
use compliance_core::models::finding::Severity;
#[test]
fn fingerprint_is_deterministic() {
@@ -55,4 +257,159 @@ mod tests {
let b = compute_fingerprint(&["a", "bc"]);
assert_ne!(a, b);
}
fn make_dast(title: &str, endpoint: &str, vuln_type: DastVulnType) -> DastFinding {
let mut f = DastFinding::new(
"run1".into(),
"target1".into(),
vuln_type,
title.into(),
format!("Description for {title}"),
Severity::Medium,
endpoint.into(),
"GET".into(),
);
f.cwe = Some("CWE-319".into());
f
}
#[test]
fn canonicalize_strips_domain_suffix() {
let canon = canonicalize_dast_title("Missing HSTS header for comp-dev.meghsakha.com");
Review

[medium] Incorrect severity comparison in merge_dast_finding

In merge_dast_finding, the condition 'if duplicate.severity > primary.severity' assumes that Severity implements Ord correctly and that the comparison works as expected. If Severity does not implement Ord properly or has custom ordering, this could lead to incorrect behavior.

Suggested fix: Verify that Severity implements Ord correctly and add explicit debug assertions to ensure the comparison behaves as expected.

*Scanner: code-review/logic | *

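The ordering concern can be checked mechanically: for a fieldless enum, `#[derive(Ord)]` compares by declaration order, so as long as the real `Severity` in compliance_core declares its variants from least to most severe, `duplicate.severity > primary.severity` behaves as intended. The enum below is a stand-in whose variant order is an assumption:

```rust
// Stand-in for compliance_core's Severity: derived Ord on a fieldless
// enum compares by declaration position, so later variants are greater.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Low,
    Medium,
    High,
    Critical,
}

fn main() {
    assert!(Severity::High > Severity::Medium);
    let max = [Severity::Low, Severity::Critical, Severity::Medium]
        .into_iter()
        .max();
    assert_eq!(max, Some(Severity::Critical));
    println!("ok");
}
```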
assert!(!canon.contains("meghsakha"), "domain should be stripped");
assert!(
canon.contains("strict-transport-security"),
"hsts should be resolved: {canon}"
Review

[medium] Potential panic in merge_dast_finding due to clone usage

In merge_dast_finding, the line 'primary.description.clone_from(&duplicate.description)' could panic if the description field is not properly initialized or if there's an issue with the clone operation. However, more importantly, the comparison logic in dedup_dast_findings might not handle all edge cases correctly.

Suggested fix: Ensure that description fields are always initialized before calling clone_from, and consider adding defensive checks.

*Scanner: code-review/logic | *

Review

[medium] Inconsistent use of clone() in merge_dast_finding

The merge_dast_finding function uses clone() on duplicate.description and duplicate.remediation, but not on other fields like duplicate.evidence. This creates an inconsistent pattern where some fields are cloned while others are moved, potentially leading to performance issues or unexpected behavior.

Suggested fix: Use clone() consistently for all fields that need to be copied, or use clone_from for fields that are already owned. Consider using std::mem::replace or similar patterns for more efficient moves.

*Scanner: code-review/convention | *

);
}
#[test]
fn canonicalize_resolves_synonyms() {
let a = canonicalize_dast_title("Missing HSTS header");
let b = canonicalize_dast_title("Missing strict-transport-security header");
assert_eq!(a, b);
}
#[test]
fn exact_dedup_merges_identical_findings() {
let f1 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
let f2 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
let result = dedup_dast_findings(vec![f1, f2]);
assert_eq!(result.len(), 1, "exact duplicates should be merged");
}
#[test]
fn synonym_dedup_merges_hsts_variants() {
let f1 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
let f2 = make_dast(
"Missing HSTS header for example.com",
"https://example.com",
DastVulnType::TlsMisconfiguration,
);
let result = dedup_dast_findings(vec![f1, f2]);
assert_eq!(
result.len(),
1,
"HSTS synonym variants should merge to 1 finding"
);
}
#[test]
fn different_headers_not_merged() {
let mut f1 = make_dast(
"Missing x-content-type-options header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
f1.cwe = Some("CWE-16".into());
let mut f2 = make_dast(
"Missing permissions-policy header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
f2.cwe = Some("CWE-16".into());
// These share CWE-16 but are different headers — phase 2 will merge them
// since they share the same CWE+endpoint. This is acceptable because they
// have the same root cause (missing security headers configuration).
let result = dedup_dast_findings(vec![f1, f2]);
// CWE-based dedup will merge these into 1
assert!(
result.len() <= 2,
"same CWE+endpoint findings may be merged"
);
}
#[test]
fn different_endpoints_not_merged() {
let f1 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
let f2 = make_dast(
"Missing strict-transport-security header",
"https://other.com",
DastVulnType::SecurityHeaderMissing,
);
let result = dedup_dast_findings(vec![f1, f2]);
assert_eq!(result.len(), 2, "different endpoints should not merge");
}
#[test]
fn dedup_preserves_highest_severity() {
let f1 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
let mut f2 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
f2.severity = Severity::High;
let result = dedup_dast_findings(vec![f1, f2]);
assert_eq!(result.len(), 1);
assert_eq!(result[0].severity, Severity::High);
}
#[test]
fn dedup_merges_evidence() {
let mut f1 = make_dast(
"Missing strict-transport-security header",
"https://example.com",
DastVulnType::SecurityHeaderMissing,
);
f1.evidence
.push(compliance_core::models::dast::DastEvidence {
request_method: "GET".into(),
request_url: "https://example.com".into(),
request_headers: None,
request_body: None,
response_status: 200,
response_headers: None,
response_snippet: Some("pass 1".into()),
screenshot_path: None,
payload: None,
response_time_ms: None,
});
let mut f2 = f1.clone();
f2.evidence[0].response_snippet = Some("pass 2".into());
let result = dedup_dast_findings(vec![f1, f2]);
assert_eq!(result.len(), 1);
assert_eq!(result[0].evidence.len(), 2, "evidence should be merged");
}
}

View File

@@ -1,5 +1,6 @@
use compliance_core::models::*;
use super::dedup::compute_fingerprint;
use super::orchestrator::PipelineOrchestrator;
use crate::error::AgentError;
use crate::pipeline::code_review::CodeReviewScanner;
@@ -89,12 +90,37 @@ impl PipelineOrchestrator {
return Ok(());
}
// Dedup findings by fingerprint to avoid duplicate comments
Review

[medium] Complex boolean expression in finding deduplication

The fingerprint computation and deduplication logic uses a complex boolean expression with multiple unwraps and nested conditions that make it hard to reason about. The code also has duplicated fingerprint computation logic.

Suggested fix: Extract the fingerprint computation into a separate function and simplify the deduplication logic. Consider using a more functional approach with iterators and map/filter operations instead of manual loops.

*Scanner: code-review/complexity | *

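The extraction this finding suggests can be sketched as one fingerprint helper plus an iterator-based filter. `Finding` here is a simplified stand-in, and `fp_key` joins fields rather than hashing the way the real `compute_fingerprint` does:

```rust
use std::collections::HashSet;

// Simplified stand-in for the real Finding model.
#[derive(Debug, Clone)]
struct Finding {
    file_path: Option<String>,
    line_number: Option<u32>,
    title: String,
}

// Single place that builds the dedup key, so the loop below and the
// comment-body path cannot drift apart.
fn fp_key(repo_id: &str, pr_number: u64, f: &Finding) -> String {
    format!(
        "{repo_id}\u{1f}{pr_number}\u{1f}{}\u{1f}{}\u{1f}{}",
        f.file_path.as_deref().unwrap_or(""),
        f.line_number.unwrap_or(0),
        f.title
    )
}

// Iterator-based dedup: HashSet::insert returns false on repeats,
// so filter keeps only the first finding per fingerprint.
fn dedup_findings(findings: Vec<Finding>, repo_id: &str, pr_number: u64) -> Vec<Finding> {
    let mut seen = HashSet::new();
    findings
        .into_iter()
        .filter(|f| seen.insert(fp_key(repo_id, pr_number, f)))
        .collect()
}

fn main() {
    let f = Finding {
        file_path: Some("src/lib.rs".into()),
        line_number: Some(10),
        title: "dup finding".into(),
    };
    let out = dedup_findings(vec![f.clone(), f], "repo1", 48);
    assert_eq!(out.len(), 1);
    println!("ok");
}
```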
let mut seen_fps = std::collections::HashSet::new();
let mut unique_findings: Vec<&Finding> = Vec::new();
for finding in &pr_findings {
let fp = compute_fingerprint(&[
repo_id,
Review

[medium] Potential panic from unwrap in fingerprint computation

The code uses finding.file_path.as_deref().unwrap_or("") which could panic if file_path is None and the fallback is not properly handled. While it's unlikely to be None, using unwrap_or_default() or similar safer alternatives would prevent potential panics.

Suggested fix: Replace unwrap_or("") with unwrap_or_default() or handle the Option more explicitly to avoid potential panics.

*Scanner: code-review/convention | *

&pr_number.to_string(),
finding.file_path.as_deref().unwrap_or(""),
Review

[medium] Insecure Direct Object Reference in Finding Processing

The code processes findings based on file paths and line numbers without proper validation. If these values are derived from user input or external sources, they could potentially allow an attacker to access unintended files or lines through manipulation of the finding data.

Suggested fix: Validate and sanitize all inputs related to file paths and line numbers before processing them. Ensure that the file paths are within expected boundaries and that line numbers are valid for the given file.

Scanner: code-review/security | CWE: CWE-639

&finding.line_number.unwrap_or(0).to_string(),
&finding.title,
]);
Review

[high] Incorrect deduplication logic

The deduplication logic uses a HashSet to track fingerprints but then reassigns the pr_findings variable which shadows the original one. This means the deduplication only affects the local unique_findings vector, not the actual pr_findings that get processed later. The original pr_findings is never updated with unique_findings.

Suggested fix: Remove the line 'let pr_findings = unique_findings;' and instead modify the original pr_findings variable directly, or better yet, replace the entire loop with a more idiomatic approach using iterator methods.

*Scanner: code-review/logic | *

if seen_fps.insert(fp) {
unique_findings.push(finding);
Review

[medium] Duplicated fingerprint computation logic

The fingerprint computation logic is duplicated in two places (lines 93-101 and 105-113) which increases maintenance burden and risk of inconsistencies.

Suggested fix: Create a helper function that takes the required components and returns the fingerprint to eliminate code duplication.

*Scanner: code-review/complexity | *

}
}
Review

[medium] Potential Information Disclosure via Comment Fingerprint

The code includes a compliance fingerprint in PR review comments using <!-- compliance-fp:{fp} -->. If the fingerprint contains sensitive information or can be used to reconstruct internal state, it may lead to information disclosure. The fingerprint is computed from repository ID, PR number, file path, line number, and finding title.

Suggested fix: Review the fingerprint computation logic to ensure it does not include sensitive data. Consider using a hash of non-sensitive components only, and validate that the fingerprint cannot be reverse-engineered to expose internal system details.

Scanner: code-review/security | CWE: CWE-200

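On the disclosure concern: if the marker embeds only a hash of the components, the comment exposes a stable identifier but not the underlying values. A sketch using the standard library's `DefaultHasher` (the real `compute_fingerprint` may use a different hash), exercising the same `["a", "bc"]` versus `["ab", "c"]` distinctness the tests check:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// One-way fingerprint: only a hash digest lands in the comment marker,
// so file paths and titles are not recoverable from it.
fn compute_fingerprint(parts: &[&str]) -> String {
    let mut h = DefaultHasher::new();
    for p in parts {
        // str's Hash impl is prefix-free (it hashes a terminator too),
        // so ["a", "bc"] and ["ab", "c"] produce different digests.
        p.hash(&mut h);
    }
    format!("{:016x}", h.finish())
}

fn main() {
    assert_ne!(
        compute_fingerprint(&["a", "bc"]),
        compute_fingerprint(&["ab", "c"])
    );
    println!("marker: <!-- compliance-fp:{} -->", compute_fingerprint(&["repo1", "48"]));
}
```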
let pr_findings = unique_findings;
// Build review comments from findings
let mut review_comments = Vec::new();
for finding in &pr_findings {
if let (Some(path), Some(line)) = (&finding.file_path, finding.line_number) {
let fp = compute_fingerprint(&[
repo_id,
&pr_number.to_string(),
path,
&line.to_string(),
&finding.title,
]);
let comment_body = format!(
"**[{}] {}**\n\n{}\n\n*Scanner: {} | {}*",
"**[{}] {}**\n\n{}\n\n*Scanner: {} | {}*\n\n<!-- compliance-fp:{fp} -->",
finding.severity,
finding.title,
finding.description,
@@ -123,6 +149,17 @@ impl PipelineOrchestrator {
.join("\n"),
Review

[medium] Deep nesting in PR review handling

The PR review handling logic has multiple nested conditional blocks (if let chains and nested if statements) that make the control flow difficult to follow and increase cognitive load.

Suggested fix: Refactor the nested conditionals into early returns or extract the review creation logic into separate functions to flatten the control flow.

*Scanner: code-review/complexity | *

);
if review_comments.is_empty() {
Review

[medium] Inconsistent error handling in PR review pipeline

The code uses both tracing::warn! for logging and direct error propagation with ? operator inconsistently. The warning log is used for one case but not for others where errors might occur during PR review creation.

Suggested fix: Use consistent error handling pattern throughout the function. Either propagate all errors with ? or handle them uniformly with logging.

*Scanner: code-review/convention | *

// All findings were on files/lines we can't comment on inline
if let Err(e) = tracker
.create_pr_review(owner, tracker_repo_name, pr_number, &summary, Vec::new())
.await
{
tracing::warn!("[{repo_id}] Failed to post PR review summary: {e}");
}
return Ok(());
}
if let Err(e) = tracker
.create_pr_review(
owner,