feat: deduplicate code review findings across LLM passes #48
```diff
@@ -95,8 +95,8 @@ pub async fn export_session_report(
         Err(_) => Vec::new(),
     };

-    // Fetch DAST findings for this session
-    let findings: Vec<DastFinding> = match agent
+    // Fetch DAST findings for this session, then deduplicate
+    let raw_findings: Vec<DastFinding> = match agent
         .db
         .dast_findings()
         .find(doc! { "session_id": &id })
```
```diff
@@ -106,6 +106,14 @@ pub async fn export_session_report(
         Ok(cursor) => collect_cursor_async(cursor).await,
         Err(_) => Vec::new(),
     };
+    let raw_count = raw_findings.len();
+    let findings = crate::pipeline::dedup::dedup_dast_findings(raw_findings);
+    if findings.len() < raw_count {
+        tracing::info!(
```
sharang commented:

**[medium] Potential Information Disclosure Through Tracing Logs**
The code logs information about deduplicated findings including raw counts and final counts. If sensitive data is included in these logs, it could expose internal system details or potentially sensitive findings to unauthorized parties.
Suggested fix: Ensure that logging does not include sensitive information from findings. Consider using structured logging with configurable log levels to prevent accidental exposure of sensitive data.
*Scanner: code-review/security | CWE: CWE-200*
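One low-risk way to honor this is to build the log message from counts alone. A minimal sketch (the helper name is illustrative, not from the PR):

```rust
// Hypothetical helper: the log line is built from counts only, never from
// finding titles, endpoints, or evidence, so nothing sensitive can leak
// regardless of the configured log level.
fn dedup_log_line(session_id: &str, raw: usize, kept: usize) -> String {
    format!("Deduped DAST findings for session {session_id}: {raw} -> {kept}")
}

fn main() {
    let line = dedup_log_line("abc123", 12, 7);
    // The message carries no finding content, only aggregate counts.
    assert!(line.contains("12 -> 7"));
    println!("{line}");
}
```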
```diff
+            "Deduped DAST findings for session {id}: {raw_count} → {}",
+            findings.len()
+        );
+    }

     // Fetch SAST findings, SBOM, and code context for the linked repository
     let repo_id = session
```
```diff
@@ -321,9 +321,38 @@ impl PentestOrchestrator {
                 total_findings += findings_count;

                 let mut finding_ids: Vec<String> = Vec::new();
-                for mut finding in result.findings {
+                // Dedup findings within this tool result before inserting
+                let deduped_findings =
+                    crate::pipeline::dedup::dedup_dast_findings(
+                        result.findings,
+                    );
+                for mut finding in deduped_findings {
                     finding.scan_run_id = session_id.clone();
                     finding.session_id = Some(session_id.clone());

+                    // Check for existing duplicate in this session
+                    let fp = crate::pipeline::dedup::compute_dast_fingerprint(
+                        &finding,
+                    );
+                    let existing = self
+                        .db
+                        .dast_findings()
+                        .find_one(doc! {
```
sharang commented:

**[medium] Insecure Deduplication Logic in DAST Findings**
The deduplication logic for DAST findings relies on a combination of title, endpoint, and method fields which may not be sufficient to uniquely identify findings. This could lead to false negatives where actual duplicates are not detected due to minor variations in these fields.
Suggested fix: Enhance the deduplication criteria by incorporating additional unique identifiers such as vulnerability signature hashes or content-based fingerprints that are more robust than just title, endpoint, and method.
*Scanner: code-review/security | CWE: CWE-400*
```diff
+                            "session_id": &session_id,
+                            "title": &finding.title,
+                            "endpoint": &finding.endpoint,
+                            "method": &finding.method,
+                        })
```
sharang commented:

**[high] Potential race condition in duplicate checking**
In `compliance-agent/src/pentest/orchestrator.rs`, the duplicate check for DAST findings is performed before insertion but not atomically with the insertion. This could lead to race conditions where two concurrent processes might both pass the duplicate check and insert the same finding.
Suggested fix: Use a database-level unique constraint or atomic upsert operation instead of separate find and insert operations to prevent race conditions.
*Scanner: code-review/logic*
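The race disappears when the membership check and the write are one operation rather than two. A database-level unique index on the fingerprint gives exactly that; the shape of the fix can be sketched in-process with `HashSet::insert`, whose return value reports whether the key was already present (a std-only analogy, not the mongodb driver API):

```rust
use std::collections::HashSet;

// Check-then-insert as a single step: `insert` returns false when the
// fingerprint was already present, so nothing can slip in between a
// separate `contains` check and the write.
fn try_record(seen: &mut HashSet<String>, fingerprint: &str) -> bool {
    seen.insert(fingerprint.to_string())
}

fn main() {
    let mut seen = HashSet::new();
    assert!(try_record(&mut seen, "fp-1")); // first writer wins
    assert!(!try_record(&mut seen, "fp-1")); // duplicate rejected in the same call
}
```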
sharang commented:

**[medium] Potential performance issue in PentestOrchestrator with redundant database queries**
The code performs a database query for each finding to check for duplicates, which could be inefficient when processing many findings. Consider batching these checks or using a more efficient deduplication strategy.
Suggested fix: Consider optimizing duplicate detection by either batching database queries or pre-loading existing findings into memory for faster lookup.
*Scanner: code-review/convention*
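The pre-loading variant could look like this: fetch the session's existing fingerprints once, then filter in memory. A sketch with plain collections standing in for the database (`existing` would come from one batched query):

```rust
use std::collections::HashSet;

// Keep only findings whose fingerprint is not already recorded for the
// session. `incoming` pairs a fingerprint with its finding title.
fn filter_new(
    existing: &HashSet<String>,
    incoming: Vec<(String, String)>,
) -> Vec<(String, String)> {
    incoming
        .into_iter()
        .filter(|(fp, _)| !existing.contains(fp))
        .collect()
}

fn main() {
    let existing = HashSet::from(["fp-a".to_string()]);
    let incoming = vec![
        ("fp-a".to_string(), "duplicate".to_string()),
        ("fp-b".to_string(), "new finding".to_string()),
    ];
    let fresh = filter_new(&existing, incoming);
    assert_eq!(fresh.len(), 1);
    assert_eq!(fresh[0].1, "new finding");
}
```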
```diff
+                        .await;
+                    if matches!(existing, Ok(Some(_))) {
```
sharang commented:

**[medium] Complex boolean expression in duplicate checking logic**
The condition `if finding.severity > existing.severity || (finding.severity == existing.severity && finding.description.len() > existing.description.len())` in orchestrator.rs is complex and hard to reason about. It combines multiple comparisons that could be simplified into a helper function.
Suggested fix: Extract the severity comparison logic into a separate function like `should_replace_finding` to improve readability and maintainability.
*Scanner: code-review/complexity*
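The suggested extraction might look like this — a sketch with illustrative types, not the crate's actual `Finding` or `Severity`:

```rust
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Severity { Low, Medium, High }

struct Finding { severity: Severity, description: String }

/// Prefer the higher severity; on a tie, prefer the more detailed description.
fn should_replace_finding(candidate: &Finding, existing: &Finding) -> bool {
    candidate.severity > existing.severity
        || (candidate.severity == existing.severity
            && candidate.description.len() > existing.description.len())
}

fn main() {
    let short_low = Finding { severity: Severity::Low, description: "short".into() };
    let high = Finding { severity: Severity::High, description: "x".into() };
    assert!(should_replace_finding(&high, &short_low));
    assert!(!should_replace_finding(&short_low, &high));
    // Tie on severity: the longer description wins.
    let detailed_low = Finding { severity: Severity::Low, description: "much more detail".into() };
    assert!(should_replace_finding(&detailed_low, &short_low));
}
```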
```diff
+                        tracing::debug!(
+                            "Skipping duplicate DAST finding: {} (fp={:.12})",
+                            finding.title,
+                            fp,
+                        );
+                        continue;
+                    }

                     let insert_result =
                         self.db.dast_findings().insert_one(&finding).await;
                     if let Ok(res) = &insert_result {
```
```diff
@@ -66,8 +66,10 @@ impl CodeReviewScanner {
             }
         }

+        let deduped = dedup_cross_pass(all_findings);
+
         ScanOutput {
-            findings: all_findings,
+            findings: deduped,
             sbom_entries: Vec::new(),
         }
     }
```
sharang commented:

**[medium] Inconsistent error handling in dedup_cross_pass function**
The `dedup_cross_pass` function uses `unwrap_or` which can lead to unexpected behavior when file paths are None. This pattern should be handled more explicitly to avoid potential panics or silent data loss.
Suggested fix: Replace `f.file_path.as_deref().unwrap_or("")` with explicit error handling or a default value that makes sense for the deduplication logic, such as using a placeholder string like "unknown_file".
*Scanner: code-review/convention*
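The suggested placeholder is a one-line change; sketched here as a helper so the default is named once (helper name is illustrative):

```rust
// A named default instead of a silent empty string: findings with no file
// path all group under one explicit bucket.
fn dedup_file_key(file_path: Option<&str>) -> &str {
    file_path.unwrap_or("unknown_file")
}

fn main() {
    assert_eq!(dedup_file_key(Some("src/main.rs")), "src/main.rs");
    assert_eq!(dedup_file_key(None), "unknown_file");
}
```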
```diff
@@ -184,3 +186,51 @@ struct ReviewIssue {
     #[serde(default)]
     suggestion: Option<String>,
 }
+
```
sharang commented:

**[low] Missing type annotation for dedup_cross_pass parameter**
The function `dedup_cross_pass` takes a parameter `findings` without explicit type annotation, which could lead to confusion about expected input types and make refactoring harder.
Suggested fix: Add explicit type annotation to parameter: `findings: Vec<Finding>`
*Scanner: code-review/convention*
sharang commented:

**[high] Complex deduplication logic in cross-pass code review**
The `dedup_cross_pass` function contains complex logic for building deduplication keys and comparing findings. The combination of string formatting, word normalization, and multi-criteria comparison makes this function hard to understand and maintain.
Suggested fix: Break down the deduplication logic into smaller helper functions: one for generating the dedup key, another for comparing findings by severity/description length, and a third for merging CWE information.
*Scanner: code-review/complexity*
```diff
+/// Deduplicate findings across review passes.
+///
+/// Multiple passes often flag the same issue (e.g. SQL injection reported by
+/// logic, security, and convention passes). We group by file + nearby line +
+/// normalized title keywords and keep the highest-severity finding.
```
sharang commented:

**[medium] Weak Deduplication Key Construction in Cross-Pass Code Review**
The deduplication key construction in `dedup_cross_pass` function uses a simple line bucketing approach (`line_number / 4`) which might incorrectly group unrelated findings or fail to detect true duplicates when line numbers vary slightly. Additionally, the normalization of titles may not handle all edge cases properly.
Suggested fix: Improve the deduplication key generation by using more sophisticated techniques like fuzzy matching or semantic analysis to better identify truly duplicate findings regardless of minor differences in formatting or location.
*Scanner: code-review/security | CWE: CWE-400*
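One lightweight step in the suggested direction, short of semantic analysis, is scoring titles by word overlap instead of demanding an exact normalized key. A sketch (threshold and helper are illustrative, not part of the PR):

```rust
use std::collections::HashSet;

// Jaccard similarity over normalized title words: 1.0 for the same word set
// in any order, 0.0 for disjoint sets. Pairs above a chosen threshold could
// be treated as duplicates even when formatting differs.
fn title_similarity(a: &str, b: &str) -> f64 {
    let words = |s: &str| -> HashSet<String> {
        s.to_lowercase()
            .split(|c: char| !c.is_alphanumeric())
            .filter(|w| w.len() > 2)
            .map(str::to_string)
            .collect()
    };
    let (wa, wb) = (words(a), words(b));
    let union = wa.union(&wb).count();
    if union == 0 {
        return 0.0;
    }
    wa.intersection(&wb).count() as f64 / union as f64
}

fn main() {
    // Word order does not matter.
    assert_eq!(title_similarity("Missing HSTS header", "header HSTS missing"), 1.0);
    // Unrelated titles score zero.
    assert_eq!(title_similarity("SQL injection in login", "XSS in search"), 0.0);
}
```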
```diff
+fn dedup_cross_pass(findings: Vec<Finding>) -> Vec<Finding> {
+    use std::collections::HashMap;
+
```
sharang commented:

**[medium] Off-by-one error in line bucket calculation**
In `compliance-agent/src/pipeline/code_review.rs`, the line bucket calculation uses `f.line_number.unwrap_or(0) / 4` which will put line 0, 1, 2, 3 into bucket 0, but line 4 will be in bucket 1. This means that lines 3 and 4 are considered different buckets when they should probably be grouped together for deduplication purposes.
Suggested fix: Change the line bucket calculation to `(f.line_number.unwrap_or(0) + 3) / 4` to properly group lines within 3 of each other.
*Scanner: code-review/logic*
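The arithmetic is easy to check, though note that any fixed-width bucketing has a boundary somewhere — the suggested offset moves the split from lines 3/4 to lines 0/1 rather than removing it:

```rust
// The PR's bucketing and the suggested offset variant, side by side.
fn bucket(line: u32) -> u32 {
    line / 4
}

fn bucket_shifted(line: u32) -> u32 {
    (line + 3) / 4
}

fn main() {
    // Original: adjacent lines 3 and 4 land in different buckets.
    assert_eq!(bucket(3), 0);
    assert_eq!(bucket(4), 1);
    // Shifted: 3 and 4 now share a bucket...
    assert_eq!(bucket_shifted(3), bucket_shifted(4));
    // ...but the boundary has only moved: 0 and 1 split instead.
    assert_ne!(bucket_shifted(0), bucket_shifted(1));
}
```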
```diff
+    // Build a dedup key: (file, line bucket, normalized title words)
+    fn dedup_key(f: &Finding) -> String {
+        let file = f.file_path.as_deref().unwrap_or("");
+        // Group lines within 3 of each other
+        let line_bucket = f.line_number.unwrap_or(0) / 4;
+        // Normalize: lowercase, keep only alphanumeric, sort words for order-independence
+        let title_lower = f.title.to_lowercase();
+        let mut words: Vec<&str> = title_lower
+            .split(|c: char| !c.is_alphanumeric())
+            .filter(|w| w.len() > 2)
+            .collect();
+        words.sort();
+        format!("{file}:{line_bucket}:{}", words.join(","))
+    }
+
+    let mut groups: HashMap<String, Finding> = HashMap::new();
+
+    for finding in findings {
+        let key = dedup_key(&finding);
```
sharang commented:

**[high] Incorrect deduplication logic in cross-pass deduplication**
In `compliance-agent/src/pipeline/code_review.rs`, the deduplication function compares descriptions by length rather than content. This means that a finding with a longer description will overwrite a finding with a shorter one even if the shorter one has higher severity, which is incorrect behavior.
Suggested fix: Fix the comparison logic to prioritize severity first, then consider description length only as a tiebreaker, or better yet, compare actual description content for meaningful deduplication.
*Scanner: code-review/logic*
```diff
+        groups
+            .entry(key)
+            .and_modify(|existing| {
+                // Keep the higher severity; on tie, keep the one with more detail
+                if finding.severity > existing.severity
+                    || (finding.severity == existing.severity
+                        && finding.description.len() > existing.description.len())
+                {
+                    *existing = finding.clone();
+                }
+                // Merge CWE if the existing one is missing it
+                if existing.cwe.is_none() {
+                    existing.cwe = finding.cwe.clone();
+                }
+            })
+            .or_insert(finding);
+    }
+
+    groups.into_values().collect()
+}
```
```diff
@@ -1,5 +1,7 @@
 use sha2::{Digest, Sha256};

+use compliance_core::models::dast::DastFinding;
+
 pub fn compute_fingerprint(parts: &[&str]) -> String {
     let mut hasher = Sha256::new();
     for part in parts {
@@ -9,9 +11,209 @@ pub fn compute_fingerprint(parts: &[&str]) -> String {
     hex::encode(hasher.finalize())
 }

+/// Compute a dedup fingerprint for a DAST finding.
```
sharang commented:

**[medium] Insecure Hash Usage for Deduplication**
While SHA256 is cryptographically secure, using it for deduplication purposes without proper input sanitization can lead to issues. The function computes fingerprints based on user-provided data (titles, endpoints, methods) which could potentially be crafted to cause hash collisions or other issues if not properly validated.
Suggested fix: Ensure that the inputs to compute_fingerprint are properly sanitized and validated before hashing. Consider adding input length limits to prevent potential DoS attacks through excessively long inputs.
*Scanner: code-review/security | CWE: CWE-327*
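The suggested length cap can be applied per part before hashing. A std-only sketch that cuts on a char boundary so multi-byte UTF-8 input never panics (the limit value is an arbitrary illustration):

```rust
// Return at most `max_chars` characters of `part`, never splitting a
// multi-byte character. Applying this to each fingerprint input bounds the
// work an oversized title or endpoint can force.
fn cap_for_fingerprint(part: &str, max_chars: usize) -> &str {
    match part.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &part[..byte_idx],
        None => part, // already short enough
    }
}

fn main() {
    assert_eq!(cap_for_fingerprint("abcdef", 3), "abc");
    assert_eq!(cap_for_fingerprint("ab", 5), "ab");
    // Multi-byte safe: 'é' is two bytes but one char.
    assert_eq!(cap_for_fingerprint("héllo", 2), "hé");
}
```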
```diff
+///
+/// The key is derived from the *canonicalized* title (lowercased, domain names
+/// stripped, known synonyms resolved), endpoint, and HTTP method. This lets us
+/// detect both exact duplicates (same tool reporting twice across passes) and
+/// semantic duplicates (e.g., `security_header_missing` "Missing HSTS header"
+/// vs `tls_misconfiguration` "Missing strict-transport-security header").
+pub fn compute_dast_fingerprint(f: &DastFinding) -> String {
+    let canon = canonicalize_dast_title(&f.title);
+    let endpoint = f.endpoint.to_lowercase().trim_end_matches('/').to_string();
+    let method = f.method.to_uppercase();
+    let param = f.parameter.as_deref().unwrap_or("");
+    compute_fingerprint(&[&canon, &endpoint, &method, param])
+}
+
+/// Canonicalize a DAST finding title for dedup purposes.
+///
+/// 1. Lowercase
+/// 2. Strip domain names / URLs (e.g. "for comp-dev.meghsakha.com")
+/// 3. Resolve known header synonyms (hsts ↔ strict-transport-security, etc.)
+/// 4. Strip extra whitespace
+fn canonicalize_dast_title(title: &str) -> String {
+    let mut s = title.to_lowercase();
+
+    // Strip "for <domain>" or "on <domain>" suffixes
```
sharang commented:

**[high] Incorrect string truncation in canonicalize_dast_title**
The canonicalize_dast_title function uses s.truncate(idx) which removes everything from index idx onwards, but this can leave trailing spaces or break word boundaries. For example, 'test for example.com' becomes 'test ' instead of 'test'.
Suggested fix: Replace s.truncate(idx) with s.truncate(idx.saturating_sub(1)) or better yet, use a more robust approach that properly handles whitespace around the truncated portion.
*Scanner: code-review/logic*
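For what it's worth, the final `split_whitespace` collapse in `canonicalize_dast_title` already removes a trailing space; if the trim is wanted at the cut site instead, it is a one-liner (sketch, helper name illustrative):

```rust
// Cut the string at `idx`, then drop any whitespace the cut exposed.
fn strip_suffix_at(s: &str, idx: usize) -> String {
    s[..idx].trim_end().to_string()
}

fn main() {
    // `s.find(" for ")` on this input returns 4, pointing at the space,
    // so the naive cut would yield "test" here; the trim matters when the
    // match index lands after whitespace.
    assert_eq!(strip_suffix_at("test for example.com", 4), "test");
    assert_eq!(strip_suffix_at("test  for example.com", 5), "test");
}
```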
sharang commented:

**[medium] Insecure String Manipulation in Canonicalization**
The canonicalize_dast_title function uses manual string manipulation and pattern matching that could be vulnerable to injection or unexpected behavior. Specifically, the domain stripping logic relies on simple substring checks which may not properly handle all edge cases, potentially leading to incorrect deduplication or information leakage.
Suggested fix: Use more robust parsing techniques or regular expressions for domain/URL detection instead of simple substring matching. Consider using a dedicated URL parsing library to properly identify and strip domains.
*Scanner: code-review/security | CWE: CWE-1104*
```diff
+    // Pattern: "for <word.word...>" or "on <method> <url>"
+    if let Some(idx) = s.find(" for ") {
+        // Check if what follows looks like a domain or URL
+        let rest = &s[idx + 5..];
+        if rest.contains('.') || rest.starts_with("http") {
+            s.truncate(idx);
+        }
+    }
+    if let Some(idx) = s.find(" on ") {
+        let rest = &s[idx + 4..];
+        if rest.contains("http") || rest.contains('/') {
+            s.truncate(idx);
+        }
+    }
+
+    // Resolve known header synonyms
```
sharang commented:

**[high] Potential index out of bounds in canonicalize_dast_title**
In canonicalize_dast_title, when checking for synonyms, the code accesses bytes at position pos + short.len() without ensuring that pos + short.len() is within bounds of the string. This could cause a panic if short is near the end of the string.
Suggested fix: Add bounds checking before accessing s.as_bytes()[pos + short.len()] to prevent potential panics.
*Scanner: code-review/logic*
```diff
+    let synonyms: &[(&str, &str)] = &[
```
sharang commented:

**[high] Potential panic in `canonicalize_dast_title` due to unchecked indexing**
In `canonicalize_dast_title`, there's a potential panic when accessing `s.as_bytes()[pos - 1]` without ensuring `pos > 0`. If `pos` is 0, this would access memory before the string, causing undefined behavior.
Suggested fix: Add a bounds check before accessing `s.as_bytes()[pos - 1]` to ensure `pos > 0` before dereferencing.
*Scanner: code-review/convention*
```diff
+        ("hsts", "strict-transport-security"),
+        ("csp", "content-security-policy"),
+        ("cors", "cross-origin-resource-sharing"),
```
sharang commented:

**[medium] Complex boolean expression in canonicalize_dast_title**
The canonicalize_dast_title function contains complex boolean logic for checking word boundaries when replacing synonyms. The condition `before_ok && after_ok` involves multiple checks that are hard to follow and maintain.
Suggested fix: Extract the boundary checking logic into a separate helper function to improve readability and maintainability.
*Scanner: code-review/complexity*
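The extraction the comment asks for is mechanical — both checks move into one named predicate (a sketch; the helper name is illustrative):

```rust
// True when the `len`-byte match at `pos` in `s` is not glued to an
// alphanumeric neighbour on either side, i.e. it is a whole word.
fn is_whole_word(s: &str, pos: usize, len: usize) -> bool {
    let bytes = s.as_bytes();
    let before_ok = pos == 0 || !bytes[pos - 1].is_ascii_alphanumeric();
    let after_ok = pos + len >= s.len() || !bytes[pos + len].is_ascii_alphanumeric();
    before_ok && after_ok
}

fn main() {
    // "hsts" starts at byte 8 and is bounded by spaces.
    assert!(is_whole_word("missing hsts header", 8, 4));
    // "hsts" inside "xhsts" is not a whole word.
    assert!(!is_whole_word("xhsts header", 1, 4));
}
```

The replacement site then reads `if is_whole_word(&s, pos, short.len()) { … }`.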
```diff
+        ("xfo", "x-frame-options"),
+    ];
+    for &(short, canonical) in synonyms {
+        // Only replace whole words — check boundaries
+        if let Some(pos) = s.find(short) {
+            let before_ok = pos == 0 || !s.as_bytes()[pos - 1].is_ascii_alphanumeric();
+            let after_ok = pos + short.len() >= s.len()
+                || !s.as_bytes()[pos + short.len()].is_ascii_alphanumeric();
+            if before_ok && after_ok {
+                s = format!("{}{}{}", &s[..pos], canonical, &s[pos + short.len()..]);
+            }
+        }
+    }
+
+    // Collapse whitespace
+    s.split_whitespace().collect::<Vec<_>>().join(" ")
+}
+
+/// Deduplicate a list of DAST findings, merging evidence from duplicates.
+///
+/// Two-phase approach:
+/// 1. **Exact dedup** — group by canonicalized `(title, endpoint, method, parameter)`.
+///    Merge evidence arrays, keep the highest severity, preserve exploitable flag.
+/// 2. **CWE-based dedup** — within the same `(cwe, endpoint, method)` group, merge
+///    findings whose canonicalized titles resolve to the same subject (e.g., HSTS
+///    reported as both `security_header_missing` and `tls_misconfiguration`).
+pub fn dedup_dast_findings(findings: Vec<DastFinding>) -> Vec<DastFinding> {
+    use std::collections::HashMap;
+
+    if findings.len() <= 1 {
+        return findings;
+    }
+
+    // Phase 1: exact fingerprint dedup
+    let mut seen: HashMap<String, usize> = HashMap::new();
+    let mut deduped: Vec<DastFinding> = Vec::new();
+
+    for finding in findings {
+        let fp = compute_dast_fingerprint(&finding);
+
+        if let Some(&idx) = seen.get(&fp) {
+            // Merge into existing
+            merge_dast_finding(&mut deduped[idx], &finding);
+        } else {
+            seen.insert(fp, deduped.len());
+            deduped.push(finding);
+        }
+    }
+
+    let before = deduped.len();
+
+    // Phase 2: CWE-based related dedup
+    // Group by (cwe, endpoint_normalized, method) — only when CWE is present
+    let mut cwe_groups: HashMap<String, Vec<usize>> = HashMap::new();
+    for (i, f) in deduped.iter().enumerate() {
+        if let Some(ref cwe) = f.cwe {
```
sharang commented:

**[high] Large function with multiple responsibilities**
The dedup_dast_findings function is quite large (over 100 lines) and handles multiple distinct responsibilities: exact deduplication, CWE-based deduplication, merging logic, and result processing. This makes it hard to test and maintain.
Suggested fix: Break down this function into smaller, focused functions such as 'perform_exact_dedup', 'perform_cwe_dedup', and 'merge_findings' to improve modularity and testability.
*Scanner: code-review/complexity*
```diff
+            let key = format!(
+                "{}|{}|{}",
+                cwe,
+                f.endpoint.to_lowercase().trim_end_matches('/'),
+                f.method.to_uppercase(),
+            );
```
sharang commented:

**[medium] Deeply nested control flow in dedup_dast_findings**
The dedup_dast_findings function has deeply nested control flow with multiple nested if statements and match expressions, making it difficult to follow the execution path and increasing the risk of logic errors.
Suggested fix: Refactor the nested conditions into early returns or separate helper functions to flatten the control flow and improve readability.
*Scanner: code-review/complexity*
```diff
+            cwe_groups.entry(key).or_default().push(i);
+        }
+    }
+
+    // For each CWE group with multiple findings, keep the one with highest severity
+    // and most evidence, merge the rest into it
+    let mut merge_map: HashMap<usize, Vec<usize>> = HashMap::new();
+    let mut remove_indices: Vec<usize> = Vec::new();
+
+    for indices in cwe_groups.values() {
+        if indices.len() <= 1 {
+            continue;
+        }
+        // Find the "primary" finding: highest severity, then most evidence, then longest description
+        let Some(&primary_idx) = indices.iter().max_by(|&&a, &&b| {
+            deduped[a]
+                .severity
+                .cmp(&deduped[b].severity)
+                .then_with(|| deduped[a].evidence.len().cmp(&deduped[b].evidence.len()))
+                .then_with(|| {
```
sharang commented:

**[medium] Inconsistent error handling in dedup_dast_findings**
The `dedup_dast_findings` function uses `unwrap_or` in one place but has explicit `if let Some(...)` checks elsewhere. This inconsistency can lead to confusion about how errors are handled and might mask potential issues in the deduplication logic.
Suggested fix: Use consistent error handling patterns throughout the function. Prefer explicit `match` expressions or `?` operators for better clarity and error propagation.
*Scanner: code-review/convention*
```diff
+                    deduped[a]
+                        .description
+                        .len()
+                        .cmp(&deduped[b].description.len())
+                })
+        }) else {
+            continue;
+        };
+
+        for &idx in indices {
+            if idx != primary_idx {
+                remove_indices.push(idx);
+                merge_map.entry(primary_idx).or_default().push(idx);
+            }
+        }
+    }
+
+    if !remove_indices.is_empty() {
+        remove_indices.sort_unstable();
+        remove_indices.dedup();
+
+        // Merge evidence
+        for (&primary, secondaries) in &merge_map {
+            let extra_evidence: Vec<_> = secondaries
+                .iter()
+                .flat_map(|&i| deduped[i].evidence.clone())
+                .collect();
+            let any_exploitable = secondaries.iter().any(|&i| deduped[i].exploitable);
+
+            deduped[primary].evidence.extend(extra_evidence);
+            if any_exploitable {
+                deduped[primary].exploitable = true;
+            }
+        }
+
+        // Remove merged findings (iterate in reverse to preserve indices)
+        for &idx in remove_indices.iter().rev() {
+            deduped.remove(idx);
+        }
+    }
+
+    let after = deduped.len();
+    if before != after {
+        tracing::debug!(
+            "DAST CWE-based dedup: {before} → {after} findings ({} merged)",
+            before - after
+        );
+    }
+
+    deduped
+}
+
+/// Merge a duplicate DAST finding into a primary one.
+fn merge_dast_finding(primary: &mut DastFinding, duplicate: &DastFinding) {
+    primary.evidence.extend(duplicate.evidence.clone());
+    if duplicate.severity > primary.severity {
+        primary.severity = duplicate.severity.clone();
+    }
+    if duplicate.exploitable {
+        primary.exploitable = true;
+    }
+    // Keep the longer/better description
+    if duplicate.description.len() > primary.description.len() {
+        primary.description.clone_from(&duplicate.description);
+    }
+    // Keep remediation if primary doesn't have one
+    if primary.remediation.is_none() && duplicate.remediation.is_some() {
+        primary.remediation.clone_from(&duplicate.remediation);
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
+    use compliance_core::models::dast::DastVulnType;
     use compliance_core::models::finding::Severity;

     #[test]
     fn fingerprint_is_deterministic() {
@@ -55,4 +257,159 @@ mod tests {
         let b = compute_fingerprint(&["a", "bc"]);
         assert_ne!(a, b);
     }

+    fn make_dast(title: &str, endpoint: &str, vuln_type: DastVulnType) -> DastFinding {
+        let mut f = DastFinding::new(
+            "run1".into(),
+            "target1".into(),
+            vuln_type,
+            title.into(),
+            format!("Description for {title}"),
+            Severity::Medium,
+            endpoint.into(),
+            "GET".into(),
+        );
+        f.cwe = Some("CWE-319".into());
+        f
+    }
+
+    #[test]
+    fn canonicalize_strips_domain_suffix() {
+        let canon = canonicalize_dast_title("Missing HSTS header for comp-dev.meghsakha.com");
```
sharang
commented
[medium] Incorrect severity comparison in merge_dast_finding In merge_dast_finding, the condition 'if duplicate.severity > primary.severity' assumes that Severity implements Ord correctly and that the comparison works as expected. If Severity does not implement Ord properly or has custom ordering, this could lead to incorrect behavior. Suggested fix: Verify that Severity implements Ord correctly and add explicit debug assertions to ensure the comparison behaves as expected. *Scanner: code-review/logic | * **[medium] Incorrect severity comparison in merge_dast_finding**
In merge_dast_finding, the condition 'if duplicate.severity > primary.severity' assumes that Severity implements Ord correctly and that the comparison works as expected. If Severity does not implement Ord properly or has custom ordering, this could lead to incorrect behavior.
Suggested fix: Verify that Severity implements Ord correctly and add explicit debug assertions to ensure the comparison behaves as expected.
*Scanner: code-review/logic | *
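The review note above hinges on how `Ord` is derived for the severity enum. In Rust, `#[derive(Ord)]` on a field-less enum orders variants by declaration order, so the `duplicate.severity > primary.severity` check is sound only if variants are declared from least to most severe. A minimal sketch; the variant names below are assumptions, not the crate's actual definition:

```rust
// Assumed variant set; derived Ord follows declaration order (least first).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Low,
    Medium,
    High,
    Critical,
}

fn main() {
    // Mirrors the merge check: keep the higher of the two severities.
    let (primary, duplicate) = (Severity::Medium, Severity::High);
    let merged = if duplicate > primary { duplicate } else { primary };
    assert_eq!(merged, Severity::High);
    assert!(Severity::Critical > Severity::Info);
    println!("ok");
}
```

Reordering the variant declarations would silently invert the comparison, which is what a debug assertion in `merge_dast_finding` would catch.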
        assert!(!canon.contains("meghsakha"), "domain should be stripped");
        assert!(
            canon.contains("strict-transport-security"),
            "hsts should be resolved: {canon}"
        );
    }

sharang commented
**[medium] Potential panic in merge_dast_finding due to clone usage**
In merge_dast_finding, the line `primary.description.clone_from(&duplicate.description)` could panic if the description field is not properly initialized or if there's an issue with the clone operation. However, more importantly, the comparison logic in dedup_dast_findings might not handle all edge cases correctly.
Suggested fix: Ensure that description fields are always initialized before calling clone_from, and consider adding defensive checks.
*Scanner: code-review/logic |*

sharang commented
**[medium] Inconsistent use of `clone()` in merge_dast_finding**
The `merge_dast_finding` function uses `clone()` on `duplicate.description` and `duplicate.remediation`, but not on other fields like `duplicate.evidence`. This creates an inconsistent pattern where some fields are cloned while others are moved, potentially leading to performance issues or unexpected behavior.
Suggested fix: Use `clone()` consistently for all fields that need to be copied, or use `clone_from` for fields that are already owned. Consider using `std::mem::replace` or similar patterns for more efficient moves.
*Scanner: code-review/convention |*
    #[test]
    fn canonicalize_resolves_synonyms() {
        let a = canonicalize_dast_title("Missing HSTS header");
        let b = canonicalize_dast_title("Missing strict-transport-security header");
        assert_eq!(a, b);
    }
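The two tests above pin down the behavior expected of `canonicalize_dast_title`: strip a trailing host clause and fold synonyms like "HSTS" onto one canonical header name. A minimal sketch under those assumptions; the real implementation and its synonym table are not shown in this diff:

```rust
// Hypothetical stand-in for canonicalize_dast_title: lowercase, drop a
// trailing " for <host>" clause, then fold known synonyms.
fn canonicalize_title(title: &str) -> String {
    let lowered = title.to_lowercase();
    // Scanner titles are assumed to append the host as " for <host>".
    let base = match lowered.find(" for ") {
        Some(i) => lowered[..i].to_string(),
        None => lowered,
    };
    // One-entry synonym table, for illustration only.
    base.replace("hsts", "strict-transport-security")
}

fn main() {
    let a = canonicalize_title("Missing HSTS header for comp-dev.meghsakha.com");
    let b = canonicalize_title("Missing strict-transport-security header");
    assert_eq!(a, b);
    assert!(!a.contains("meghsakha"));
    println!("ok");
}
```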
    #[test]
    fn exact_dedup_merges_identical_findings() {
        let f1 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let f2 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let result = dedup_dast_findings(vec![f1, f2]);
        assert_eq!(result.len(), 1, "exact duplicates should be merged");
    }
    #[test]
    fn synonym_dedup_merges_hsts_variants() {
        let f1 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let f2 = make_dast(
            "Missing HSTS header for example.com",
            "https://example.com",
            DastVulnType::TlsMisconfiguration,
        );
        let result = dedup_dast_findings(vec![f1, f2]);
        assert_eq!(
            result.len(),
            1,
            "HSTS synonym variants should merge to 1 finding"
        );
    }
    #[test]
    fn different_headers_not_merged() {
        let mut f1 = make_dast(
            "Missing x-content-type-options header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        f1.cwe = Some("CWE-16".into());
        let mut f2 = make_dast(
            "Missing permissions-policy header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        f2.cwe = Some("CWE-16".into());
        // These share CWE-16 but are different headers — phase 2 will merge them
        // since they share the same CWE+endpoint. This is acceptable because they
        // have the same root cause (missing security headers configuration).
        let result = dedup_dast_findings(vec![f1, f2]);
        // CWE-based dedup will merge these into 1
        assert!(
            result.len() <= 2,
            "same CWE+endpoint findings may be merged"
        );
    }
    #[test]
    fn different_endpoints_not_merged() {
        let f1 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let f2 = make_dast(
            "Missing strict-transport-security header",
            "https://other.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let result = dedup_dast_findings(vec![f1, f2]);
        assert_eq!(result.len(), 2, "different endpoints should not merge");
    }
    #[test]
    fn dedup_preserves_highest_severity() {
        let f1 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        let mut f2 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        f2.severity = Severity::High;
        let result = dedup_dast_findings(vec![f1, f2]);
        assert_eq!(result.len(), 1);
        assert_eq!(result[0].severity, Severity::High);
    }
    #[test]
    fn dedup_merges_evidence() {
        let mut f1 = make_dast(
            "Missing strict-transport-security header",
            "https://example.com",
            DastVulnType::SecurityHeaderMissing,
        );
        f1.evidence
            .push(compliance_core::models::dast::DastEvidence {
                request_method: "GET".into(),
                request_url: "https://example.com".into(),
                request_headers: None,
                request_body: None,
                response_status: 200,
                response_headers: None,
                response_snippet: Some("pass 1".into()),
                screenshot_path: None,
                payload: None,
                response_time_ms: None,
            });
        let mut f2 = f1.clone();
        f2.evidence[0].response_snippet = Some("pass 2".into());

        let result = dedup_dast_findings(vec![f1, f2]);
        assert_eq!(result.len(), 1);
        assert_eq!(result[0].evidence.len(), 2, "evidence should be merged");
    }
}
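The `dedup_merges_evidence` test implies that the merge step concatenates evidence vectors rather than picking one. A self-contained sketch of that behavior with simplified stand-in types; the crate's real `DastFinding` and `DastEvidence` have many more fields:

```rust
// Simplified stand-ins for the crate's DastEvidence / DastFinding.
#[derive(Clone, Debug)]
struct Evidence {
    response_snippet: Option<String>,
}

#[derive(Clone, Debug)]
struct Finding {
    evidence: Vec<Evidence>,
}

// Append the duplicate's evidence so no observed request/response is lost.
fn merge_evidence(primary: &mut Finding, duplicate: Finding) {
    primary.evidence.extend(duplicate.evidence);
}

fn main() {
    let mut a = Finding { evidence: vec![Evidence { response_snippet: Some("pass 1".into()) }] };
    let b = Finding { evidence: vec![Evidence { response_snippet: Some("pass 2".into()) }] };
    merge_evidence(&mut a, b);
    assert_eq!(a.evidence.len(), 2);
    println!("ok");
}
```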
@@ -1,5 +1,6 @@
use compliance_core::models::*;

use super::dedup::compute_fingerprint;
use super::orchestrator::PipelineOrchestrator;
use crate::error::AgentError;
use crate::pipeline::code_review::CodeReviewScanner;
@@ -89,12 +90,37 @@ impl PipelineOrchestrator {
            return Ok(());
        }

        // Dedup findings by fingerprint to avoid duplicate comments

sharang commented
**[medium] Complex boolean expression in finding deduplication**
The fingerprint computation and deduplication logic uses a complex boolean expression with multiple unwraps and nested conditions that make it hard to reason about. The code also has duplicated fingerprint computation logic.
Suggested fix: Extract the fingerprint computation into a separate function and simplify the deduplication logic. Consider using a more functional approach with iterators and map/filter operations instead of manual loops.
*Scanner: code-review/complexity |*
        let mut seen_fps = std::collections::HashSet::new();
        let mut unique_findings: Vec<&Finding> = Vec::new();
        for finding in &pr_findings {
            let fp = compute_fingerprint(&[
                repo_id,
                &pr_number.to_string(),
                finding.file_path.as_deref().unwrap_or(""),
                &finding.line_number.unwrap_or(0).to_string(),
                &finding.title,
            ]);

sharang commented
**[medium] Potential panic from unwrap in fingerprint computation**
The code uses `finding.file_path.as_deref().unwrap_or("")` which could panic if file_path is None and the fallback is not properly handled. While it's unlikely to be None, using `unwrap_or_default()` or similar safer alternatives would prevent potential panics.
Suggested fix: Replace `unwrap_or("")` with `unwrap_or_default()` or handle the Option more explicitly to avoid potential panics.
*Scanner: code-review/convention |*

sharang commented
**[medium] Insecure Direct Object Reference in Finding Processing**
The code processes findings based on file paths and line numbers without proper validation. If these values are derived from user input or external sources, they could potentially allow an attacker to access unintended files or lines through manipulation of the finding data.
Suggested fix: Validate and sanitize all inputs related to file paths and line numbers before processing them. Ensure that the file paths are within expected boundaries and that line numbers are valid for the given file.
*Scanner: code-review/security | CWE: CWE-639*

sharang commented
**[high] Incorrect deduplication logic**
The deduplication logic uses a HashSet to track fingerprints but then reassigns the pr_findings variable which shadows the original one. This means the deduplication only affects the local unique_findings vector, not the actual pr_findings that get processed later. The original pr_findings is never updated with unique_findings.
Suggested fix: Remove the line 'let pr_findings = unique_findings;' and instead modify the original pr_findings variable directly, or better yet, replace the entire loop with a more idiomatic approach using iterator methods.
*Scanner: code-review/logic |*
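The iterator form the reviewer suggests can be sketched as a generic helper: `HashSet::insert` returns `false` for values already present, so a `filter` keeps only the first item per key. This is a sketch of the suggestion, not code from this PR:

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Keep the first item per key; later items whose key was seen are dropped.
fn dedup_by_key<T, K: Hash + Eq>(items: Vec<T>, key: impl Fn(&T) -> K) -> Vec<T> {
    let mut seen = HashSet::new();
    items.into_iter().filter(|it| seen.insert(key(it))).collect()
}

fn main() {
    let titles = vec!["Missing HSTS", "XSS in /search", "Missing HSTS"];
    let unique = dedup_by_key(titles, |t| t.to_string());
    assert_eq!(unique, vec!["Missing HSTS", "XSS in /search"]);
    println!("ok");
}
```

With a helper like this, the manual loop collapses to one call keyed on the finding fingerprint, and no rebinding of `pr_findings` is needed.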
            if seen_fps.insert(fp) {
                unique_findings.push(finding);

sharang commented
**[medium] Duplicated fingerprint computation logic**
The fingerprint computation logic is duplicated in two places (lines 93-101 and 105-113) which increases maintenance burden and risk of inconsistencies.
Suggested fix: Create a helper function that takes the required components and returns the fingerprint to eliminate code duplication.
*Scanner: code-review/complexity |*
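The extraction the reviewer asks for is one function that owns the list of fields feeding the fingerprint. Below, `compute_fingerprint` is a placeholder hash rather than the crate's real implementation, and the parameter set mirrors the fields visible in this diff:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Placeholder for super::dedup::compute_fingerprint. Hashing each &str
// part separately keeps boundaries significant: ["ab","c"] != ["a","bc"].
fn compute_fingerprint(parts: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    for p in parts {
        p.hash(&mut h);
    }
    h.finish()
}

// Single place that knows which finding fields identify a review comment.
fn finding_fingerprint(repo_id: &str, pr_number: u64, path: &str, line: u64, title: &str) -> u64 {
    compute_fingerprint(&[repo_id, &pr_number.to_string(), path, &line.to_string(), title])
}

fn main() {
    let a = finding_fingerprint("repo1", 48, "src/lib.rs", 10, "Missing HSTS header");
    let b = finding_fingerprint("repo1", 48, "src/lib.rs", 10, "Missing HSTS header");
    assert_eq!(a, b, "same inputs must give the same fingerprint");
    assert_ne!(a, finding_fingerprint("repo1", 48, "src/lib.rs", 11, "Missing HSTS header"));
    println!("ok");
}
```

Both call sites in the diff (the dedup loop and the comment builder) could then call `finding_fingerprint` instead of repeating the field list.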
            }
        }

sharang commented
**[medium] Potential Information Disclosure via Comment Fingerprint**
The code includes a compliance fingerprint in PR review comments using `<!-- compliance-fp:{fp} -->`. If the fingerprint contains sensitive information or can be used to reconstruct internal state, it may lead to information disclosure. The fingerprint is computed from repository ID, PR number, file path, line number, and finding title.
Suggested fix: Review the fingerprint computation logic to ensure it does not include sensitive data. Consider using a hash of non-sensitive components only, and validate that the fingerprint cannot be reverse-engineered to expose internal system details.
*Scanner: code-review/security | CWE: CWE-200*
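A hidden HTML comment like `<!-- compliance-fp:{fp} -->` is only useful if a later pass can read it back from already-posted PR comments and skip reposting. A parsing sketch; the function name is hypothetical:

```rust
// Pull the fingerprint back out of a previously posted comment body.
fn extract_fingerprint(body: &str) -> Option<&str> {
    const PREFIX: &str = "<!-- compliance-fp:";
    let start = body.find(PREFIX)? + PREFIX.len();
    let rest = &body[start..];
    let end = rest.find(" -->")?;
    Some(&rest[..end])
}

fn main() {
    let body = "**[medium] Missing HSTS header**\n\n...\n\n<!-- compliance-fp:ab12cd -->";
    assert_eq!(extract_fingerprint(body), Some("ab12cd"));
    assert_eq!(extract_fingerprint("no marker here"), None);
    println!("ok");
}
```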
        let pr_findings = unique_findings;

        // Build review comments from findings
        let mut review_comments = Vec::new();
        for finding in &pr_findings {
            if let (Some(path), Some(line)) = (&finding.file_path, finding.line_number) {
                let fp = compute_fingerprint(&[
                    repo_id,
                    &pr_number.to_string(),
                    path,
                    &line.to_string(),
                    &finding.title,
                ]);
                let comment_body = format!(
                    "**[{}] {}**\n\n{}\n\n*Scanner: {} | {}*",
                    "**[{}] {}**\n\n{}\n\n*Scanner: {} | {}*\n\n<!-- compliance-fp:{fp} -->",
                    finding.severity,
                    finding.title,
                    finding.description,
@@ -123,6 +149,17 @@ impl PipelineOrchestrator {
                    .join("\n"),
sharang commented
**[medium] Deep nesting in PR review handling**
The PR review handling logic has multiple nested conditional blocks (if let chains and nested if statements) that make the control flow difficult to follow and increase cognitive load.
Suggested fix: Refactor the nested conditionals into early returns or extract the review creation logic into separate functions to flatten the control flow.
*Scanner: code-review/complexity |*

                );

        if review_comments.is_empty() {

sharang commented
**[medium] Inconsistent error handling in PR review pipeline**
The code uses both `tracing::warn!` for logging and direct error propagation with `?` operator inconsistently. The warning log is used for one case but not for others where errors might occur during PR review creation.
Suggested fix: Use consistent error handling pattern throughout the function. Either propagate all errors with `?` or handle them uniformly with logging.
*Scanner: code-review/convention |*
            // All findings were on files/lines we can't comment on inline
            if let Err(e) = tracker
                .create_pr_review(owner, tracker_repo_name, pr_number, &summary, Vec::new())
                .await
            {
                tracing::warn!("[{repo_id}] Failed to post PR review summary: {e}");
            }
            return Ok(());
        }

        if let Err(e) = tracker
            .create_pr_review(
                owner,
**[medium] Inconsistent error handling pattern in export_session_report**
The code uses `match` expressions for database operations but handles errors by returning empty vectors. This creates inconsistency with other parts of the codebase that likely use more robust error propagation patterns.
Suggested fix: Use a consistent error handling approach throughout the codebase, possibly propagating errors instead of silently returning empty collections.
*Scanner: code-review/convention |*

**[medium] Deeply nested control flow in export_session_report**
The export_session_report function has deeply nested control flow with multiple match statements and conditional blocks that make it difficult to follow the main logic path. The deduplication process adds additional nesting.
Suggested fix: Consider extracting the DAST findings fetching and deduplication logic into a separate function to flatten the control flow and improve readability.
*Scanner: code-review/complexity |*