feat: deduplicate code review findings across LLM passes #48
```diff
@@ -66,8 +66,10 @@ impl CodeReviewScanner {
             }
         }

+        let deduped = dedup_cross_pass(all_findings);
+
         ScanOutput {
-            findings: all_findings,
+            findings: deduped,
             sbom_entries: Vec::new(),
         }
     }
```
```diff
@@ -184,3 +186,51 @@ struct ReviewIssue {
     #[serde(default)]
     suggestion: Option<String>,
 }
```
**sharang** commented

**[low] Missing type annotation for dedup_cross_pass parameter**
The function `dedup_cross_pass` takes a parameter `findings` without an explicit type annotation, which could lead to confusion about expected input types and make refactoring harder.
Suggested fix: Add an explicit type annotation to the parameter: `findings: Vec<Finding>`
*Scanner: code-review/convention*
**sharang** commented

**[high] Complex deduplication logic in cross-pass code review**
The `dedup_cross_pass` function contains complex logic for building deduplication keys and comparing findings. The combination of string formatting, word normalization, and multi-criteria comparison makes this function hard to understand and maintain.
Suggested fix: Break the deduplication logic into smaller helper functions: one for generating the dedup key, another for comparing findings by severity/description length, and a third for merging CWE information.
*Scanner: code-review/complexity*
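As a sketch of what the severity/length comparison could look like once extracted into a helper (the name `keep_replacement` and the `u8` severity encoding are illustrative, not from the codebase):

```rust
// Returns true when a new finding should replace the stored one:
// higher severity wins; on a tie, the longer description wins.
// This mirrors the comparison inside `and_modify` in the diff.
fn keep_replacement(new_sev: u8, old_sev: u8, new_len: usize, old_len: usize) -> bool {
    new_sev > old_sev || (new_sev == old_sev && new_len > old_len)
}

fn main() {
    assert!(keep_replacement(3, 2, 0, 10)); // higher severity wins despite shorter text
    assert!(!keep_replacement(2, 3, 100, 1)); // lower severity never wins on length alone
    assert!(keep_replacement(2, 2, 20, 10)); // tie broken by longer description
}
```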
```diff
+/// Deduplicate findings across review passes.
+///
+/// Multiple passes often flag the same issue (e.g. SQL injection reported by
+/// logic, security, and convention passes). We group by file + nearby line +
+/// normalized title keywords and keep the highest-severity finding.
```
**sharang** commented

**[medium] Weak Deduplication Key Construction in Cross-Pass Code Review**
The deduplication key construction in the `dedup_cross_pass` function uses a simple line bucketing approach (`line_number / 4`) which might incorrectly group unrelated findings or fail to detect true duplicates when line numbers vary slightly. Additionally, the normalization of titles may not handle all edge cases properly.
Suggested fix: Improve the deduplication key generation by using more sophisticated techniques such as fuzzy matching or semantic analysis to better identify truly duplicate findings regardless of minor differences in formatting or location.
*Scanner: code-review/security | CWE: CWE-89* — wait, per the original: *Scanner: code-review/security | CWE: CWE-400*
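To make the fragility concrete, here is a minimal sketch of the title-normalization step (mirroring the word-set key in the diff; the titles are hypothetical): reordered wording collides as intended, but a one-word paraphrase escapes deduplication.

```rust
// Word-set key, mirroring the normalization in `dedup_key`:
// lowercase, split on non-alphanumerics, drop short words, sort.
fn norm_key(title: &str) -> String {
    let lower = title.to_lowercase();
    let mut words: Vec<&str> = lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| w.len() > 2)
        .collect();
    words.sort();
    words.join(",")
}

fn main() {
    // Reordered titles collapse to the same key (the intended behavior):
    assert_eq!(norm_key("SQL injection in query"), norm_key("Query SQL Injection"));
    // But a one-word paraphrase produces a distinct key (the weakness flagged above):
    assert_ne!(norm_key("SQL injection risk"), norm_key("SQL injection vulnerability"));
}
```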
```diff
+fn dedup_cross_pass(findings: Vec<Finding>) -> Vec<Finding> {
+    use std::collections::HashMap;
+
```
**sharang** commented

**[medium] Off-by-one error in line bucket calculation**
In `compliance-agent/src/pipeline/code_review.rs`, the line bucket calculation uses `f.line_number.unwrap_or(0) / 4`, which puts lines 0 through 3 into bucket 0 but line 4 into bucket 1. Lines 3 and 4 are therefore considered different buckets when they should probably be grouped together for deduplication purposes.
Suggested fix: Change the line bucket calculation to `(f.line_number.unwrap_or(0) + 3) / 4` to properly group lines within 3 of each other.
*Scanner: code-review/logic*
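A tiny sketch of the bucket edge this comment describes, assuming the `/ 4` bucketing from the diff (note that any fixed-width bucketing, including the suggested `(line + 3) / 4`, still has an edge somewhere; only a direct distance check avoids it entirely):

```rust
// With integer division by 4, adjacent lines 3 and 4 land in different
// buckets, so near-duplicates straddling a bucket edge are never grouped.
fn bucket(line: u32) -> u32 {
    line / 4
}

fn main() {
    assert_eq!(bucket(3), 0);
    assert_eq!(bucket(4), 1); // adjacent lines, yet split across buckets
    assert_eq!(bucket(4), bucket(7)); // while lines 4 and 7 (3 apart) do group
    // A direct check like a.abs_diff(b) <= 3 has no edge, but does not
    // produce a hashable key, which is why the code buckets instead.
}
```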
```diff
+    // Build a dedup key: (file, line bucket, normalized title words)
+    fn dedup_key(f: &Finding) -> String {
+        let file = f.file_path.as_deref().unwrap_or("");
+        // Group lines within 3 of each other
+        let line_bucket = f.line_number.unwrap_or(0) / 4;
+        // Normalize: lowercase, keep only alphanumeric, sort words for order-independence
+        let title_lower = f.title.to_lowercase();
+        let mut words: Vec<&str> = title_lower
+            .split(|c: char| !c.is_alphanumeric())
+            .filter(|w| w.len() > 2)
+            .collect();
+        words.sort();
+        format!("{file}:{line_bucket}:{}", words.join(","))
+    }
+
+    let mut groups: HashMap<String, Finding> = HashMap::new();
+
+    for finding in findings {
+        let key = dedup_key(&finding);
+
```
**sharang** commented

**[high] Incorrect deduplication logic in cross-pass deduplication**
In `compliance-agent/src/pipeline/code_review.rs`, the deduplication function compares descriptions by length rather than content. This means that a finding with a longer description will overwrite a finding with a shorter one even if the shorter one has higher severity, which is incorrect behavior.
Suggested fix: Fix the comparison logic to prioritize severity first, then consider description length only as a tiebreaker, or better yet, compare actual description content for meaningful deduplication.
*Scanner: code-review/logic*
```diff
+        groups
+            .entry(key)
+            .and_modify(|existing| {
+                // Keep the higher severity; on tie, keep the one with more detail
+                if finding.severity > existing.severity
+                    || (finding.severity == existing.severity
+                        && finding.description.len() > existing.description.len())
+                {
+                    *existing = finding.clone();
+                }
+                // Merge CWE if the existing one is missing it
+                if existing.cwe.is_none() {
+                    existing.cwe = finding.cwe.clone();
+                }
+            })
+            .or_insert(finding);
+    }
+
+    groups.into_values().collect()
+}
```
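The new function can be exercised end to end with a hypothetical minimal `Finding` (the real type lives elsewhere in the compliance-agent crate; the `u8` severity ordering and exact field set here are assumptions for this sketch):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the crate's `Finding`; field names follow the diff.
#[derive(Clone, Debug)]
struct Finding {
    file_path: Option<String>,
    line_number: Option<u32>,
    title: String,
    severity: u8, // higher = more severe, for this sketch
    description: String,
    cwe: Option<String>,
}

// Same key construction as in the diff: (file, line bucket, sorted title words).
fn dedup_key(f: &Finding) -> String {
    let file = f.file_path.as_deref().unwrap_or("");
    let line_bucket = f.line_number.unwrap_or(0) / 4;
    let title_lower = f.title.to_lowercase();
    let mut words: Vec<&str> = title_lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| w.len() > 2)
        .collect();
    words.sort();
    format!("{file}:{line_bucket}:{}", words.join(","))
}

// Same grouping loop as in the diff: keep the higher severity, tiebreak on
// description length, and backfill a missing CWE from the discarded duplicate.
fn dedup_cross_pass(findings: Vec<Finding>) -> Vec<Finding> {
    let mut groups: HashMap<String, Finding> = HashMap::new();
    for finding in findings {
        let key = dedup_key(&finding);
        groups
            .entry(key)
            .and_modify(|existing| {
                if finding.severity > existing.severity
                    || (finding.severity == existing.severity
                        && finding.description.len() > existing.description.len())
                {
                    *existing = finding.clone();
                }
                if existing.cwe.is_none() {
                    existing.cwe = finding.cwe.clone();
                }
            })
            .or_insert(finding);
    }
    groups.into_values().collect()
}

fn main() {
    // Two passes report the same issue on adjacent lines with reordered titles.
    let a = Finding {
        file_path: Some("src/db.rs".into()),
        line_number: Some(41),
        title: "SQL injection in query builder".into(),
        severity: 3,
        description: "short".into(),
        cwe: None,
    };
    let b = Finding {
        file_path: Some("src/db.rs".into()),
        line_number: Some(42),
        title: "Query builder SQL injection".into(),
        severity: 2,
        description: "longer description".into(),
        cwe: Some("CWE-89".into()),
    };
    let out = dedup_cross_pass(vec![a, b]);
    assert_eq!(out.len(), 1); // collapsed to one finding
    assert_eq!(out[0].severity, 3); // higher severity kept
    assert_eq!(out[0].cwe.as_deref(), Some("CWE-89")); // CWE merged from the duplicate
}
```

Lines 41 and 42 share bucket 10 and the sorted word sets match, so the two findings collapse; the surviving entry keeps the higher severity while inheriting the CWE the winner lacked.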
**[medium] Inconsistent error handling in dedup_cross_pass function**
The `dedup_cross_pass` function uses `unwrap_or`, which can lead to unexpected behavior when file paths are None. This pattern should be handled more explicitly to avoid potential panics or silent data loss.
Suggested fix: Replace `f.file_path.as_deref().unwrap_or("")` with explicit error handling or a default value that makes sense for the deduplication logic, such as a placeholder string like `"unknown_file"`.
*Scanner: code-review/convention*
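A minimal sketch of that suggestion (the helper name is illustrative; the `"unknown_file"` placeholder comes from the comment itself):

```rust
// Make the missing-path case explicit instead of silently keying on "".
fn key_file(file_path: Option<&str>) -> &str {
    match file_path {
        Some(p) if !p.is_empty() => p,
        _ => "unknown_file", // explicit placeholder, per the suggested fix
    }
}

fn main() {
    assert_eq!(key_file(None), "unknown_file");
    assert_eq!(key_file(Some("")), "unknown_file");
    assert_eq!(key_file(Some("src/main.rs")), "src/main.rs");
}
```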