e3aabe7d18
CI / Check (pull_request) Successful in 8m40s
CI / Detect Changes (pull_request) Has been skipped
CI / Deploy Agent (pull_request) Has been skipped
CI / Deploy Dashboard (pull_request) Has been skipped
CI / Deploy Docs (pull_request) Has been skipped
CI / Deploy MCP (pull_request) Has been skipped
First slice of the M7.2 tenant-isolation work. Adds a `DatabasePool`
that hands out per-tenant `Database` handles physically scoped to
`<prefix>_<tenant_id>` Mongo databases. Isolation is at the driver,
not at "we hope we filter" — a handle for tenant A literally cannot
see tenant B's documents because it's connected to a different db.
What's in this PR
- DatabasePool::connect — pings the cluster, prepares per-tenant lazy
handles.
- DatabasePool::for_tenant(&TenantContext) — returns a Database scoped
to that tenant. ensure_indexes runs once per tenant per process via
a DashMap-backed marker; failure rolls the marker back so the next
request retries.
- tenant_db_name — `<prefix>_<sanitized_tenant_id>` if it fits in
Mongo's 63-byte db-name cap, else `<prefix>_<sha256-16hex>` fallback.
- Sanitizer rewrites the Mongo-disallowed chars (`/ \ . " $ <space>
NUL`) so any future tenant_id shape works.
- ComplianceAgent gains a `db_pool: DatabasePool` field next to the
existing `db: Database`. Handlers / pipelines / webhooks still use
`db` — they migrate to `db_pool.for_tenant(&ctx)` in M7.2-B/C and
`db` goes away in M7.2-D.
Test plan
- cargo fmt --all clean
- cargo clippy --workspace --exclude compliance-dashboard -- -D warnings
clean
- cargo test -p compliance-core --lib — 7 pass
- cargo test -p compliance-agent --lib — 228 pass
- cargo test -p compliance-agent --test tenant_isolation — 4 pass
against live mongo on 27017:
* pool_isolates_tenants_at_driver_level — writes for acme + globex,
reads through each tenant's handle; each sees exactly its own
data with no filter doc anywhere.
* for_tenant_is_idempotent_index_creation — second + third call
for the same tenant do not error.
* tenant_db_name_sanitizes_unsafe_characters
* tenant_db_name_falls_back_to_hash_when_too_long — 100-byte
tenant_id collapses to a stable 8-byte hex suffix.
Why per-tenant DB vs `tenant_id` field + filter
- Driver-level isolation; impossible to forget the filter on one of
the 184 query call-sites in compliance-agent.
- Handlers don't change shape at migration — `agent.db.findings()`
becomes `db.findings()` after pulling `db` from
`agent.db_pool.for_tenant(&ctx)`.
- GDPR delete = `db.dropDatabase()`.
- On-prem deploy = the same code path, with one tenant.
- Trade-off accepted: index storage duplicated per tenant; Mongo's
~thousand-db ceiling is way above the 10s-100s tenants we're
targeting.
Caveats
- Existing `agent.db` continues to point at the single legacy db.
Handlers / pipelines that use it are unscoped until M7.2-B/C
migrate them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
438 lines
14 KiB
Rust
438 lines
14 KiB
Rust
use std::sync::Arc;
|
|
|
|
use dashmap::DashMap;
|
|
use mongodb::bson::doc;
|
|
use mongodb::options::IndexOptions;
|
|
use mongodb::{Client, Collection, IndexModel};
|
|
use sha2::{Digest, Sha256};
|
|
|
|
use compliance_core::models::*;
|
|
use compliance_core::TenantContext;
|
|
|
|
use crate::error::AgentError;
|
|
|
|
/// Mongo enforces a 63-byte cap on database names (older clusters: 64
|
|
/// on Linux, 63 on Windows; we target the conservative limit).
|
|
const MAX_DB_NAME_LEN: usize = 63;
|
|
|
|
/// Per-tenant Mongo connection broker (M7.2 isolation model).
|
|
///
|
|
/// Holds one [`Client`] and hands out [`Database`] handles physically
|
|
/// scoped to `<db_prefix>_<tenant_id>`. The driver is the isolation
|
|
/// boundary — a handle for tenant A cannot see tenant B's documents
|
|
/// because it is connected to a different database, not because of an
|
|
/// application-level filter.
|
|
///
|
|
/// Index creation runs idempotently the first time each tenant is seen
|
|
/// in the process's lifetime. Mongo's `createIndex` is itself idempotent
|
|
/// by index name; the in-memory `ensured` set just skips the round-trip.
|
|
#[derive(Clone, Debug)]
|
|
pub struct DatabasePool {
|
|
client: Client,
|
|
db_prefix: String,
|
|
ensured: Arc<DashMap<String, ()>>,
|
|
}
|
|
|
|
impl DatabasePool {
|
|
/// Connect to the cluster and prepare to hand out tenant databases
|
|
/// named `<db_prefix>_<tenant_id>`.
|
|
pub async fn connect(uri: &str, db_prefix: &str) -> Result<Self, AgentError> {
|
|
let client = Client::with_uri_str(uri).await?;
|
|
client
|
|
.database("admin")
|
|
.run_command(doc! { "ping": 1 })
|
|
.await?;
|
|
tracing::info!(
|
|
"MongoDB cluster reachable; per-tenant pool ready (db prefix '{db_prefix}')"
|
|
);
|
|
Ok(Self {
|
|
client,
|
|
db_prefix: db_prefix.to_string(),
|
|
ensured: Arc::new(DashMap::new()),
|
|
})
|
|
}
|
|
|
|
/// Return a [`Database`] scoped to this tenant. Ensures indexes on
|
|
/// first call per tenant (per process). Cheap on the hot path —
|
|
/// subsequent calls skip the round-trip.
|
|
pub async fn for_tenant(&self, ctx: &TenantContext) -> Result<Database, AgentError> {
|
|
let db_name = self.tenant_db_name(&ctx.tenant_id);
|
|
let db = Database::from_database(self.client.database(&db_name));
|
|
// `DashMap::insert` returns the previous value; `None` means we
|
|
// were the first writer for this tenant_id and own the
|
|
// index-ensure work.
|
|
if self.ensured.insert(ctx.tenant_id.clone(), ()).is_none() {
|
|
if let Err(e) = db.ensure_indexes().await {
|
|
// Roll the marker back so the next request retries.
|
|
self.ensured.remove(&ctx.tenant_id);
|
|
return Err(e);
|
|
}
|
|
tracing::debug!(
|
|
tenant_id = %ctx.tenant_id,
|
|
db_name = %db_name,
|
|
"Indexes ensured for tenant database"
|
|
);
|
|
}
|
|
Ok(db)
|
|
}
|
|
|
|
/// Compute the Mongo database name for a tenant. Public for tests
|
|
/// and tenant offboarding (`pool.client().database(name).drop()`).
|
|
///
|
|
/// Format: `<prefix>_<sanitized_tenant_id>` if it fits in 63 chars,
|
|
/// otherwise `<prefix>_<sha256-16hex-of-tenant_id>`. The hash
|
|
/// fallback is collision-resistant in practice (2^64 keyspace)
|
|
/// while keeping the name bounded.
|
|
pub fn tenant_db_name(&self, tenant_id: &str) -> String {
|
|
let sanitized = sanitize_tenant_id(tenant_id);
|
|
let natural = format!("{}_{}", self.db_prefix, sanitized);
|
|
if natural.len() <= MAX_DB_NAME_LEN {
|
|
natural
|
|
} else {
|
|
let mut hasher = Sha256::new();
|
|
hasher.update(tenant_id.as_bytes());
|
|
let digest = hasher.finalize();
|
|
// 16 hex chars = 8 bytes = 64-bit truncation.
|
|
let suffix = hex::encode(&digest[..8]);
|
|
let hashed = format!("{}_{}", self.db_prefix, suffix);
|
|
debug_assert!(hashed.len() <= MAX_DB_NAME_LEN);
|
|
hashed
|
|
}
|
|
}
|
|
|
|
/// Raw client handle. Reserved for cross-tenant admin flows that
|
|
/// must opt in explicitly (tenant listing, drop-on-offboard).
|
|
pub fn client(&self) -> &Client {
|
|
&self.client
|
|
}
|
|
}
|
|
|
|
/// Mongo database names disallow `/`, `\`, `.`, `"`, `$`, ` `, and NUL.
|
|
/// breakpilot-dev tenant_ids are UUIDs so this is belt-and-braces, but
|
|
/// it lets the pool tolerate any future tenant_id shape without surprise.
|
|
fn sanitize_tenant_id(tenant_id: &str) -> String {
|
|
tenant_id
|
|
.chars()
|
|
.map(|c| match c {
|
|
'/' | '\\' | '.' | '"' | '$' | ' ' | '\0' => '_',
|
|
c => c,
|
|
})
|
|
.collect()
|
|
}
|
|
|
|
#[derive(Clone, Debug)]
|
|
pub struct Database {
|
|
inner: mongodb::Database,
|
|
}
|
|
|
|
impl Database {
|
|
pub async fn connect(uri: &str, db_name: &str) -> Result<Self, AgentError> {
|
|
let client = Client::with_uri_str(uri).await?;
|
|
let db = client.database(db_name);
|
|
db.run_command(doc! { "ping": 1 }).await?;
|
|
tracing::info!("Connected to MongoDB database '{db_name}'");
|
|
Ok(Self { inner: db })
|
|
}
|
|
|
|
/// Wrap an already-resolved Mongo database. Used by [`DatabasePool`]
|
|
/// to hand out tenant-scoped handles without a fresh client per tenant.
|
|
pub(crate) fn from_database(inner: mongodb::Database) -> Self {
|
|
Self { inner }
|
|
}
|
|
|
|
pub async fn ensure_indexes(&self) -> Result<(), AgentError> {
|
|
// repositories: unique git_url
|
|
self.repositories()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "git_url": 1 })
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// findings: unique fingerprint
|
|
self.findings()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "fingerprint": 1 })
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// findings: repo_id + severity compound
|
|
self.findings()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "severity": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// scan_runs: repo_id + started_at descending
|
|
self.scan_runs()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "started_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// sbom_entries: compound
|
|
self.sbom_entries()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "name": 1, "version": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// cve_alerts: unique cve_id + repo_id
|
|
self.cve_alerts()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "cve_id": 1, "repo_id": 1 })
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// cve_notifications: unique cve_id + repo_id + package, status filter
|
|
self.cve_notifications()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(
|
|
doc! { "cve_id": 1, "repo_id": 1, "package_name": 1, "package_version": 1 },
|
|
)
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
self.cve_notifications()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "status": 1, "created_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// tracker_issues: unique finding_id
|
|
self.tracker_issues()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "finding_id": 1 })
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// graph_nodes: compound (repo_id, graph_build_id)
|
|
self.graph_nodes()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "graph_build_id": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// graph_edges: compound (repo_id, graph_build_id)
|
|
self.graph_edges()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "graph_build_id": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// graph_builds: compound (repo_id, started_at DESC)
|
|
self.graph_builds()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "started_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// impact_analyses: unique (repo_id, finding_id)
|
|
self.impact_analyses()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "finding_id": 1 })
|
|
.options(IndexOptions::builder().unique(true).build())
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// dast_targets: index on repo_id
|
|
self.dast_targets()
|
|
.create_index(IndexModel::builder().keys(doc! { "repo_id": 1 }).build())
|
|
.await?;
|
|
|
|
// dast_scan_runs: compound (target_id, started_at DESC)
|
|
self.dast_scan_runs()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "target_id": 1, "started_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// dast_findings: compound (scan_run_id, vuln_type)
|
|
self.dast_findings()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "scan_run_id": 1, "vuln_type": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// code_embeddings: compound (repo_id, graph_build_id)
|
|
self.code_embeddings()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "graph_build_id": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// embedding_builds: compound (repo_id, started_at DESC)
|
|
self.embedding_builds()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "repo_id": 1, "started_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// pentest_sessions: compound (target_id, started_at DESC)
|
|
self.pentest_sessions()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "target_id": 1, "started_at": -1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// pentest_sessions: status index
|
|
self.pentest_sessions()
|
|
.create_index(IndexModel::builder().keys(doc! { "status": 1 }).build())
|
|
.await?;
|
|
|
|
// attack_chain_nodes: compound (session_id, node_id)
|
|
self.attack_chain_nodes()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "session_id": 1, "node_id": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
// pentest_messages: compound (session_id, created_at)
|
|
self.pentest_messages()
|
|
.create_index(
|
|
IndexModel::builder()
|
|
.keys(doc! { "session_id": 1, "created_at": 1 })
|
|
.build(),
|
|
)
|
|
.await?;
|
|
|
|
tracing::info!("Database indexes ensured");
|
|
Ok(())
|
|
}
|
|
|
|
pub fn repositories(&self) -> Collection<TrackedRepository> {
|
|
self.inner.collection("repositories")
|
|
}
|
|
|
|
pub fn findings(&self) -> Collection<Finding> {
|
|
self.inner.collection("findings")
|
|
}
|
|
|
|
pub fn scan_runs(&self) -> Collection<ScanRun> {
|
|
self.inner.collection("scan_runs")
|
|
}
|
|
|
|
pub fn sbom_entries(&self) -> Collection<SbomEntry> {
|
|
self.inner.collection("sbom_entries")
|
|
}
|
|
|
|
pub fn cve_alerts(&self) -> Collection<CveAlert> {
|
|
self.inner.collection("cve_alerts")
|
|
}
|
|
|
|
pub fn cve_notifications(
|
|
&self,
|
|
) -> Collection<compliance_core::models::notification::CveNotification> {
|
|
self.inner.collection("cve_notifications")
|
|
}
|
|
|
|
pub fn tracker_issues(&self) -> Collection<TrackerIssue> {
|
|
self.inner.collection("tracker_issues")
|
|
}
|
|
|
|
// Graph collections
|
|
pub fn graph_nodes(&self) -> Collection<compliance_core::models::graph::CodeNode> {
|
|
self.inner.collection("graph_nodes")
|
|
}
|
|
|
|
pub fn graph_edges(&self) -> Collection<compliance_core::models::graph::CodeEdge> {
|
|
self.inner.collection("graph_edges")
|
|
}
|
|
|
|
pub fn graph_builds(&self) -> Collection<compliance_core::models::graph::GraphBuildRun> {
|
|
self.inner.collection("graph_builds")
|
|
}
|
|
|
|
pub fn impact_analyses(&self) -> Collection<compliance_core::models::graph::ImpactAnalysis> {
|
|
self.inner.collection("impact_analyses")
|
|
}
|
|
|
|
// DAST collections
|
|
pub fn dast_targets(&self) -> Collection<DastTarget> {
|
|
self.inner.collection("dast_targets")
|
|
}
|
|
|
|
pub fn dast_scan_runs(&self) -> Collection<DastScanRun> {
|
|
self.inner.collection("dast_scan_runs")
|
|
}
|
|
|
|
pub fn dast_findings(&self) -> Collection<DastFinding> {
|
|
self.inner.collection("dast_findings")
|
|
}
|
|
|
|
// Embedding collections
|
|
pub fn code_embeddings(&self) -> Collection<compliance_core::models::embedding::CodeEmbedding> {
|
|
self.inner.collection("code_embeddings")
|
|
}
|
|
|
|
pub fn embedding_builds(
|
|
&self,
|
|
) -> Collection<compliance_core::models::embedding::EmbeddingBuildRun> {
|
|
self.inner.collection("embedding_builds")
|
|
}
|
|
|
|
// Pentest collections
|
|
pub fn pentest_sessions(&self) -> Collection<PentestSession> {
|
|
self.inner.collection("pentest_sessions")
|
|
}
|
|
|
|
pub fn attack_chain_nodes(&self) -> Collection<AttackChainNode> {
|
|
self.inner.collection("attack_chain_nodes")
|
|
}
|
|
|
|
pub fn pentest_messages(&self) -> Collection<PentestMessage> {
|
|
self.inner.collection("pentest_messages")
|
|
}
|
|
|
|
#[allow(dead_code)]
|
|
pub fn raw_collection(&self, name: &str) -> Collection<mongodb::bson::Document> {
|
|
self.inner.collection(name)
|
|
}
|
|
|
|
/// Get the raw MongoDB database handle (for graph persistence)
|
|
pub fn inner(&self) -> &mongodb::Database {
|
|
&self.inner
|
|
}
|
|
}
|