c39787ad96
Semantic correction of the knowledge base BEFORE the empirical loop (#59) is built; otherwise the Observation Store would learn from already-misclassified signals.

The Silent Pass conflated two kinds of signal into one: an OBSERVATION ("I saw an SBOM in the repo") and a REQUIREMENT ("a tender DEMANDS an SBOM"). They were aliased to the same canonical id, so a tender clause read as "SBOM already present" and suppressed the very question that should have been asked.

Fix: make the kind explicit and authoritative (no new architecture, data + thin wiring):
- `kind` ∈ {observation, requirement} on ProducedSignal (producer may declare) and on the canonical SignalVocabularyEntry (AUTHORITATIVE: a mislabelled producer cannot collapse the two).
- Vocabulary split: sbom_file_found → sbom_present (obs) + sbom_required (req); security_txt_or_cvd_policy → cvd_policy_present (obs) + psirt_required (req); add signed_updates_required. Requirement signals are intentionally UNMAPPED in intake_signal_map (they describe a target, not state).
- silent_intake() consumes ONLY kind==observation; requirement signals are preserved in `requirements_seen` (visible/auditable) but NEVER become a detected capability.
- normalize_signals() stamps the vocabulary's kind onto every IntakeSignal; unknown ids still pass through.

This is the same Observation-vs-Requirement split the Requirements Verification Platform rests on: observations are reality, requirements are targets, and their comparison is the delta. A tender / OEM spec / law now produces requirement signals; scanners / repos / documents produce observation signals.
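The consumption rule described above (only observations can become a detected capability, requirements are kept aside in `requirements_seen`) can be sketched roughly as follows. This is a minimal stand-in, not the real `silent_intake()`: the `Signal` dataclass, the shape of `INTAKE_SIGNAL_MAP`, and the function name `silent_intake_sketch` are all assumptions made for illustration; only the signal ids and the `sbom_creation` capability name come from this commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Signal:
    """Hypothetical stand-in for IntakeSignal; only the fields this sketch needs."""
    signal: str
    kind: str = "observation"


# Assumed shape: canonical OBSERVATION id -> capability it proves.
# Requirement ids are deliberately absent (they describe a target, not state).
INTAKE_SIGNAL_MAP: Dict[str, str] = {"sbom_present": "sbom_creation"}


def silent_intake_sketch(signals: List[Signal]) -> Tuple[List[str], List[str]]:
    """Return (detected capabilities, requirement ids seen)."""
    detected: List[str] = []
    requirements_seen: List[str] = []
    for s in signals:
        if s.kind == "requirement":
            # Preserved for audit and later prioritisation, never a capability.
            requirements_seen.append(s.signal)
            continue
        capability = INTAKE_SIGNAL_MAP.get(s.signal)
        if capability is not None and capability not in detected:
            detected.append(capability)
    return detected, requirements_seen
```

With this shape, a tender that only contributes `sbom_required` leaves the detected list empty, so the SBOM question would still be asked.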
Tests: rewrote the two test_signal_producer cases that previously ASSERTED the bug (tender == repo) to pin the correct split. Regressions added: `requires_sbom` yields no capability and stays in requirements_seen, while `cyclonedx_found` still detects sbom_creation; at endpoint level, a tender requirement does not auto-detect and the gap stays asked; the vocabulary kind overrides a mislabelled producer. 25 onboarding tests pass, mypy --strict clean, demo runs, check-loc 0. Runtime effect → deploy + smoke. (Fix A; partial-vs-detected decoupling follows as Fix B before #59.)
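A rough sketch of what the "vocabulary kind overrides mislabelled producer" regression might look like, written against a stdlib-only stand-in rather than the real `normalize_signals`. The `VOCABULARY` table and the `resolve` helper are hypothetical, and the alias assignments (`requires_sbom` under `sbom_required`, `cyclonedx_found` under `sbom_present`) are inferred from the description above, not copied from the actual vocabulary data.

```python
from typing import Dict, List, Tuple

# Assumed vocabulary: canonical id -> (authoritative kind, aliases).
VOCABULARY: Dict[str, Tuple[str, List[str]]] = {
    "sbom_present": ("observation", ["cyclonedx_found"]),
    "sbom_required": ("requirement", ["requires_sbom"]),
}


def resolve(signal_id: str, declared_kind: str = "") -> Tuple[str, str]:
    """Return (canonical id, kind); the vocabulary kind beats the producer's."""
    for canonical, (kind, aliases) in VOCABULARY.items():
        if signal_id == canonical or signal_id in aliases:
            return canonical, kind
    # Unknown id: pass through and keep the declared kind (default observation).
    return signal_id, declared_kind or "observation"


def test_vocabulary_kind_overrides_mislabelled_producer() -> None:
    # A tender parser wrongly declares its requirement as an observation.
    assert resolve("requires_sbom", "observation") == ("sbom_required", "requirement")


def test_observation_alias_still_resolves() -> None:
    assert resolve("cyclonedx_found") == ("sbom_present", "observation")


def test_unknown_id_passes_through() -> None:
    assert resolve("brand_new_signal", "requirement") == ("brand_new_signal", "requirement")
```

The functions follow the pytest naming convention, so a test runner would collect them as-is; they can also be called directly.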
74 lines
3.9 KiB
Python
"""Signal Producer interface + Normalizer: one signal language, but TWO signal KINDS.

The platform already HAS scanners (website, repo/code, SBOM, security headers, TLS, SPF/DKIM/DMARC,
document analysis, RAG over uploads, product classification). The Silent Pass does not want a
WebsiteScanner or a RepoScanner; it wants their UNIFIED output. So every source (a scanner, a PDF
parser, a tender parser, an OEM spec, an API, or the user) emits the SAME `ProducedSignal`
{signal_id, source_type, kind, confidence, evidence, provenance}, and `normalize_signals` reduces
producer-specific ids to ONE canonical signal via a vocabulary (id + aliases + kind), exactly the
Requirement-Source / MCAP / regulation-alias pattern. The Silent Pass then never gets per-scanner logic.

CRITICAL: a signal is one of two KINDS, and they NEVER substitute for each other:
  observation = "I SAW X": a repo with an SBOM, a published security.txt, a risk-assessment PDF.
  requirement = "someone DEMANDS X": a tender clause `requires_sbom`, an OEM spec `supplier_requires_psirt`.
A demanded SBOM is NOT a present SBOM. `kind` is carried on the canonical VOCABULARY entry (authoritative),
so even a mislabelled producer signal cannot collapse the two. The Silent Pass consumes ONLY observations;
requirement signals are preserved and feed the required-set / prioritisation later. This
Observation-vs-Requirement split is the very one the Requirements Verification Platform rests on:
Observations (reality) vs Requirements (targets); their comparison IS the delta.
Pure, deterministic, no I/O. Python 3.9 compatible.
"""

from __future__ import annotations

from typing import Dict, List, Optional, Sequence

from pydantic import BaseModel, Field

from .silent_intake import IntakeSignal


class ProducedSignal(BaseModel):
    """What ANY signal producer emits: the common interface every source agrees on."""

    signal_id: str                  # raw or canonical id the producer used
    source_type: str = ""           # website / repository / document / product / tender / oem / user / api
    kind: str = ""                  # "observation" | "requirement"; empty -> resolved from the vocabulary
    confidence: float = 1.0
    evidence: Optional[str] = None  # the artifact found (already in hand)
    provenance: str = ""            # url / filename / tender clause / "customer statement"


class SignalVocabularyEntry(BaseModel):
    """One canonical signal + its aliases + its KIND (the authoritative observation/requirement label)."""

    id: str
    kind: str = "observation"  # "observation" (I saw X) | "requirement" (someone DEMANDS X)
    aliases: List[str] = Field(default_factory=list)


def normalize_signals(
    produced: Sequence[ProducedSignal], vocabulary: Sequence[SignalVocabularyEntry]
) -> List[IntakeSignal]:
    """Reduce heterogeneous producer signals to the canonical IntakeSignal stream (alias resolution).

    The canonical vocabulary entry's `kind` is AUTHORITATIVE: a producer cannot relabel a requirement as
    an observation (that is what stops a demanded SBOM from masquerading as a present one). Unknown signal
    ids pass through unchanged (a new producer's signal stays visible, not silently dropped) and keep the
    producer-declared kind (default observation). Deterministic; carries confidence/evidence/provenance.
    """
    # Lookup tables: every canonical id and alias maps to its canonical id; kind is keyed by canonical id.
    alias: Dict[str, str] = {}
    kind_of: Dict[str, str] = {}
    for v in vocabulary:
        alias[v.id] = v.id
        kind_of[v.id] = v.kind
        for a in v.aliases:
            alias[a] = v.id
    out: List[IntakeSignal] = []
    for p in produced:
        canonical = alias.get(p.signal_id, p.signal_id)
        # The vocabulary kind wins; fall back to the producer's kind, then to "observation".
        kind = kind_of.get(canonical) or p.kind or "observation"
        out.append(IntakeSignal(
            source=p.source_type, signal=canonical, kind=kind, confidence=p.confidence,
            evidence=p.evidence, provenance=p.provenance))
    return out