fix(onboarding): separate observation vs requirement signals — a demanded SBOM is not a present SBOM

Semantic correction of the knowledge base BEFORE the empirical loop (#59) is built — otherwise the
Observation Store would learn from already-misclassified signals. The Silent Pass conflated two kinds of
signal into one: an OBSERVATION ("I saw an SBOM in the repo") and a REQUIREMENT ("a tender DEMANDS an
SBOM"). They were aliased to the same canonical id, so a tender clause read as "SBOM already present" and
suppressed the very question that should have been asked.

Fix — make the kind explicit and authoritative (no new architecture, data + thin wiring):
  - `kind` ∈ {observation, requirement} on ProducedSignal (producer may declare) and on the canonical
    SignalVocabularyEntry (AUTHORITATIVE — a mislabelled producer cannot collapse the two).
  - Vocabulary split: sbom_file_found → sbom_present (obs) + sbom_required (req);
    security_txt_or_cvd_policy → cvd_policy_present (obs) + psirt_required (req); add signed_updates_required.
    Requirement signals are intentionally UNMAPPED in intake_signal_map (they describe a target, not state).
  - silent_intake() consumes ONLY kind==observation; requirement signals are preserved in
    `requirements_seen` (visible/auditable) but NEVER become a detected capability.
  - normalize_signals() stamps the vocabulary's kind onto every IntakeSignal; unknown ids still pass through.
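
A minimal sketch of the rule in the bullets above, not the project code: plain dataclasses stand in for
the pydantic models, and `VocabEntry`, `Produced`, `resolve` and `split_by_kind` are hypothetical names.
The alias assignments (`cyclonedx_found` → sbom_present, `requires_sbom` → sbom_required) are assumed for
illustration; only the resolution order (vocabulary kind, then producer kind, then "observation") is taken
from this commit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VocabEntry:  # hypothetical stand-in for SignalVocabularyEntry
    id: str
    kind: str = "observation"
    aliases: List[str] = field(default_factory=list)


@dataclass
class Produced:  # hypothetical stand-in for ProducedSignal
    signal_id: str
    kind: str = ""  # empty means "let the vocabulary decide"


def resolve(produced: List[Produced], vocabulary: List[VocabEntry]) -> List[Tuple[str, str]]:
    """Return (canonical_id, kind) pairs; the vocabulary kind wins over the producer's."""
    alias: Dict[str, str] = {}
    kind_of: Dict[str, str] = {}
    for v in vocabulary:
        alias[v.id] = v.id
        kind_of[v.id] = v.kind
        for a in v.aliases:
            alias[a] = v.id
    out: List[Tuple[str, str]] = []
    for p in produced:
        canonical = alias.get(p.signal_id, p.signal_id)  # unknown ids pass through
        kind = kind_of.get(canonical) or p.kind or "observation"
        out.append((canonical, kind))
    return out


def split_by_kind(resolved: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Only observations may become capabilities; everything else is kept as a requirement."""
    observations = sorted({s for s, k in resolved if k == "observation"})
    requirements = sorted({s for s, k in resolved if k != "observation"})
    return observations, requirements


# Assumed vocabulary for the example (alias placement is a guess, see lead-in).
VOCAB = [
    VocabEntry("sbom_present", "observation", ["cyclonedx_found"]),
    VocabEntry("sbom_required", "requirement", ["requires_sbom"]),
]
produced = [
    Produced("cyclonedx_found"),
    Produced("requires_sbom", kind="observation"),  # mislabelled by its producer
    Produced("brand_new_signal"),                   # not in the vocabulary
]
resolved = resolve(produced, VOCAB)
observations, requirements = split_by_kind(resolved)
print(observations, requirements)
```

The mislabelled `requires_sbom` ends up under requirements because the lookup consults the vocabulary
before the producer's own `kind`.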

This is the same Observation-vs-Requirement split the Requirements Verification Platform rests on:
observations are reality, requirements are targets, and their comparison is the delta. A tender / OEM spec /
law now produces requirement signals; scanners / repos / documents produce observation signals.
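
To make "their comparison is the delta" concrete, here is a hedged sketch. This commit does not implement
the comparison; `open_gaps` and `REQUIREMENT_TARGETS` are hypothetical, and so is the mapping from a
requirement signal to the capability it targets (`psirt_process` is an invented capability id;
`sbom_creation` is the one named in the tests).

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical table: which capability would satisfy which requirement signal.
REQUIREMENT_TARGETS: Dict[str, str] = {
    "sbom_required": "sbom_creation",
    "psirt_required": "psirt_process",  # invented id, for illustration only
}


def open_gaps(
    requirements_seen: Iterable[str],
    capability_ids: Iterable[str],
    targets: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (requirement, target_capability) pairs not covered by an observed capability.

    A requirement with no known target is reported with an empty capability,
    so it stays visible instead of being dropped.
    """
    present = set(capability_ids)
    gaps: List[Tuple[str, str]] = []
    for requirement in sorted(set(requirements_seen)):
        capability = targets.get(requirement, "")
        if capability not in present:
            gaps.append((requirement, capability))
    return gaps


# A tender demands an SBOM and a PSIRT; the repo scan only observed an SBOM.
gaps = open_gaps(
    ["sbom_required", "psirt_required"],
    ["sbom_creation"],
    REQUIREMENT_TARGETS,
)
print(gaps)
```

Keeping the two inputs separate is the point: had the tender's `sbom_required` been counted as present,
it would have landed in `capability_ids` and hidden its own gap.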

Tests: rewrote the two test_signal_producer cases that previously ASSERTED the bug (tender == repo) so they
pin the correct split. New regressions: `requires_sbom` yields no capability and stays in requirements_seen,
while `cyclonedx_found` still detects sbom_creation; at endpoint level, a tender requirement does not
auto-detect and the gap is still asked; the vocabulary kind overrides a mislabelled producer. 25 onboarding
tests pass, mypy --strict clean, demo runs, check-loc 0. Runtime effect → deploy + smoke. (Fix A; the
partial-vs-detected decoupling follows as Fix B before #59.)
Benjamin Admin
2026-06-28 15:52:50 +02:00
parent b5b6cdddb3
commit c39787ad96
7 changed files with 121 additions and 42 deletions
@@ -1,16 +1,21 @@
"""Signal Producer interface + Normalizer — one signal language for all sources (NOT new architecture).
"""Signal Producer interface + Normalizer — one signal language, but TWO signal KINDS.
 The platform already HAS scanners (website, repo/code, SBOM, security headers, TLS, SPF/DKIM/DMARC,
 document analysis, RAG over uploads, product classification). The Silent Pass does not want a
 WebsiteScanner or a RepoScanner — it wants their UNIFIED output. So every source (a scanner, a PDF
-parser, a tender parser, an API, or the user) emits the SAME `ProducedSignal`
-{signal_id, source_type, confidence, evidence, provenance}, and `normalize_signals` reduces producer-
-specific signal ids to ONE canonical signal id via a vocabulary (id + aliases) — exactly the
+parser, a tender parser, an OEM spec, an API, or the user) emits the SAME `ProducedSignal`
+{signal_id, source_type, kind, confidence, evidence, provenance}, and `normalize_signals` reduces
+producer-specific ids to ONE canonical signal via a vocabulary (id + aliases + kind) — exactly the
 Requirement-Source / MCAP / regulation-alias pattern. The Silent Pass then never gets per-scanner logic.
-A common DATA FORMAT, not a new module/framework. Later a tender (`requires_sbom`) or an OEM spec
-(`supplier_requires_psirt`) produces the same stream as a website — the Silent Pass cannot tell the
-difference. Pure, deterministic, no I/O. Python 3.9 compatible.
+CRITICAL — a signal is one of two KINDS, and they NEVER substitute for each other:
+observation = "I SAW X" — a repo with an SBOM, a published security.txt, a risk-assessment PDF.
+requirement = "someone DEMANDS X" — a tender clause `requires_sbom`, an OEM spec `supplier_requires_psirt`.
+A demanded SBOM is NOT a present SBOM. `kind` is carried on the canonical VOCABULARY entry (authoritative),
+so even a mislabelled producer signal cannot collapse the two. The Silent Pass consumes ONLY observations;
+requirement signals are preserved and feed the required-set / prioritisation later. This Observation-vs-
+Requirement split is the very one the Requirements Verification Platform rests on: Observations (reality)
+vs Requirements (targets); their comparison IS the delta. Pure, deterministic, no I/O. Python 3.9 compatible.
"""
from __future__ import annotations
@@ -27,15 +32,17 @@ class ProducedSignal(BaseModel):
     signal_id: str # raw or canonical id the producer used
     source_type: str = "" # website / repository / document / product / tender / oem / user / api
+    kind: str = "" # "observation" | "requirement"; empty -> resolved from the vocabulary
     confidence: float = 1.0
     evidence: Optional[str] = None # the artifact found (already in hand)
     provenance: str = "" # url / filename / tender clause / "customer statement"
 class SignalVocabularyEntry(BaseModel):
-    """One canonical signal + the producer-specific aliases that mean the same thing."""
+    """One canonical signal + its aliases + its KIND (the authoritative observation/requirement label)."""
     id: str
+    kind: str = "observation" # "observation" (I saw X) | "requirement" (someone DEMANDS X)
     aliases: List[str] = Field(default_factory=list)
@@ -44,18 +51,23 @@ def normalize_signals(
 ) -> List[IntakeSignal]:
     """Reduce heterogeneous producer signals to the canonical IntakeSignal stream (alias resolution).
-    Unknown signal ids pass through unchanged (a new producer's signal stays visible, not silently
-    dropped). Deterministic; carries confidence/evidence/provenance for the audit trail.
+    The canonical vocabulary entry's `kind` is AUTHORITATIVE — a producer cannot relabel a requirement as
+    an observation (that is what stops a demanded SBOM from masquerading as a present one). Unknown signal
+    ids pass through unchanged (a new producer's signal stays visible, not silently dropped) and keep the
+    producer-declared kind (default observation). Deterministic; carries confidence/evidence/provenance.
     """
     alias: Dict[str, str] = {}
+    kind_of: Dict[str, str] = {}
     for v in vocabulary:
         alias[v.id] = v.id
+        kind_of[v.id] = v.kind
         for a in v.aliases:
             alias[a] = v.id
     out: List[IntakeSignal] = []
     for p in produced:
         canonical = alias.get(p.signal_id, p.signal_id)
+        kind = kind_of.get(canonical) or p.kind or "observation"
         out.append(IntakeSignal(
-            source=p.source_type, signal=canonical, confidence=p.confidence,
+            source=p.source_type, signal=canonical, kind=kind, confidence=p.confidence,
             evidence=p.evidence, provenance=p.provenance))
     return out
@@ -24,7 +24,8 @@ class IntakeSignal(BaseModel):
     from a website, a repo, a PDF, a tender or the user — normalize_signals() unified them (see signals.py)."""
     source: str # source_type: website / repository / document / product / tender / user
-    signal: str # CANONICAL signal id, e.g. "sbom_file_found"
+    signal: str # CANONICAL signal id, e.g. "sbom_present"
+    kind: str = "observation" # "observation" (I saw X) | "requirement" (someone DEMANDS X)
     confidence: float = 1.0 # carried from the producer
     evidence: Optional[str] = None # the artifact already in hand
     provenance: str = "" # where it came from (url / filename / tender clause) — audit trail
@@ -61,10 +62,13 @@ class SilentIntakeResult(BaseModel):
     detected_capabilities: List[DetectedCapability] = Field(default_factory=list)
     product_facts: List[ProductFact] = Field(default_factory=list)
     evidence_found: List[str] = Field(default_factory=list)
+    requirements_seen: List[str] = Field(default_factory=list) # requirement-kind signals — preserved, NOT present
     summary: str = ""
     def capability_ids(self) -> List[str]:
-        """The detected capability ids — fed into the Advisor as already-present (delta-reducing)."""
+        """The detected capability ids — fed into the Advisor as already-present (delta-reducing).
+        ONLY observation-kind signals reach here (requirements never become a present capability)."""
         return sorted({d.capability for d in self.detected_capabilities})
@@ -83,7 +87,11 @@ def silent_intake(
     caps: Dict[str, DetectedCapability] = {}
     facts: Dict[str, ProductFact] = {}
     evidence: Set[str] = set()
+    requirements: Set[str] = set()
     for s in signals:
+        if s.kind != "observation": # a requirement describes a TARGET, never the present state
+            requirements.add(s.signal) # preserved + visible, but NEVER turned into a capability
+            continue
         for m in by_signal.get(s.signal, []):
             if m.capability and m.capability not in caps:
                 caps[m.capability] = DetectedCapability(
@@ -97,10 +105,12 @@ def silent_intake(
     detected = [caps[k] for k in sorted(caps)]
     product_facts = [facts[k] for k in sorted(facts)]
+    requirements_seen = sorted(requirements)
     summary = (
-        "Stille Vorbefüllung: %d Fähigkeit(en) automatisch erkannt, %d Produktfakt(en), %d Nachweis(e) bereits vorhanden."
-        % (len(detected), len(product_facts), len(evidence))
+        "Stille Vorbefüllung: %d Fähigkeit(en) automatisch erkannt, %d Produktfakt(en), %d Nachweis(e) "
+        "bereits vorhanden, %d Anforderung(en) erkannt (nicht als vorhanden gewertet)."
+        % (len(detected), len(product_facts), len(evidence), len(requirements_seen))
     )
     return SilentIntakeResult(
         detected_capabilities=detected, product_facts=product_facts,
-        evidence_found=sorted(evidence), summary=summary)
+        evidence_found=sorted(evidence), requirements_seen=requirements_seen, summary=summary)