breakpilot-compliance/backend-compliance/compliance/onboarding/silent_intake.py
Benjamin Admin c39787ad96 fix(onboarding): separate observation vs requirement signals — a demanded SBOM is not a present SBOM
Semantic correction of the knowledge base BEFORE the empirical loop (#59) is built — otherwise the
Observation Store would learn from already-misclassified signals. The Silent Pass conflated two kinds of
signal into one: an OBSERVATION ("I saw an SBOM in the repo") and a REQUIREMENT ("a tender DEMANDS an
SBOM"). They were aliased to the same canonical id, so a tender clause read as "SBOM already present" and
suppressed the very question that should have been asked.

Fix — make the kind explicit and authoritative (no new architecture, data + thin wiring):
  - `kind` ∈ {observation, requirement} on ProducedSignal (producer may declare) and on the canonical
    SignalVocabularyEntry (AUTHORITATIVE — a mislabelled producer cannot collapse the two).
  - Vocabulary split: sbom_file_found → sbom_present (obs) + sbom_required (req);
    security_txt_or_cvd_policy → cvd_policy_present (obs) + psirt_required (req); add signed_updates_required.
    Requirement signals are intentionally UNMAPPED in intake_signal_map (they describe a target, not state).
  - silent_intake() consumes ONLY kind==observation; requirement signals are preserved in
    `requirements_seen` (visible/auditable) but NEVER become a detected capability.
  - normalize_signals() stamps the vocabulary's kind onto every IntakeSignal; unknown ids still pass through.
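
The "vocabulary is authoritative" rule from the list above can be sketched as follows. This is a minimal illustration, not the code in signals.py: `VocabEntry`, `Produced`, `VOCAB` and `stamp_kind` are hypothetical stand-ins, and the assumption that `cyclonedx_found` aliases to `sbom_present` and `requires_sbom` to `sbom_required` is inferred from the ids named in this commit message.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class VocabEntry:  # hypothetical stand-in for SignalVocabularyEntry
    canonical: str
    kind: str  # "observation" | "requirement"


@dataclass(frozen=True)
class Produced:  # hypothetical stand-in for ProducedSignal
    signal: str
    source: str
    kind: str = "observation"  # the producer may declare a kind, but it is not trusted


# Assumed vocabulary: producer-level id -> canonical entry carrying the authoritative kind.
VOCAB: Dict[str, VocabEntry] = {
    "cyclonedx_found": VocabEntry("sbom_present", "observation"),
    "requires_sbom": VocabEntry("sbom_required", "requirement"),
}


def stamp_kind(produced: List[Produced]) -> List[Produced]:
    """Canonicalise ids and overwrite the producer's kind with the vocabulary's kind."""
    out: List[Produced] = []
    for p in produced:
        entry = VOCAB.get(p.signal)
        if entry is None:
            out.append(p)  # unknown ids pass through unchanged
        else:
            out.append(replace(p, signal=entry.canonical, kind=entry.kind))
    return out


signals = stamp_kind([
    Produced("cyclonedx_found", "repository"),
    Produced("requires_sbom", "tender", kind="observation"),  # mislabelled by the producer
    Produced("some_unknown_id", "website"),
])
for s in signals:
    print(s.signal, s.kind)
```

In this sketch the mislabelled tender signal comes out as `sbom_required` with kind `requirement`, so a downstream consumer that filters on `kind == "observation"` would not treat it as present.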

This is the same Observation-vs-Requirement split the Requirements Verification Platform rests on:
observations are reality, requirements are targets, and their comparison is the delta. A tender / OEM spec /
law now produces requirement signals; scanners / repos / documents produce observation signals.
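
As a rough illustration of "their comparison is the delta", the sketch below treats the delta as the set of requirements that no observation satisfies. The `SATISFIED_BY` pairing and the `open_requirements` helper are assumptions made for this example; the commit does not say how the platform pairs requirement ids with observation ids.

```python
from typing import Dict, List, Set

# Assumed pairing: requirement id -> the observation id that would satisfy it.
SATISFIED_BY: Dict[str, str] = {
    "sbom_required": "sbom_present",
}


def open_requirements(requirements: Set[str], observations: Set[str]) -> List[str]:
    """Return the requirements that remain open: targets minus observed reality."""
    return sorted(
        r for r in requirements
        if SATISFIED_BY.get(r) not in observations  # unpaired requirements stay open
    )


print(open_requirements({"sbom_required", "signed_updates_required"}, {"sbom_present"}))
# -> ['signed_updates_required']
print(open_requirements({"sbom_required"}, set()))
# -> ['sbom_required']
```

The second call is the case this commit fixes: a tender that demands an SBOM, with no SBOM observed, leaves the requirement open instead of closing it.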

Tests: rewrote the two test_signal_producer cases that previously ASSERTED the bug (tender == repo) to pin
the correct split; regression — `requires_sbom` yields no capability + stays in requirements_seen while
`cyclonedx_found` still detects sbom_creation; endpoint-level regression that a tender requirement does not
auto-detect and the gap stays asked; vocabulary-kind-overrides-mislabelled-producer. 25 onboarding tests
pass, mypy --strict clean, demo runs, check-loc 0. Runtime effect → deploy + smoke. (Fix A; partial-vs-
detected decoupling follows as Fix B before #59.)
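
The regression described above might look roughly like this. It is a self-contained sketch rather than the actual test file: `intake` is a reduced stand-in for `silent_intake()` operating on `(canonical_id, kind)` pairs, and the `sbom_present -> sbom_creation` mapping is assumed from the ids mentioned in this message.

```python
from typing import Dict, List, Set, Tuple

# Assumed curated map: canonical observation id -> capability it evidences.
SIGNAL_MAP: Dict[str, str] = {"sbom_present": "sbom_creation"}


def intake(signals: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Reduced stand-in for silent_intake(): returns (capabilities, requirements_seen)."""
    caps: Set[str] = set()
    reqs: Set[str] = set()
    for sig, kind in signals:
        if kind != "observation":
            reqs.add(sig)  # preserved for audit, never turned into a capability
            continue
        if sig in SIGNAL_MAP:
            caps.add(SIGNAL_MAP[sig])
    return sorted(caps), sorted(reqs)


def test_tender_requirement_is_not_a_detected_capability() -> None:
    caps, reqs = intake([("sbom_required", "requirement")])
    assert caps == []
    assert reqs == ["sbom_required"]


def test_repo_observation_still_detects_sbom_creation() -> None:
    caps, reqs = intake([("sbom_present", "observation")])
    assert caps == ["sbom_creation"]
    assert reqs == []


test_tender_requirement_is_not_a_detected_capability()
test_repo_observation_still_detects_sbom_creation()
```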
2026-06-28 15:52:50 +02:00

117 lines
5.9 KiB
Python

"""Silent Knowledge Pass — recognise everything possible BEFORE asking a single question (Phase 0).
The Advisor can say "I need 5 answers" but does not yet decide WHAT it can find out by itself. The Silent
Pass runs first: from signals that existing scanners/parsers already produce (website, repository,
documents, product data) it deterministically derives capabilities the company demonstrably HAS and
product facts that drive scope — so every recognised item shrinks the delta and removes a question.
The customer then experiences "we already recognised 11 of 17 — only these 4 remain" instead of a
question wall. This is NOT new architecture: it is one orchestration step in front of the Advisor:
    Company -> Silent Intake -> Company Profile -> Hypotheses -> Delta -> Top Questions
All building blocks already exist. SIGNALS are INJECTED (the scanners produce them); the signal->capability
map is curated DATA, also injected. Pure, deterministic, no I/O. Python 3.9 compatible.
"""
from __future__ import annotations

from typing import Dict, List, Optional, Sequence, Set

from pydantic import BaseModel, Field

class IntakeSignal(BaseModel):
    """A CANONICAL signal the Silent Pass consumes. Producer-agnostic: the same `signal` may have come
    from a website, a repo, a PDF, a tender or the user; normalize_signals() unified them (see signals.py)."""

    source: str  # source_type: website / repository / document / product / tender / user
    signal: str  # CANONICAL signal id, e.g. "sbom_present"
    kind: str = "observation"  # "observation" (I saw X) | "requirement" (someone DEMANDS X)
    confidence: float = 1.0  # carried from the producer
    evidence: Optional[str] = None  # the artifact already in hand
    provenance: str = ""  # where it came from (url / filename / tender clause); audit trail
    detail: str = ""  # free-text (kept for back-compat)

class SignalMapping(BaseModel):
    """Curated: what a signal lets us conclude. A signal yields a capability OR a product fact."""

    signal: str
    capability: Optional[str] = None  # capability the signal evidences
    relationship: str = "detected"  # detected (concrete artifact) / partial (indicative)
    evidence: Optional[str] = None  # the artifact found (already in hand -> no upload needed)
    product_fact: Optional[str] = None  # e.g. "connected_to_internet"
    fact_value: str = "true"

class DetectedCapability(BaseModel):
    capability: str
    relationship: str = "detected"
    source: str = ""  # which signal/source detected it (audit trail)
    evidence: Optional[str] = None
    confidence: float = 1.0  # carried from the producing signal
    provenance: str = ""  # where the signal came from

class ProductFact(BaseModel):
    key: str
    value: str = "true"
    source: str = ""

class SilentIntakeResult(BaseModel):
    detected_capabilities: List[DetectedCapability] = Field(default_factory=list)
    product_facts: List[ProductFact] = Field(default_factory=list)
    evidence_found: List[str] = Field(default_factory=list)
    requirements_seen: List[str] = Field(default_factory=list)  # requirement-kind signals: preserved, NOT present
    summary: str = ""

    def capability_ids(self) -> List[str]:
        """The detected capability ids, fed into the Advisor as already-present (delta-reducing).
        ONLY observation-kind signals reach here (requirements never become a present capability)."""
        return sorted({d.capability for d in self.detected_capabilities})

def silent_intake(
    signals: Sequence[IntakeSignal], signal_map: Sequence[SignalMapping]
) -> SilentIntakeResult:
    """Derive capabilities + product facts from injected scanner signals (deterministic, no questions).
    Each signal is matched to curated mappings by `signal` id; a mapping contributes either a detected
    capability (+ optional evidence already in hand) or a product fact. Deduped, deterministic order.
    """
    # Index the curated mappings by signal id so each signal is a single lookup.
    by_signal: Dict[str, List[SignalMapping]] = {}
    for m in signal_map:
        by_signal.setdefault(m.signal, []).append(m)

    caps: Dict[str, DetectedCapability] = {}
    facts: Dict[str, ProductFact] = {}
    evidence: Set[str] = set()
    requirements: Set[str] = set()
    for s in signals:
        if s.kind != "observation":  # a requirement describes a TARGET, never the present state
            requirements.add(s.signal)  # preserved + visible, but NEVER turned into a capability
            continue
        for m in by_signal.get(s.signal, []):
            # The first signal to evidence a capability wins; later ones do not overwrite it.
            if m.capability and m.capability not in caps:
                caps[m.capability] = DetectedCapability(
                    capability=m.capability, relationship=m.relationship,
                    source="%s:%s" % (s.source, s.signal), evidence=m.evidence,
                    confidence=s.confidence, provenance=s.provenance)
                if m.evidence:
                    evidence.add(m.evidence)
            if m.product_fact:
                facts[m.product_fact] = ProductFact(key=m.product_fact, value=m.fact_value, source=s.source)

    # Sort by key so the result does not depend on dict/set iteration order.
    detected = [caps[k] for k in sorted(caps)]
    product_facts = [facts[k] for k in sorted(facts)]
    requirements_seen = sorted(requirements)
    # German UI summary. English: "Silent pre-fill: N capability(ies) detected automatically, N product
    # fact(s), N evidence item(s) already available, N requirement(s) recognised (not counted as present)."
    summary = (
        "Stille Vorbefüllung: %d Fähigkeit(en) automatisch erkannt, %d Produktfakt(en), %d Nachweis(e) "
        "bereits vorhanden, %d Anforderung(en) erkannt (nicht als vorhanden gewertet)."
        % (len(detected), len(product_facts), len(evidence), len(requirements_seen))
    )
    return SilentIntakeResult(
        detected_capabilities=detected, product_facts=product_facts,
        evidence_found=sorted(evidence), requirements_seen=requirements_seen, summary=summary)
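
One consequence of the `m.capability not in caps` guard is that, when two observation signals evidence the same capability, the one that appears first in the input keeps its confidence and provenance. The sketch below isolates that behaviour. `Sig`, `CAP_FOR`, `first_wins` and the `sbom_in_ci` id are hypothetical and exist only for this illustration; the real function works on `IntakeSignal` and `SignalMapping`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sig:  # simplified stand-in for IntakeSignal
    source: str
    signal: str
    confidence: float = 1.0


# Hypothetical map: two observation ids that evidence the same capability.
CAP_FOR: Dict[str, str] = {"sbom_present": "sbom_creation", "sbom_in_ci": "sbom_creation"}


def first_wins(signals: List[Sig]) -> Dict[str, Sig]:
    """Keep, per capability, the first signal that evidences it (input order decides)."""
    caps: Dict[str, Sig] = {}
    for s in signals:
        cap = CAP_FOR.get(s.signal)
        if cap and cap not in caps:  # same guard shape as in silent_intake()
            caps[cap] = s
    return caps


signals = [Sig("website", "sbom_present", 0.6), Sig("repository", "sbom_in_ci", 0.95)]

as_given = first_wins(signals)
strongest_first = first_wins(sorted(signals, key=lambda s: s.confidence, reverse=True))

print(as_given["sbom_creation"].confidence)         # 0.6
print(strongest_first["sbom_creation"].confidence)  # 0.95
```

If the strongest evidence should be the one reported, a caller could pre-sort the signals by confidence as in the second call. Whether that is desirable here is a design decision this file does not settle.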