"""Signal Producer interface + Normalizer — one signal language, but TWO signal KINDS. The platform already HAS scanners (website, repo/code, SBOM, security headers, TLS, SPF/DKIM/DMARC, document analysis, RAG over uploads, product classification). The Silent Pass does not want a WebsiteScanner or a RepoScanner — it wants their UNIFIED output. So every source (a scanner, a PDF parser, a tender parser, an OEM spec, an API, or the user) emits the SAME `ProducedSignal` {signal_id, source_type, kind, confidence, evidence, provenance}, and `normalize_signals` reduces producer-specific ids to ONE canonical signal via a vocabulary (id + aliases + kind) — exactly the Requirement-Source / MCAP / regulation-alias pattern. The Silent Pass then never gets per-scanner logic. CRITICAL — a signal is one of two KINDS, and they NEVER substitute for each other: observation = "I SAW X" — a repo with an SBOM, a published security.txt, a risk-assessment PDF. requirement = "someone DEMANDS X" — a tender clause `requires_sbom`, an OEM spec `supplier_requires_psirt`. A demanded SBOM is NOT a present SBOM. `kind` is carried on the canonical VOCABULARY entry (authoritative), so even a mislabelled producer signal cannot collapse the two. The Silent Pass consumes ONLY observations; requirement signals are preserved and feed the required-set / prioritisation later. This Observation-vs- Requirement split is the very one the Requirements Verification Platform rests on: Observations (reality) vs Requirements (targets); their comparison IS the delta. Pure, deterministic, no I/O. Python 3.9 compatible. """ from __future__ import annotations from typing import Dict, List, Optional, Sequence from pydantic import BaseModel, Field from .silent_intake import IntakeSignal class ProducedSignal(BaseModel): """What ANY signal producer emits — the common interface every source agrees on.""" signal_id: str # raw or canonical id the producer used source_type: str = "" # website / repository / document / product / tender / oem / user / api kind: str = "" # "observation" | "requirement"; empty -> resolved from the vocabulary confidence: float = 1.0 evidence: Optional[str] = None # the artifact found (already in hand) provenance: str = "" # url / filename / tender clause / "customer statement" class SignalVocabularyEntry(BaseModel): """One canonical signal + its aliases + its KIND (the authoritative observation/requirement label).""" id: str kind: str = "observation" # "observation" (I saw X) | "requirement" (someone DEMANDS X) aliases: List[str] = Field(default_factory=list) def normalize_signals( produced: Sequence[ProducedSignal], vocabulary: Sequence[SignalVocabularyEntry] ) -> List[IntakeSignal]: """Reduce heterogeneous producer signals to the canonical IntakeSignal stream (alias resolution). The canonical vocabulary entry's `kind` is AUTHORITATIVE — a producer cannot relabel a requirement as an observation (that is what stops a demanded SBOM from masquerading as a present one). Unknown signal ids pass through unchanged (a new producer's signal stays visible, not silently dropped) and keep the producer-declared kind (default observation). Deterministic; carries confidence/evidence/provenance. """ alias: Dict[str, str] = {} kind_of: Dict[str, str] = {} for v in vocabulary: alias[v.id] = v.id kind_of[v.id] = v.kind for a in v.aliases: alias[a] = v.id out: List[IntakeSignal] = [] for p in produced: canonical = alias.get(p.signal_id, p.signal_id) kind = kind_of.get(canonical) or p.kind or "observation" out.append(IntakeSignal( source=p.source_type, signal=canonical, kind=kind, confidence=p.confidence, evidence=p.evidence, provenance=p.provenance)) return out