Files
breakpilot-compliance/backend-compliance/compliance/knowledge_intake/schemas.py
T
Benjamin Admin 07e392913f feat(knowledge-intake): classify a document + assess its impact before extraction
Phase A1. The real knowledge production is not writing — it is TARGETED UPDATING: when 20 documents
arrive, which 5 change our knowledge and which 15 are ignorable? Before the parser, Knowledge Intake
classifies a new document (no content extraction) and intersects its signals with an index of the
existing knowledge to emit a Knowledge Package (an impact analysis).

- compliance/knowledge_intake/: build_knowledge_index(patterns, playbooks, reference_scenarios,
  obligation_index) + assess_document_impact(descriptor, index) -> KnowledgePackage. Deterministic,
  NO content extraction, NO LLM. Surfaces affected capabilities / playbooks / transition patterns /
  reference scenarios / (injected) obligations, whether it is a new domain, and a triage level
  (HIGH / LOW / NONE / NEW_DOMAIN) with a recommendation.
- ADR-006: Knowledge Intake = classify + impact before extraction; full factory Intake -> Package ->
  Parser -> Draft -> Review -> Published; phase order A1 Intake / A2 Draft / A3 Review.
- reference suite: "Knowledge Intake" section triages 3 example documents (CRA SBOM-FAQ -> high,
  14C/2PB/3RTS/2Obl; environmental guidance -> new_domain; marketing blog -> ignorable). Section
  lives in _helpers.py to keep generate.py under the 500-LOC budget.
- Honest known refinement surfaced by intake: regulation-ID normalization (CRA vs Cyber Resilience Act).

10 intake tests (60 with the adjacent modules), mypy --strict clean (16 files), check-loc 0.
Product code with no app caller + ADR/reference = non-runtime -> no deploy (ADR-001). Freeze-safe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 13:58:59 +02:00

63 lines
3.2 KiB
Python

"""Schemas for Knowledge Intake — classify a new document and assess its IMPACT (no extraction yet).
Before the parser/draft stages, Intake answers „welche Teile unseres Wissensbestands sind überhaupt
betroffen?". It does NOT extract content — it only classifies the document and intersects its signals
with an index of the existing knowledge (capabilities, playbooks, transition patterns, reference
scenarios, injected obligations) to emit a `KnowledgePackage` (an impact analysis). Deterministic,
computed-not-stored, no new corpus, no new meta-model class (freeze v1.0). Python 3.9 compatible.
"""
from __future__ import annotations
from enum import Enum
from typing import Dict, List
from pydantic import BaseModel, Field
class ImpactLevel(str, Enum):
NONE = "none" # touches nothing known -> likely ignorable
LOW = "low" # touches a little -> targeted review
HIGH = "high" # touches a lot -> prioritise review
NEW_DOMAIN = "new_domain" # references only unknown regulations -> domain intake
class DocumentDescriptor(BaseModel):
"""Lightweight signals of an incoming document — NO content body, only classification inputs."""
document_id: str
title: str = ""
source: str = "" # e.g. BSI, ENISA, EU
document_type: str = "" # e.g. guidance, faq, regulation, recommendation
regulations: List[str] = Field(default_factory=list) # declared regulations it references
keywords: List[str] = Field(default_factory=list) # lightweight topic signals (e.g. sbom)
product_types: List[str] = Field(default_factory=list)
class KnowledgeIndex(BaseModel):
"""A deterministic index of the EXISTING knowledge to match an incoming document against."""
regulations: List[str] = Field(default_factory=list) # all regulations the corpus knows
capability_regulations: Dict[str, List[str]] = Field(default_factory=dict) # capability -> covers_targets
playbook_capabilities: List[str] = Field(default_factory=list) # capabilities that HAVE a playbook
transition_patterns: Dict[str, List[str]] = Field(default_factory=dict) # pattern_id -> target regulations
reference_scenarios: Dict[str, List[str]] = Field(default_factory=dict) # rts_id -> regulations
obligation_index: Dict[str, List[str]] = Field(default_factory=dict) # regulation -> obligation ids (INJECTED)
class KnowledgePackage(BaseModel):
"""The impact analysis for one document — what of our knowledge it probably touches, and how much."""
document_id: str
classification: Dict[str, List[str]] = Field(default_factory=dict) # echoed regulations/keywords/types
new_domain: bool = False
unknown_regulations: List[str] = Field(default_factory=list)
affected_capabilities: List[str] = Field(default_factory=list)
affected_playbooks: List[str] = Field(default_factory=list)
affected_transition_patterns: List[str] = Field(default_factory=list)
affected_reference_scenarios: List[str] = Field(default_factory=list)
affected_obligations: List[str] = Field(default_factory=list)
impact_level: ImpactLevel = ImpactLevel.NONE
impact_summary: str = ""
recommendation: str = ""