feat(knowledge-intake): classify a document + assess its impact before extraction

Phase A1. The real knowledge production is not writing — it is TARGETED UPDATING: when 20 documents
arrive, which 5 change our knowledge and which 15 are ignorable? Before the parser, Knowledge Intake
classifies a new document (no content extraction) and intersects its signals with an index of the
existing knowledge to emit a Knowledge Package (an impact analysis).

- compliance/knowledge_intake/: build_knowledge_index(patterns, playbooks, reference_scenarios,
  obligation_index) + assess_document_impact(descriptor, index) -> KnowledgePackage. Deterministic,
  NO content extraction, NO LLM. Surfaces affected capabilities / playbooks / transition patterns /
  reference scenarios / (injected) obligations, whether it is a new domain, and a triage level
  (HIGH / LOW / NONE / NEW_DOMAIN) with a recommendation.
- ADR-006: Knowledge Intake = classify + impact before extraction; full factory Intake -> Package ->
  Parser -> Draft -> Review -> Published; phase order A1 Intake / A2 Draft / A3 Review.
- reference suite: "Knowledge Intake" section triages 3 example documents (CRA SBOM-FAQ -> high,
  14C/2PB/3RTS/2Obl; environmental guidance -> new_domain; marketing blog -> ignorable). Section
  lives in _helpers.py to keep generate.py under the 500-LOC budget.
- Honest known refinement surfaced by intake: regulation-ID normalization (CRA vs Cyber Resilience Act).

10 intake tests (60 with the adjacent modules), mypy --strict clean (16 files), check-loc 0.
Product code with no app caller + ADR/reference = non-runtime -> no deploy (ADR-001). Freeze-safe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-06-27 13:58:59 +02:00
parent d51bcd77c7
commit 07e392913f
8 changed files with 419 additions and 2 deletions
@@ -59,3 +59,48 @@ def unsupported_block(rmap) -> None:
def interp_status(verdict_value: str) -> str:
return "PARTIAL" if verdict_value in ("uncertain", "unsupported") else "PASS"
def knowledge_intake_section(base_dir) -> None:
"""Render the Knowledge Intake section (kept here so generate.py stays under the LOC budget)."""
import os
import yaml
from compliance.knowledge_intake import (
DocumentDescriptor, assess_document_impact, build_knowledge_index,
)
def _load(sub):
d = os.path.join(base_dir, "..", "knowledge", sub)
return [yaml.safe_load(open(os.path.join(d, f), encoding="utf-8"))
for f in sorted(os.listdir(d)) if f.endswith(".yaml")]
idx = build_knowledge_index(
_load("transition_patterns"), _load("implementation_playbooks"),
_load("reference_transition_scenarios"), obligation_index={"CRA": ["cra_obl_1", "cra_obl_2"]})
docs = [
DocumentDescriptor(document_id="ENISA CRA SBOM-FAQ", regulations=["CRA"], keywords=["sbom", "vulnerability"], document_type="faq"),
DocumentDescriptor(document_id="EU Umwelt-Leitfaden", regulations=["UmweltVO"], keywords=["wastewater"], document_type="guidance"),
DocumentDescriptor(document_id="Marketing-Blog", keywords=["newsletter"], document_type="blog"),
]
w("## Knowledge Intake — Impact zuerst, Extraktion später")
w("")
w('_Vor dem Parser: ein neues Dokument NUR einordnen und seinen Impact auf den bestehenden Wissensbestand bestimmen. „Von N Dokumenten verändern wenige tatsächlich unser Wissen." Deterministisch, keine Extraktion, kein LLM._')
w("")
w("| Dokument | Impact | betrifft | Empfehlung |")
w("|---|---|---|---|")
for d in docs:
kp = assess_document_impact(d, idx)
touch = "neue Domäne" if kp.new_domain else "%d%dPB·%dRTS·%dObl" % (
len(kp.affected_capabilities), len(kp.affected_playbooks),
len(kp.affected_reference_scenarios), len(kp.affected_obligations))
w("| %s | **%s** | %s | %s |" % (d.document_id, kp.impact_level.value, touch, kp.recommendation.split("")[0]))
w("")
w("**Beispiel-Knowledge-Package** (`%s`): %s" % (docs[0].document_id, assess_document_impact(docs[0], idx).impact_summary))
w("")
w('_So entsteht bei jedem neuen Dokument eine Impact-Analyse statt „200 Seiten PDF" — Targeted Updating statt Schreiben._')
w("")
coverage_table([
("Knowledge Intake (Klassifikation+Impact)", "PASS", "%d Regelwerke / %d Capabilities im Index" % (len(idx.regulations), len(idx.capability_regulations))),
("Impact-Triage (HIGH/LOW/NONE/new_domain)", "PASS", "3 Beispiel-Dokumente korrekt eingeordnet"),
("Regelwerk-ID-Normalisierung", "TODO", "CRA vs Cyber Resilience Act vereinheitlichen"),
])
@@ -46,6 +46,7 @@ import yaml
from _helpers import ( # noqa: E402 (script-dir module; keeps generate.py under the LOC budget)
OUT, ROLLUP, Row, w, coverage_table, reg_map_block, unsupported_block, interp_status,
knowledge_intake_section,
)
ISO_MAP = {"ISO27001": CapabilityMappingEntry(
@@ -463,6 +464,8 @@ coverage_table([
("Draft-Generatoren neue Domänen (Phase A)", "TODO", "Transition-/Reference-Scenario-Drafts"),
])
knowledge_intake_section(os.path.dirname(__file__)) # Knowledge Intake (impact triage) — kept in _helpers for LOC
# ── Epics + roll-up ───────────────────────────────────────────────────────
w("## Gaps → Epics (Backlog — nur erfasst, NICHT implementiert)")
w("")
@@ -318,6 +318,28 @@ _So reviewt der Experte 12 Entwürfe statt 12 Playbooks zu schreiben. Derselbe G
| Provenance + TODO + Freigabestatus | **PASS** | draft_generated→reviewed→validated→proven |
| Draft-Generatoren neue Domänen (Phase A) | **TODO** | Transition-/Reference-Scenario-Drafts |
## Knowledge Intake — Impact zuerst, Extraktion später
_Vor dem Parser: ein neues Dokument NUR einordnen und seinen Impact auf den bestehenden Wissensbestand bestimmen. „Von N Dokumenten verändern wenige tatsächlich unser Wissen." Deterministisch, keine Extraktion, kein LLM._
| Dokument | Impact | betrifft | Empfehlung |
|---|---|---|---|
| ENISA CRA SBOM-FAQ | **high** | 14C·2PB·3RTS·2Obl | Gezielter Review priorisieren |
| EU Umwelt-Leitfaden | **new_domain** | neue Domäne | Neue Domäne |
| Marketing-Blog | **none** | 0C·0PB·0RTS·0Obl | Wahrscheinlich ignorierbar |
**Beispiel-Knowledge-Package** (`ENISA CRA SBOM-FAQ`): Betrifft 14 Capabilities, 2 Playbooks, 0 Patterns, 3 Reference Scenarios, 2 Obligations; keine neue Domäne.
_So entsteht bei jedem neuen Dokument eine Impact-Analyse statt „200 Seiten PDF" — Targeted Updating statt Schreiben._
**Architecture Coverage**
| Layer | Status | Hinweis |
|---|---|---|
| Knowledge Intake (Klassifikation+Impact) | **PASS** | 6 Regelwerke / 32 Capabilities im Index |
| Impact-Triage (HIGH/LOW/NONE/new_domain) | **PASS** | 3 Beispiel-Dokumente korrekt eingeordnet |
| Regelwerk-ID-Normalisierung | **TODO** | CRA vs Cyber Resilience Act vereinheitlichen |
## Gaps → Epics (Backlog — nur erfasst, NICHT implementiert)
| Epic | Titel | schliesst Coverage-Luecke |
@@ -329,6 +351,6 @@ _So reviewt der Experte 12 Entwürfe statt 12 Playbooks zu schreiben. Derselbe G
## Suite-Status (Roll-up)
- Coverage-Zellen gesamt: **38**
- PASS: **28** · PARTIAL: 3 · UNSUPPORTED: 1 · TODO: 5 · N/A: 1 · NEEDS_FACTS: 0
- Coverage-Zellen gesamt: **41**
- PASS: **30** · PARTIAL: 3 · UNSUPPORTED: 1 · TODO: 6 · N/A: 1 · NEEDS_FACTS: 0
- Fortschritt = PASS-Anteil steigt, wenn Epics RS-001…004 landen (objektiver Maßstab, kein LOC).