feat(knowledge-intake): classify a document + assess its impact before extraction
Phase A1. The real knowledge production is not writing — it is TARGETED UPDATING: when 20 documents arrive, which 5 change our knowledge and which 15 are ignorable? Before the parser, Knowledge Intake classifies a new document (no content extraction) and intersects its signals with an index of the existing knowledge to emit a Knowledge Package (an impact analysis). - compliance/knowledge_intake/: build_knowledge_index(patterns, playbooks, reference_scenarios, obligation_index) + assess_document_impact(descriptor, index) -> KnowledgePackage. Deterministic, NO content extraction, NO LLM. Surfaces affected capabilities / playbooks / transition patterns / reference scenarios / (injected) obligations, whether it is a new domain, and a triage level (HIGH / LOW / NONE / NEW_DOMAIN) with a recommendation. - ADR-006: Knowledge Intake = classify + impact before extraction; full factory Intake -> Package -> Parser -> Draft -> Review -> Published; phase order A1 Intake / A2 Draft / A3 Review. - reference suite: "Knowledge Intake" section triages 3 example documents (CRA SBOM-FAQ -> high, 14C/2PB/3RTS/2Obl; environmental guidance -> new_domain; marketing blog -> ignorable). Section lives in _helpers.py to keep generate.py under the 500-LOC budget. - Honest known refinement surfaced by intake: regulation-ID normalization (CRA vs Cyber Resilience Act). 10 intake tests (60 with the adjacent modules), mypy --strict clean (16 files), check-loc 0. Product code with no app caller + ADR/reference = non-runtime -> no deploy (ADR-001). Freeze-safe. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -318,6 +318,28 @@ _So reviewt der Experte 12 Entwürfe statt 12 Playbooks zu schreiben. Derselbe G
|
||||
| Provenance + TODO + Freigabestatus | **PASS** | draft_generated→reviewed→validated→proven |
|
||||
| Draft-Generatoren neue Domänen (Phase A) | **TODO** | Transition-/Reference-Scenario-Drafts |
|
||||
|
||||
## Knowledge Intake — Impact zuerst, Extraktion später
|
||||
|
||||
_Vor dem Parser: ein neues Dokument NUR einordnen und seinen Impact auf den bestehenden Wissensbestand bestimmen. „Von N Dokumenten verändern wenige tatsächlich unser Wissen." Deterministisch, keine Extraktion, kein LLM._
|
||||
|
||||
| Dokument | Impact | betrifft | Empfehlung |
|
||||
|---|---|---|---|
|
||||
| ENISA CRA SBOM-FAQ | **high** | 14C·2PB·3RTS·2Obl | Gezielter Review priorisieren |
|
||||
| EU Umwelt-Leitfaden | **new_domain** | neue Domäne | Neue Domäne |
|
||||
| Marketing-Blog | **none** | 0C·0PB·0RTS·0Obl | Wahrscheinlich ignorierbar |
|
||||
|
||||
**Beispiel-Knowledge-Package** (`ENISA CRA SBOM-FAQ`): Betrifft 14 Capabilities, 2 Playbooks, 0 Patterns, 3 Reference Scenarios, 2 Obligations; keine neue Domäne.
|
||||
|
||||
_So entsteht bei jedem neuen Dokument eine Impact-Analyse statt „200 Seiten PDF" — Targeted Updating statt Schreiben._
|
||||
|
||||
**Architecture Coverage**
|
||||
|
||||
| Layer | Status | Hinweis |
|
||||
|---|---|---|
|
||||
| Knowledge Intake (Klassifikation+Impact) | **PASS** | 6 Regelwerke / 32 Capabilities im Index |
|
||||
| Impact-Triage (HIGH/LOW/NONE/new_domain) | **PASS** | 3 Beispiel-Dokumente korrekt eingeordnet |
|
||||
| Regelwerk-ID-Normalisierung | **TODO** | CRA vs Cyber Resilience Act vereinheitlichen |
|
||||
|
||||
## Gaps → Epics (Backlog — nur erfasst, NICHT implementiert)
|
||||
|
||||
| Epic | Titel | schliesst Coverage-Luecke |
|
||||
@@ -329,6 +351,6 @@ _So reviewt der Experte 12 Entwürfe statt 12 Playbooks zu schreiben. Derselbe G
|
||||
|
||||
## Suite-Status (Roll-up)
|
||||
|
||||
- Coverage-Zellen gesamt: **38**
|
||||
- PASS: **28** · PARTIAL: 3 · UNSUPPORTED: 1 · TODO: 5 · N/A: 1 · NEEDS_FACTS: 0
|
||||
- Coverage-Zellen gesamt: **41**
|
||||
- PASS: **30** · PARTIAL: 3 · UNSUPPORTED: 1 · TODO: 6 · N/A: 1 · NEEDS_FACTS: 0
|
||||
- Fortschritt = PASS-Anteil steigt, wenn Epics RS-001…004 landen (objektiver Maßstab, kein LOC).
|
||||
|
||||
Reference in New Issue
Block a user