Commit Graph

2 Commits

Author SHA1 Message Date
Benjamin Admin 93687a32fe docs(licenses): freeze 3-rule license mapping + audit script
Defines the authoritative mapping from license_type to license_rule
in docs/LICENSE_RULES.md, and adds scripts/audit_license_classification.py
to surface classification gaps in registry/canonical_controls/Qdrant.

Key finding from first audit run against bp-core-postgres + Qdrant:

- regulation_registry: 232 rows, 224 rule=1, 8 rule=2, 0 rule=3;
  36 rows without license_type (need backfill)
- canonical_controls: 314,811 rows, 279,384 (89%) have NULL
  license_rule (target of Task #22 reclassification)
- Qdrant atomic_controls_dedup: 100% of sampled points lack both
  license and license_rule payload fields
- Qdrant bp_compliance_gesetze: 80.6% lack both fields
- Qdrant bp_compliance_ce + bp_compliance: nearly clean

Rule definitions clarified (was loosely remembered as
"law / cite / rewrite"):
- Rule 1 = verbatim, sovereign law (EU/DE/AT/CH/US, TRBS/TRGS/ASR,
  OSHA, NIST, EU guidelines, DGUV UVV)
- Rule 2 = verbatim with attribution (CC-BY, Apache, OWASP,
  OECD AI Principles, ENISA)
- Rule 3 = identifier citation only, no full text (DIN/EN/ISO,
  ANSI/UL/IEC, DGUV Regeln/Informationen/Grundsaetze, BSI,
  proprietary standards). Pipeline drops chunk_text when rule=3
  in pipeline_adapter.py:147.

The 4th category I had proposed ("R1-A") turned out to be already
implemented as rule=2; the mapping doc reflects the actual code
behaviour rather than the original 3-name verbal model.

No schema change. No data migration in this commit — reclassification
of the 279k controls is staged as Task #22 and will be cluster-based
by source/regulation_id.
2026-05-21 11:29:38 +02:00
Benjamin Admin 9783657da3 feat(control-pipeline): incremental dedup + ENISA CRA ingestion
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 37s
BatchDedup since-Parameter (services/batch_dedup_runner.py + api):
- Neuer 'since: datetime' Param scoped Phase 1 + Phase 2 SQL auf created_at >= since.
- Phase 2 checkpoint wird beim scoped Lauf geloescht (verhindert Skip neuer Atomics
  deren control_id alphabetisch unter dem stale last_id liegt).
- 6-13x schneller fuer nachgeschobene Dokumente (19k statt 172k Atomics).
- Doku: control-pipeline/docs/incremental-dedup.md.

Neue Scripts:
- gpre1_object_groups_incremental.py: Append neuer Objects an object_groups via
  bge-m3 nearest-neighbor (threshold default 0.85, empfehlbar 0.78 fuer breiteres
  Synonym-Matching). Pure INSERT/UPDATE, kein DELETE.
- gpre2_master_controls_incremental.py: Non-destructive Master-Controls-Update.
  Existing MCs unangetastet (UUIDs + master_control_id bleiben), nur neue Members
  appended + neue MCs fuer Object-Groups die jetzt min-phases erreichen.
- ingest_enisa_cra.py: Ingestion der 8 CRA-relevanten ENISA-Dokumente
  (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology,
  CVD Policies, Threat Landscape 2025). chunk_strategy=legal,
  requirement_strength=guidance|consultation_draft|evidentiary.

Quelldaten: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html
(PDFs sind .gitignore-gefiltert).

Ergebnis dieser Pipeline-Iteration:
- 1.296 neue CRA-Controls + 19.652 atomare Children
- +362 neue Master-Controls, 10.017 existing erweitert
- Total: 13.950 MCs, 620 CRA-MCs (vorher 566), 1.304 CRA-Atomics (vorher 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:21:46 +02:00