breakpilot-core

Benjamin_Boenisch/breakpilot-core

Commit Graph

Author

SHA1

Message

Date

Benjamin Admin

93687a32fe

docs(licenses): freeze 3-rule license mapping + audit script

Defines the authoritative mapping from license_type to license_rule
in docs/LICENSE_RULES.md, and adds scripts/audit_license_classification.py
to surface classification gaps in registry/canonical_controls/Qdrant.

Key findings from the first audit run against bp-core-postgres + Qdrant:

- regulation_registry: 232 rows, 224 rule=1, 8 rule=2, 0 rule=3;
  36 rows without license_type (need backfill)
- canonical_controls: 314,811 rows, 279,384 (89%) have NULL
  license_rule (target of Task #22 reclassification)
- Qdrant atomic_controls_dedup: 100% of sampled points lack both
  license and license_rule payload fields
- Qdrant bp_compliance_gesetze: 80.6% lack both fields
- Qdrant bp_compliance_ce + bp_compliance: nearly clean
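The NULL-share figures above can be reproduced with a single aggregate query per table. The following is a minimal sketch only, not the contents of scripts/audit_license_classification.py: it uses an in-memory sqlite3 database as a stand-in for bp-core-postgres, and the helper name `null_rule_share` and the demo rows are assumptions made for illustration.

```python
import sqlite3


def null_rule_share(conn, table):
    """Return (total, missing, share) for rows whose license_rule is NULL.

    The table name is interpolated into the SQL, so only pass trusted
    identifiers (hypothetical helper, not the real audit script).
    """
    total, missing = conn.execute(
        "SELECT COUNT(*), "
        "SUM(CASE WHEN license_rule IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    missing = missing or 0  # SUM() yields NULL on an empty table
    share = missing / total if total else 0.0
    return total, missing, share


# Demo data: four controls, three of them unclassified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE canonical_controls (id INTEGER, license_rule INTEGER)")
conn.executemany(
    "INSERT INTO canonical_controls VALUES (?, ?)",
    [(1, 1), (2, None), (3, None), (4, None)],
)
demo_result = null_rule_share(conn, "canonical_controls")

# The share reported in the commit message, recomputed from its own counts.
reported_percent = round(279_384 / 314_811 * 100)
```

Here `demo_result` covers the four demo rows only; `reported_percent` recomputes the rounded percentage from the two counts quoted in the message.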

Rule definitions clarified (was loosely remembered as
"law / cite / rewrite"):
- Rule 1 = verbatim, sovereign law (EU/DE/AT/CH/US, TRBS/TRGS/ASR,
  OSHA, NIST, EU guidelines, DGUV UVV)
- Rule 2 = verbatim with attribution (CC-BY, Apache, OWASP,
  OECD AI Principles, ENISA)
- Rule 3 = identifier citation only, no full text (DIN/EN/ISO,
  ANSI/UL/IEC, DGUV Regeln/Informationen/Grundsaetze, BSI,
  proprietary standards). Pipeline drops chunk_text when rule=3
  in pipeline_adapter.py:147.
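The three rules can be read as a lookup followed by one guard on the chunk text. The sketch below is illustrative only: the `license_type` keys in `LICENSE_RULES`, the function names, and the choice to set `chunk_text` to `None` are assumptions, and the authoritative mapping remains docs/LICENSE_RULES.md.

```python
from typing import Optional

# Hypothetical excerpt of a license_type -> license_rule table.
# The key spellings are invented for this example.
LICENSE_RULES = {
    "eu_law": 1,       # Rule 1: verbatim, sovereign law
    "nist": 1,
    "cc_by": 2,        # Rule 2: verbatim with attribution
    "apache_2_0": 2,
    "din_en_iso": 3,   # Rule 3: identifier citation only
    "bsi": 3,
}


def license_rule_for(license_type: Optional[str]) -> Optional[int]:
    """Return 1, 2 or 3, or None when the type is missing or unmapped."""
    if license_type is None:
        return None
    return LICENSE_RULES.get(license_type.strip().lower())


def apply_rule(chunk: dict) -> dict:
    """Return a copy of the chunk with license_rule set.

    For rule=3 the full text is dropped and only the identifier fields
    survive (assumed here to mean chunk_text becomes None).
    """
    rule = license_rule_for(chunk.get("license_type"))
    out = dict(chunk, license_rule=rule)
    if rule == 3:
        out["chunk_text"] = None
    return out


iso_chunk = apply_rule(
    {"regulation_id": "ISO-EXAMPLE", "license_type": "DIN_EN_ISO", "chunk_text": "full text"}
)
law_chunk = apply_rule(
    {"regulation_id": "LAW-EXAMPLE", "license_type": "eu_law", "chunk_text": "Art. 1 ..."}
)
```

Returning `None` for unmapped types keeps classification gaps visible to an audit instead of silently defaulting them to a rule.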

The 4th category I had proposed ("R1-A") turned out to be already
implemented as rule=2; the mapping doc reflects the actual code
behaviour rather than the original 3-name verbal model.

No schema change. No data migration in this commit — reclassification
of the 279k controls is staged as Task #22 and will be cluster-based
by source/regulation_id.

2026-05-21 11:29:38 +02:00

Benjamin Admin

9783657da3

feat(control-pipeline): incremental dedup + ENISA CRA ingestion

CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 37s

BatchDedup since parameter (services/batch_dedup_runner.py + api):
- New 'since: datetime' parameter scopes the Phase 1 and Phase 2 SQL to created_at >= since.
- The Phase 2 checkpoint is deleted on a scoped run (prevents skipping new atomics
  whose control_id sorts alphabetically below the stale last_id).
- 6-13x faster for documents added after the initial run (19k instead of 172k atomics).
- Docs: control-pipeline/docs/incremental-dedup.md.
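Why the checkpoint has to be cleared on a scoped run is easiest to see with a small in-memory model. This is a sketch under assumptions, not the code in services/batch_dedup_runner.py: `select_pending`, the row layout, and the example IDs are hypothetical, and the real runner filters in SQL rather than in Python.

```python
from datetime import datetime
from typing import Optional


def select_pending(rows, last_id: Optional[str], since: Optional[datetime]):
    """Pick the atomics a Phase 2 style pass would process.

    rows    -- dicts with 'control_id' and 'created_at'
    last_id -- resume checkpoint: only IDs sorting after it are processed
    since   -- optional scope: only rows with created_at >= since
    """
    if since is not None:
        # Scoped run: discard the stale checkpoint, otherwise new rows whose
        # control_id sorts below last_id would never be visited.
        last_id = None
    picked = [
        r for r in rows
        if (since is None or r["created_at"] >= since)
        and (last_id is None or r["control_id"] > last_id)
    ]
    return sorted(picked, key=lambda r: r["control_id"])


rows = [
    {"control_id": "AAA-1", "created_at": datetime(2026, 5, 17)},  # newly ingested
    {"control_id": "ZZZ-9", "created_at": datetime(2026, 1, 10)},  # processed earlier
]

# A resumed full run with the stale checkpoint finds nothing to do.
unscoped = select_pending(rows, last_id="ZZZ-9", since=None)

# A scoped run ignores the checkpoint and picks up only the new atomic.
scoped = select_pending(rows, last_id="ZZZ-9", since=datetime(2026, 5, 1))
scoped_ids = [r["control_id"] for r in scoped]
```

In this model the unscoped call returns an empty list even though `AAA-1` is new, which is the skip the commit message describes.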

New scripts:
- gpre1_object_groups_incremental.py: appends new objects to object_groups via
  bge-m3 nearest-neighbor (default threshold 0.85; 0.78 is advisable for broader
  synonym matching). Pure INSERT/UPDATE, no DELETE.
- gpre2_master_controls_incremental.py: non-destructive master-controls update.
  Existing MCs are left untouched (UUIDs and master_control_id are kept); only new
  members are appended, plus new MCs for object groups that now reach min-phases.
- ingest_enisa_cra.py: ingestion of the 8 CRA-relevant ENISA documents
  (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology,
  CVD Policies, Threat Landscape 2025). chunk_strategy=legal,
  requirement_strength=guidance|consultation_draft|evidentiary.
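The effect of the two thresholds mentioned for gpre1_object_groups_incremental.py can be illustrated with a toy nearest-neighbor assignment. This is a rough sketch under assumptions: it uses 2-dimensional vectors instead of bge-m3 embeddings, compares against one representative vector per group, and the names `cosine` and `assign_group` are hypothetical; how the real script represents groups is not stated in the message.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_group(vec, group_vectors, threshold=0.85):
    """Return (group_id, similarity) of the nearest group.

    group_id is None when the best similarity is below the threshold,
    meaning the object would not be appended to any existing group.
    """
    best_id, best_sim = None, -1.0
    for group_id, group_vec in group_vectors.items():
        sim = cosine(vec, group_vec)
        if sim > best_sim:
            best_id, best_sim = group_id, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim


groups = {"g_firewall": [1.0, 0.0], "g_backup": [0.0, 1.0]}
new_object = [0.8, 0.6]  # cosine similarity of about 0.8 to g_firewall

strict_id, _ = assign_group(new_object, groups)                  # default 0.85
loose_id, _ = assign_group(new_object, groups, threshold=0.78)   # broader matching
```

With the default threshold the example object stays unassigned; lowering it to 0.78 attaches the object to `g_firewall`, which is the trade-off between precision and broader synonym matching.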

Source data: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html
(PDFs are excluded via .gitignore).

Result of this pipeline iteration:
- 1,296 new CRA controls + 19,652 atomic children
- +362 new master controls, 10,017 existing ones extended
- Total: 13,950 MCs, 620 CRA MCs (previously 566), 1,304 CRA atomics (previously 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 18:21:46 +02:00

2 Commits