feat(iace): benchmark risk comparison (traffic lights) + misuse pattern + 1:n matcher
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m23s
CI / test-go (push) Failing after 37s
CI / iace-gt-coverage (push) Successful in 24s
CI / test-python-backend (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m23s
CI / test-go (push) Failing after 37s
CI / iace-gt-coverage (push) Successful in 24s
CI / test-python-backend (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
#1 Risk-number comparison in the benchmark: ComputeRiskComparison derives the tool's S/F/W/P + Fine-Kinney per matched hazard and compares to the GT values; exposed on the benchmark response and rendered in a new RiskComparison table with GREEN/YELLOW/RED traffic lights on the risk number R (like the Excel), plus per-axis within-1 agreement cards. #2 Generic misuse pattern HP2103 "Personenbefoerderung auf Hebezeug" — gated to lift-family machine types, fires for ANY lifting device (not machine-specific). #3 Benchmark matcher is now 1:n — one broad engine hazard may cover several fine-grained GT sub-scenarios (foot/hand/leg crush), so coverage reflects real risk coverage rather than 1:1 wording matches. Validated on BOTH ground truths (robot cell + lift): leakage 0, ghosts 0, coverage held. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -48,6 +48,20 @@ export interface CategoryScore {
|
||||
category: string; gt_count: number; match_count: number; coverage: number
|
||||
}
|
||||
|
||||
export interface RiskComparisonPair {
|
||||
hazard_name: string
|
||||
gt_severity: number; gt_frequency: number; gt_probability: number; gt_avoidance: number; gt_risk: number
|
||||
eng_severity: number; eng_frequency: number; eng_probability: number; eng_avoidance: number
|
||||
fk_score: number; fk_band: string
|
||||
}
|
||||
|
||||
export interface RiskAgreement {
|
||||
n: number
|
||||
severity_within1: number; frequency_within1: number
|
||||
probability_within1: number; avoidance_within1: number
|
||||
rank_concordance: number
|
||||
}
|
||||
|
||||
export interface BenchmarkResult {
|
||||
coverage_score: number
|
||||
measure_coverage: number
|
||||
@@ -58,6 +72,8 @@ export interface BenchmarkResult {
|
||||
extra_in_engine: HazardSummary[]
|
||||
category_breakdown: CategoryScore[]
|
||||
risk_rank_pairs: { gt_rank: number; engine_rank: number; hazard_name: string; gt_risk_score: number }[]
|
||||
risk_comparison?: RiskComparisonPair[]
|
||||
risk_agreement?: RiskAgreement
|
||||
}
|
||||
|
||||
interface UseBenchmarkReturn {
|
||||
|
||||
Reference in New Issue
Block a user