fix(m7.1): JWKS refresh-on-failure in auth middleware #84
Summary
Stops the silent 401 storm that happens when Keycloak rotates its signing keys. Today the JWKS is fetched once on first request and never re-fetched; after rotation every token fails with `no matching key found in JWKS` and the agent only recovers on restart.

This PR makes the auth path retry-once-on-stale:
* `try_validate(token, header, kid, jwks)` returning a `ValidationError { Stale | Permanent }` enum.
* `Stale` covers (a) `kid` not present in the cached JWKS and (b) `jsonwebtoken::ErrorKind::InvalidSignature`; both indicate the cached keys no longer match KC's.
* `fetch_or_get_jwks(state, force=true)` refreshes from the JWKS endpoint and `try_validate` runs once more before short-circuiting to 401.
* `fetch_or_get_jwks` now holds the write lock across the network fetch, so a thundering herd of concurrent refreshers all reuse the first writer's result instead of each fetching the URL.
* `Permanent` failures (expired token, malformed claim, missing tenant_id) bypass refresh entirely.

Test plan
* `cargo test -p compliance-core --features axum`: 44/44 passing (new test: `try_validate_returns_stale_when_kid_missing_from_jwks`)
* `cargo fmt --all` / `cargo clippy --workspace --exclude compliance-dashboard -- -D warnings`: clean

Follow-ups (not in this PR)
`breakpilot-dev` realm.

Without this, every Keycloak signing-key rotation produces a silent 401 storm against every request until the agent restarts; the cached JWKS is held forever and never reconciled against KC.

Now: when `kid` isn't in the cached JWKS or the matching key fails signature verification, we classify the failure as Stale, force a JWKS refresh, and retry once. Anything else (expired, malformed, missing tenant_id) is Permanent and short-circuits straight to 401.

* Splits the path into a pure `try_validate(token, header, kid, jwks)` helper returning a `ValidationError { Stale | Permanent }` enum.
* `fetch_or_get_jwks(state, force)` takes a force flag and holds the write lock across the network fetch so concurrent refreshers don't all hammer Keycloak when keys rotate (the second writer reuses what the first put in cache).
* Adds a unit test for the kid-not-found Stale classification.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
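
For reviewers who want the control flow at a glance, here is a minimal std-only sketch of the Stale/Permanent split and the single retry. It is not the code in this PR. `Token`, `Jwks`, `authenticate`, and the `fetch` closure are hypothetical stand-ins, `try_validate` takes a simplified signature, and signature verification is reduced to a string comparison instead of a `jsonwebtoken` call. Locking around the refresh is left out.

```rust
use std::collections::HashMap;

/// Why validation failed; mirrors the Stale | Permanent split described above.
#[derive(Debug, PartialEq)]
enum ValidationError {
    /// Cached keys may be out of date: refresh and retry once.
    Stale,
    /// Refreshing cannot help: answer 401 immediately.
    Permanent,
}

/// Hypothetical stand-in for a JWKS: key id -> key material.
type Jwks = HashMap<String, String>;

/// Hypothetical stand-in for a parsed token.
struct Token {
    kid: String,
    signed_with: String,
    expired: bool,
}

/// Pure classification step (simplified signature, assumption).
fn try_validate(token: &Token, jwks: &Jwks) -> Result<(), ValidationError> {
    if token.expired {
        return Err(ValidationError::Permanent);
    }
    match jwks.get(&token.kid) {
        // kid missing from the cached set: keys may have rotated.
        None => Err(ValidationError::Stale),
        // Key found but the "signature" does not match: also treated as stale.
        Some(key) if *key != token.signed_with => Err(ValidationError::Stale),
        Some(_) => Ok(()),
    }
}

/// Validate, and on a Stale failure refresh the cache and retry exactly once.
/// Any error returned from here would map to a 401 in the caller.
fn authenticate(
    token: &Token,
    cache: &mut Jwks,
    fetch: &mut dyn FnMut() -> Jwks,
) -> Result<(), ValidationError> {
    match try_validate(token, cache) {
        Err(ValidationError::Stale) => {
            *cache = fetch(); // forced refresh
            try_validate(token, cache) // single retry, no loop
        }
        other => other,
    }
}
```

In this sketch a token signed with a freshly rotated key succeeds after one refresh, while an expired token never triggers a fetch.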