fix(m7.1): JWKS refresh-on-failure in auth middleware #84

Merged
sharang merged 1 commit from fix/m7.1-jwks-refresh into main 2026-06-04 14:46:15 +00:00

Summary

Stops the silent 401 storm that happens when Keycloak (KC) rotates its signing keys. Today the JWKS is fetched once on the first request and never re-fetched, so after a rotation every token fails validation with "no matching key found in JWKS" and the agent only recovers on restart.

This PR makes the auth path retry-once-on-stale:

  • Splits the validation path into a pure try_validate(token, header, kid, jwks) returning a ValidationError { Stale | Permanent } enum.
  • Stale covers (a) kid not present in the cached JWKS and (b) jsonwebtoken::ErrorKind::InvalidSignature. Either can mean the cached keys no longer match KC's, so both are worth one refresh before rejecting.
  • On Stale, fetch_or_get_jwks(state, force=true) refreshes from the JWKS endpoint and try_validate runs once more before short-circuiting to 401.
  • fetch_or_get_jwks now holds the write lock across the network fetch, so a thundering herd of concurrent refreshers all reuse the first writer's result instead of each fetching the URL.

Permanent failures (expired token, malformed claim, missing tenant_id) bypass refresh entirely.
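
A minimal sketch of the retry-once flow described above, using only the standard library. Only try_validate and the Stale / Permanent split come from this PR. The Token and Jwks types, the string payloads on the enum variants, and validate_with_refresh are hypothetical stand-ins: the real middleware validates JWTs through jsonwebtoken and also takes header and kid arguments.

```rust
use std::collections::HashMap;

/// Mirrors the PR's `ValidationError { Stale | Permanent }` split.
/// The String payloads are an assumption made for this sketch.
#[derive(Debug, PartialEq)]
enum ValidationError {
    /// The cached keys may be out of date: worth one refresh.
    Stale(String),
    /// Refreshing cannot help: reject immediately.
    Permanent(String),
}

/// Hypothetical stand-in for a cached JWKS: kid -> key material.
type Jwks = HashMap<String, String>;

/// Hypothetical stand-in for a parsed token.
struct Token {
    kid: String,
    signed_with: String,
    expired: bool,
}

/// Pure check against a given key set: no I/O, no shared state.
fn try_validate(token: &Token, jwks: &Jwks) -> Result<(), ValidationError> {
    // (a) kid missing from the cached JWKS -> Stale.
    let key = jwks
        .get(&token.kid)
        .ok_or_else(|| ValidationError::Stale("no matching key found in JWKS".into()))?;
    // (b) signature mismatch against the cached key -> Stale.
    if *key != token.signed_with {
        return Err(ValidationError::Stale("invalid signature".into()));
    }
    // Claim problems are Permanent: fresh keys would not fix them.
    if token.expired {
        return Err(ValidationError::Permanent("token expired".into()));
    }
    Ok(())
}

/// Validate; on Stale, refresh the key set once and validate once more.
/// `fetch` stands in for the forced JWKS refresh.
fn validate_with_refresh(
    token: &Token,
    cached: &mut Jwks,
    fetch: impl FnOnce() -> Jwks,
) -> Result<(), ValidationError> {
    match try_validate(token, cached) {
        Err(ValidationError::Stale(_)) => {
            *cached = fetch();
            // Second and final attempt; any error here maps to 401.
            try_validate(token, cached)
        }
        // Ok and Permanent both pass through without touching the network.
        other => other,
    }
}
```

Keeping try_validate free of I/O is what makes the Stale classification unit-testable without a running Keycloak.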

Test plan

  • [x] cargo test -p compliance-core --features axum: 44/44 passing (new test: try_validate_returns_stale_when_kid_missing_from_jwks)
  • [x] cargo fmt --all / cargo clippy --workspace --exclude compliance-dashboard -- -D warnings: clean
  • [ ] Manual: run smoke.sh against KC, restart KC (rotating keys), confirm the next request triggers refresh and succeeds without an agent restart.

Follow-ups (not in this PR)

  • Wire compliance-agent to actually use the compliance-core middleware (parked PR #82, will rebase + shrink after this and #83 are in main).
  • Apply the same M7.1 protocol mappers to the prod breakpilot-dev realm.
sharang added 1 commit 2026-06-04 14:41:14 +00:00
fix(core): JWKS refresh-on-failure in M7.1 auth middleware
CI / Check (pull_request) Successful in 8m17s
CI / Detect Changes (pull_request) Has been skipped
CI / Deploy Agent (pull_request) Has been skipped
CI / Deploy Dashboard (pull_request) Has been skipped
CI / Deploy Docs (pull_request) Has been skipped
CI / Deploy MCP (pull_request) Has been skipped
f474699279
Without this, every Keycloak signing-key rotation produces a silent
401 storm against every request until the agent restarts — the cached
JWKS is held forever and never reconciled against KC.

Now: when `kid` isn't in the cached JWKS or the matching key fails
signature verification, we classify the failure as Stale, force a JWKS
refresh, and retry once. Anything else (expired, malformed, missing
tenant_id) is Permanent and short-circuits straight to 401.

* Splits the path into a pure `try_validate(token, header, kid, jwks)`
  helper returning a `ValidationError { Stale | Permanent }` enum.
* `fetch_or_get_jwks(state, force)` takes a force flag and holds the
  write lock across the network fetch so concurrent refreshers don't
  all hammer Keycloak when keys rotate (the second writer reuses what
  the first put in cache).
* Adds a unit test for the kid-not-found Stale classification.
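
One way the "second writer reuses what the first put in cache" behaviour could look, sketched with `std::sync::RwLock` and threads. The commit does not say how a later writer detects that the cache is already fresh, so the `generation` counter, the `JwksCache` type, and the `fetch_or_get` signature below are assumptions for illustration, not the code in this commit.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a JWKS: kid -> key material.
type Jwks = HashMap<String, String>;

#[derive(Default)]
struct Cached {
    /// Bumped on every fetch; 0 means "never fetched". (Assumed mechanism.)
    generation: u64,
    keys: Jwks,
}

struct JwksCache {
    inner: RwLock<Cached>,
}

impl JwksCache {
    fn new() -> Self {
        Self { inner: RwLock::new(Cached::default()) }
    }

    /// `stale_generation` is `Some(g)` when the caller found generation `g`
    /// stale and wants a forced refresh, `None` for a normal lookup.
    /// `fetch` stands in for the HTTP call to the JWKS endpoint.
    fn fetch_or_get(
        &self,
        stale_generation: Option<u64>,
        fetch: impl FnOnce() -> Jwks,
    ) -> (u64, Jwks) {
        // Fast path: unforced lookups only need the read lock.
        if stale_generation.is_none() {
            let cached = self.inner.read().unwrap();
            if cached.generation > 0 {
                return (cached.generation, cached.keys.clone());
            }
        }
        // Slow path: take the write lock and keep it for the whole fetch,
        // so concurrent refreshers queue up here instead of all fetching.
        let mut cached = self.inner.write().unwrap();
        let needs_fetch = match stale_generation {
            // Someone already replaced the generation we saw: reuse it.
            Some(seen) => cached.generation <= seen,
            None => cached.generation == 0,
        };
        if needs_fetch {
            cached.keys = fetch();
            cached.generation += 1;
        }
        (cached.generation, cached.keys.clone())
    }
}
```

The trade-off is that readers also block while the fetch is in flight. That is presumably acceptable here, since requests arriving during a rotation could not validate against the old keys anyway.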

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
sharang merged commit dbadff0aac into main 2026-06-04 14:46:15 +00:00

Reference: sharang/compliance-scanner-agent#84