Add retry logic and circuit breakers for external service calls #10

Open
opened 2026-04-20 09:35:15 +00:00 by sharang · 0 comments
Owner

Problem

All calls to Qdrant, Ollama, Anthropic, and external HTTP endpoints are single-shot with no retry or backoff:

  • compliance/services/llm_provider.py — single HTTP call, no tenacity decorator
  • Document crawler LLM calls — no retry on transient failures
  • AI compliance SDK calls to backend — no circuit breaker

A single transient timeout causes the entire user request to fail and is indistinguishable from a permanent failure.
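To make the transient/permanent distinction concrete, here is a minimal standard-library sketch of retry with capped exponential backoff. It is not the code in `llm_provider.py`: `TransientError`, `PermanentError`, and `call_with_backoff` are hypothetical names introduced for illustration, and the mapping of real HTTP or client errors onto those two classes is an assumption left to the caller.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that a retry cannot fix."""


def call_with_backoff(func, attempts=3, base_delay=1.0, max_delay=10.0,
                      sleep=time.sleep):
    """Call func(), retrying only TransientError with capped exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: surface the last transient error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so concurrent callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
        # PermanentError is not caught, so it propagates on the first attempt.
```

Because only `TransientError` is retried, a permanent failure reaches the caller after a single attempt, while a transient one is either absorbed or re-raised once the attempts are used up.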

Required Actions

  1. Add tenacity to backend-compliance/requirements.txt
  2. Wrap all external HTTP calls with @retry(wait=wait_exponential(min=1, max=10), stop=stop_after_attempt(3), reraise=True)
  3. Implement a circuit breaker for Qdrant using pybreaker or equivalent — open the circuit after 5 consecutive failures, half-open after 30s
  4. Add a fallback path for LLM calls: if primary model unavailable, attempt secondary; if both fail, return a structured error (not 500)
  5. Apply same pattern in Go SDK: use github.com/sony/gobreaker for backend calls
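The circuit-breaker behavior in item 3 can be sketched as a small state machine. This is a standard-library illustration of the "or equivalent" option, not `pybreaker`'s API: `CircuitBreaker`, `CircuitOpenError`, and the `clock` parameter are hypothetical names. The sketch is not thread-safe and, once the timeout has elapsed, lets every caller through until one of them succeeds or fails, which a production breaker would likely restrict to a single trial call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

    def __init__(self, retry_after):
        super().__init__(f"circuit open, retry in {retry_after:.0f}s")
        self.retry_after = retry_after  # seconds until a trial call is allowed


class CircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max            # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds before going half-open
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            elapsed = self.clock() - self.opened_at
            if elapsed < self.reset_timeout:
                raise CircuitOpenError(self.reset_timeout - elapsed)
            # Timeout elapsed: half-open, let this trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the consecutive-failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each Qdrant request as `breaker.call(search, query)` (both names hypothetical) and translate `CircuitOpenError.retry_after` into the `Retry-After` header of the 503 response.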

Acceptance Criteria

  • Simulated Qdrant timeout causes retry with backoff, not immediate 500
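  • Circuit opens after the failure threshold (5 consecutive failures) is reached and returns 503 with a Retry-After header
  • LLM fallback path covered by tests: a primary failure falls through to the secondary model, and failure of both returns a structured error (not 500)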
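One way the fallback criterion could be exercised is sketched below. `complete_with_fallback` and `to_http_response` are hypothetical helpers, the providers are plain callables standing in for real clients, and mapping a double failure to 503 with a 30-second `Retry-After` is an assumption: the issue only requires a structured error that is not a 500.

```python
def complete_with_fallback(prompt, primary, secondary):
    """Try primary, then secondary; return a structured result instead of raising."""
    errors = []
    for name, provider in (("primary", primary), ("secondary", secondary)):
        try:
            return {"ok": True, "provider": name, "text": provider(prompt)}
        except Exception as exc:  # real code should catch provider-specific errors
            errors.append({"provider": name,
                           "error": type(exc).__name__,
                           "detail": str(exc)})
    return {"ok": False, "code": "llm_unavailable", "errors": errors}


def to_http_response(result, retry_after_seconds=30):
    """Map the structured result to a (status, headers, body) tuple."""
    if result["ok"]:
        return 200, {}, result
    # Assumption: total LLM unavailability is reported as 503, not 500.
    return 503, {"Retry-After": str(retry_after_seconds)}, result
```

A test can then pass a failing primary and a working secondary and assert that `provider` is `"secondary"`, and pass two failing providers and assert on the error body and status code.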
sharang added this to the M2: Data Integrity & Reliability milestone 2026-04-20 09:35:15 +00:00
sharang added the reliability, severity: high labels 2026-04-20 09:35:15 +00:00