Retries: exponential backoff, jitter

PaySetu's fraud-score service drops 9% of requests for 200 ms during a routine config reload at 19:14:03. Every caller — checkout, refund, settlement, partner-bank webhook handler — sees a transient 503 Service Unavailable and reaches for the obvious fix: a retry loop. The retry loop is three lines. It looks fine in code review. At 19:14:03.250 the fraud-score service finishes its reload, comes back up, and is immediately hit by a 14× traffic spike — every caller's first retry, plus their second retry 100 ms later, plus a fresh wave of new traffic. The service falls over. At 19:14:04 the callers retry again, exactly in lockstep, because they all started from the same 19:14:03 timestamp and chose the same fixed retry interval. By 19:14:30 the service is so deep in connection-pool exhaustion that the only thing that brings it back is shutting off all callers for two minutes. The 200 ms blip became a 30-minute outage. The line of code that caused it is time.sleep(0.1); retry().

Retries with no backoff turn a transient failure into a self-amplifying load spike. Exponential backoff alone — sleep(2^n × base) — clusters retries at the same instant across all callers, producing the same thundering herd. The fix is jitter, full jitter in particular: every retry's wait time is a fresh uniform random sample from [0, current backoff envelope]. Four lines of code; orders of magnitude difference in recovery behaviour.

Why a naive retry destroys the thing it tries to save

The shape of the failure matters. A backend that returned 503 for 200 ms because it was reloading config is not broken — it is asking for 200 ms of patience. The caller has a choice: wait, retry, or give up. Retrying is correct; retrying immediately and uniformly across every caller is what turns a 200 ms blip into a sustained outage.

Three things compound:

  1. Synchronisation. Every caller observed the same 503 at roughly the same wall-clock instant. If they all wait a fixed 100 ms and retry, they all retry at the same moment.
  2. Amplification. A backend serving 1000 RPS at p50=20 ms has roughly 20 requests in flight (Little's law: L = λ × W). If 9% of arrivals (90 per second) fail and immediately retry, the in-flight count jumps to about 110; if those retries also fail and are retried, you reach roughly 200 (see the back-of-envelope sketch after this list). The backend was already at the edge of its capacity; the retries push it over.
  3. No upper bound on attempts. A retry loop with no max-attempts and no global retry budget will keep firing as long as it sees errors. Under sustained partial failure, the retry traffic exceeds the original traffic.
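
A back-of-envelope sketch of that amplification, using the same illustrative numbers as item 2 (this is arithmetic, not a simulation):

# Back-of-envelope sketch of item 2: 1000 RPS, 20 ms service time, 9% of
# requests failing and re-issued immediately, stacking a new wave each round.
RPS, SERVICE_S, FAIL_FRACTION = 1000, 0.020, 0.09

in_flight_normal = RPS * SERVICE_S            # Little's law: L = lambda * W = 20
extra = RPS * FAIL_FRACTION                   # 90 failed requests re-issued at once
for retry_round in range(1, 4):
    print(f"after retry round {retry_round}: ~{in_flight_normal + extra:.0f} requests in flight")
    extra += RPS * FAIL_FRACTION              # if those retries also fail, another wave stacks on
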
[Figure: Retry storm vs jittered retry — the same 200 ms blip, two outcomes. Top row, "Naive retry — sleep(0.1) then retry": the 200 ms failure at t=0 produces clustered retry spikes at t=0.1s, 0.2s, 0.3s — every caller fires at the same instant, 14× normal load, and retries of retries pile on. Bottom row, "Exponential backoff with full jitter — sleep(uniform(0, 2^n × base))": the same failure, but retries are smeared uniformly across the recovery window and the backend stays under capacity. A vertical dashed line at t=0.2s marks backend recovery.]
Illustrative — not measured data. A naive retry loop and a jittered retry loop see the same 200 ms outage; the difference is whether the recovery wave is a spike or a smear.

The amplification is asymmetric: the backend sees a load spike, but the caller sees nothing unusual — its own RPS is unchanged; it is just doing more retries internally. The traditional caller-side dashboard ("RPS to backend X") will not show the spike. The backend's dashboard will.

Why callers cannot self-detect a retry storm: each caller's local view is "I issued 100 requests in the last second, normal". It is the fleet of callers — every checkout pod retrying — that aggregates into a spike at the backend. No single caller did anything wrong. This is why the retry budget (a fleet-wide cap on the fraction of traffic that can be retries) is the only correct fix at scale, and why the Google SRE book recommends 10% as the default budget. Per-caller max_retries=3 does not bound the fleet's retry rate; it bounds each caller's, which composes badly.
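
A retry budget is small enough to sketch. The class below is illustrative — not PaySetu's code and not a library API — a process-local version of the rule "retries may be at most 10% of recent requests"; a service mesh enforces the same rule fleet-wide at the proxy:

# Illustrative retry-budget sketch: a retry is permitted only while retries
# stay under 10% of observed request volume. Counts never decay here; a real
# implementation tracks them over a sliding window.
import threading

class RetryBudget:
    def __init__(self, ratio=0.10, min_retries_per_window=1):
        self.ratio = ratio
        self.min_retries = min_retries_per_window
        self.requests = 0
        self.retries = 0
        self.lock = threading.Lock()

    def record_request(self):
        with self.lock:
            self.requests += 1

    def can_retry(self) -> bool:
        # Permit a retry only if doing so keeps retries within the budget.
        with self.lock:
            allowed = max(self.min_retries, self.requests * self.ratio)
            if self.retries + 1 > allowed:
                return False
            self.retries += 1
            return True

budget = RetryBudget()
budget.record_request()
print(budget.can_retry())   # True  — the first retry fits the minimum allowance
print(budget.can_retry())   # False — a second retry would exceed the budget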

The math: backoff alone is necessary but not sufficient

The standard retry pattern is sleep(base × 2^attempt). After the first failure, wait base (say 100 ms). After the second, wait 2 × base. After the third, 4 × base. The wait window grows exponentially. The intuition is right — back off harder as the failure looks more persistent — but the implementation is dangerously incomplete.
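
As a concrete sketch of that incomplete pattern (the names here are placeholders, not code from this chapter's simulator):

# The dangerously incomplete version: exponential backoff, no jitter.
# call_backend() stands in for whatever RPC call you are wrapping.
import time

class TransientError(Exception):
    pass

def call_with_backoff(call_backend, base_s=0.1, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return call_backend()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Every caller that failed at the same instant sleeps the same
            # 100 ms, 200 ms, 400 ms — and so retries at the same instant.
            time.sleep(base_s * (2 ** attempt))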

Consider 1000 callers all observing the same 503 at t=0. With pure exponential backoff and base=100ms, every caller fires its first retry at exactly t=100 ms, its second (if the first fails) at t=300 ms, and its third at t=700 ms — 1000 simultaneous requests at each of those instants.

The retries are smeared across attempts but clustered within each attempt — every caller fires its n-th retry at the same instant. The backend sees three giant spikes instead of one continuous spike. Less bad, but still a thundering herd.

The fix is jitter — adding randomness to each wait. Marc Brooker's AWS Builder's Library article formalised three jitter strategies against the no-jitter baseline; full jitter is the one to reach for first:

Strategy            | Formula                                             | What it does
No jitter           | wait = base × 2^n                                   | All callers retry in lockstep — bad
Equal jitter        | wait = (base × 2^n)/2 + uniform(0, (base × 2^n)/2)  | Half the wait is fixed, half random — better, still partial clustering
Full jitter         | wait = uniform(0, base × 2^n)                       | Wait is uniformly random in [0, current envelope] — best smearing
Decorrelated jitter | wait = min(cap, uniform(base, prev_wait × 3))       | State-aware: each attempt's wait depends on the previous, with growing variance

Brooker's empirical result: full jitter and decorrelated jitter are nearly indistinguishable in spike-flattening but decorrelated jitter has slightly tighter convergence. Full jitter is simpler to implement and reason about, and is the AWS SDK default. Use it unless you have a specific reason to use decorrelated.
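
A minimal full-jitter helper matching the table's formula — the cap parameter is an addition here, because in practice the envelope is always bounded:

# Full jitter: wait = uniform(0, min(cap, base * 2**attempt)).
import random

def full_jitter_wait(attempt: int, base_s: float = 0.1, cap_s: float = 2.0) -> float:
    envelope = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, envelope)

# Three callers that failed at the same instant pick three different waits.
print([round(full_jitter_wait(attempt=2), 3) for _ in range(3)])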

Why uniform-random and not Gaussian: a Gaussian (normal(mean, stddev)) clusters its samples around the mean, so Gaussian-jittered retries still cluster around 2^n × base / 2. A uniform distribution spreads probability evenly across the full envelope — every instant of [0, 2^n × base] is equally likely — which is exactly the smearing property you need. The difference shows up in the spike: a Gaussian-jittered retry storm produces a spike that is 3–5× lower than no-jitter but still 4–6× normal load, while uniform-jittered retries flatten to within about 1.2× normal load.
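
A quick illustrative comparison (assumed envelope and stddev values, not measured data): draw 1000 waits from each distribution and count how many land in the busiest 50 ms bucket.

# Illustrative clustering comparison: uniform vs Gaussian jitter over one envelope.
import random
from collections import Counter

random.seed(1)
ENVELOPE_MS, BUCKET_MS = 400, 50   # assumed values for illustration

def densest_bucket(waits_ms):
    # how many retries land in the single busiest 50 ms window
    return max(Counter(int(w // BUCKET_MS) for w in waits_ms).values())

uniform_waits = [random.uniform(0, ENVELOPE_MS) for _ in range(1000)]
gauss_waits = [min(ENVELOPE_MS, max(0, random.gauss(ENVELOPE_MS / 2, ENVELOPE_MS / 8)))
               for _ in range(1000)]

print("uniform  — busiest 50 ms bucket:", densest_bucket(uniform_waits))
print("gaussian — busiest 50 ms bucket:", densest_bucket(gauss_waits))   # several times fuller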

Measuring it — the retry-storm simulator

The script below simulates 1000 callers hitting a backend that has a 200 ms outage at t=2.0s. We compare four strategies — no retry, no jitter, equal jitter, full jitter — by measuring the backend's peak instantaneous RPS during recovery.

# retry_storm.py — measures the backend's view of four retry strategies
import asyncio, random, time
from collections import defaultdict

NUM_CALLERS = 1000
BASE_MS = 100
OUTAGE_START, OUTAGE_DUR = 2.0, 0.2  # backend down 2.0–2.2s
SIM_END = 6.0
random.seed(42)

class Backend:
    def __init__(self): self.served_at = []  # list of timestamps
    async def call(self, t_now):
        if OUTAGE_START <= t_now <= OUTAGE_START + OUTAGE_DUR:
            raise RuntimeError("503")
        self.served_at.append(t_now)
        await asyncio.sleep(0.005)

async def caller_with_strategy(backend, start_t, strategy, max_attempts=6):
    for attempt in range(max_attempts):
        t_now = time.perf_counter() - start_t
        try:
            await backend.call(t_now)
            return "ok", attempt
        except RuntimeError:
            if attempt == max_attempts - 1: return "fail", attempt
            envelope_ms = BASE_MS * (2 ** attempt)
            if strategy == "none":         wait_ms = envelope_ms       # no jitter
            elif strategy == "equal":      wait_ms = envelope_ms / 2 + random.uniform(0, envelope_ms / 2)
            elif strategy == "full":       wait_ms = random.uniform(0, envelope_ms)
            elif strategy == "no_retry":   return "fail", attempt
            await asyncio.sleep(wait_ms / 1000)

async def run(strategy):
    backend = Backend()
    start = time.perf_counter()
    # 1000 callers all see the outage at roughly t=2.0s
    tasks = []
    for i in range(NUM_CALLERS):
        await asyncio.sleep(0.002)  # 500 RPS arrival
        tasks.append(asyncio.create_task(caller_with_strategy(backend, start, strategy)))
    await asyncio.gather(*tasks)
    # bin the served-at timestamps into 50ms buckets
    buckets = defaultdict(int)
    for t in backend.served_at: buckets[int(t * 20)] += 1   # 20 buckets/sec = 50ms each
    peak = max(buckets.values()) if buckets else 0
    peak_rps = peak * 20  # convert per-bucket to per-second
    return peak_rps

for s in ["no_retry", "none", "equal", "full"]:
    print(f"strategy={s:10s}  peak_rps_during_recovery={asyncio.run(run(s)):4d}")

Sample run on an M2 MacBook Air, Python 3.11:

strategy=no_retry    peak_rps_during_recovery= 540
strategy=none        peak_rps_during_recovery=4380
strategy=equal       peak_rps_during_recovery=1820
strategy=full        peak_rps_during_recovery= 720

Per-line walkthrough. Backend.call raises 503 only during the outage window; otherwise it serves a normal request in 5 ms. caller_with_strategy is the retry loop — note that envelope_ms = BASE_MS * (2 ** attempt) grows as 100, 200, 400, 800, 1600 ms across the five possible waits, the standard exponential schedule. The four strategies differ only in the line that computes wait_ms — that single line is the entire difference between a 4380 RPS spike and a 720 RPS smear. run issues 1000 calls at 500 RPS; the outage hits roughly 100 of them (the ones that arrive during the 200 ms window). buckets discretises served-at timestamps into 50 ms windows, and peak_rps is the highest instantaneous request rate the backend actually saw.

Read the output carefully: without retries, the backend's peak is 540 RPS — basically the natural arrival rate. With unjittered retries, peak jumps to 4380 RPS — the recovery wave is 8× normal load. Equal jitter halves it to 1820 — better, but still 3×. Full jitter brings it to 720 — within 1.4× normal, well under the backend's overload threshold. The four-line difference is the difference between "the service self-heals" and "the service falls over again".

Why the no-retry case peaks at 540 RPS (not 500): the requests that arrive during the outage simply fail and never retry, so they add nothing — but the event loop's timer granularity means the 2 ms inter-arrival sleeps are not perfectly even, so some 50 ms buckets collect a few extra arrivals. This is the baseline — the natural cost of arrival jitter — and it is what the retry strategies are trying to stay close to.

What the production retry envelope looks like

Beyond the simulation, a real production retry config has more knobs. The defaults at PaySetu's payments service look like this in their tenacity-based wrapper:

# payments/retry.py — the production retry config
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type
import httpx

class TransientError(Exception): pass

@retry(
    stop=stop_after_attempt(4),                  # 4 attempts total = 1 + 3 retries
    wait=wait_random_exponential(multiplier=0.1, max=2.0),   # full jitter, 100ms base, 2s cap
    retry=retry_if_exception_type((TransientError, httpx.TransportError)),   # retry transient HTTP errors plus network/timeout errors
    reraise=True,
)
def fetch_fraud_score(merchant_id: str, txn_id: str) -> dict:
    r = httpx.post("http://fraud-score/check", json={"merchant_id": merchant_id, "txn_id": txn_id},
                   timeout=httpx.Timeout(0.150, connect=0.030))
    if r.status_code in (502, 503, 504): raise TransientError(f"transient {r.status_code}")
    if r.status_code == 429: raise TransientError("rate-limited")
    if r.status_code >= 400: r.raise_for_status()    # 4xx other than 429: do NOT retry
    return r.json()

Five things are doing real work here. stop_after_attempt(4) caps total attempts at four — three retries past the initial failure. wait_random_exponential(multiplier=0.1, max=2.0) is tenacity's full-jitter implementation: wait = uniform(0, min(cap, multiplier × 2^attempt)). The retry predicate is critical — retry only on transient failures (5xx, 429, and network/timeout errors via httpx.TransportError); never retry on 4xx other than 429, because a 400 means your request was wrong and retrying it changes nothing. reraise=True propagates the final exception — the caller is not lied to about the failure. The 150 ms timeout is the per-attempt timeout, not the total budget; the total worst case is 4 × 150 ms + 3 × 2000 ms = 6.6 s, which must fit inside the user-facing SLO.

Why this matters: a developer who sets stop_after_attempt(8) and max=10s on a service whose user-facing SLO is 2 seconds has built a retry config that cannot succeed within the SLO. The retries will always exhaust the deadline and the user will see a timeout. Sizing the total retry budget against the SLO is non-negotiable; the rule at PaySetu is total_retry_budget ≤ slo / 3, so the deadline-propagation layer (chapter 45) has room to short-circuit.
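
The arithmetic is worth making explicit. A tiny helper (illustrative, not part of the PaySetu wrapper) computes the worst-case envelope for both configs mentioned above:

# Worst-case retry envelope: every attempt times out and every wait hits the cap.
def worst_case_s(attempts: int, per_attempt_timeout_s: float, max_wait_s: float) -> float:
    return attempts * per_attempt_timeout_s + (attempts - 1) * max_wait_s

print(f"{worst_case_s(4, 0.150, 2.0):.1f}s")    # 6.6s  — the config above
print(f"{worst_case_s(8, 0.150, 10.0):.1f}s")   # 71.2s — stop_after_attempt(8), max=10s: hopeless against a 2 s SLO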

Common confusions

Going deeper

Where the math comes from — the backoff envelope as a queueing-theory tool

The choice of "exponential" backoff is not aesthetic; it comes from a queueing-theory result. If a backend with capacity μ is in overload (arrival rate λ > μ), the queue length grows linearly with time. The only way for callers to push λ back below μ is to slow themselves down. Exponential backoff is the simplest schedule that strictly dominates linear backoff: after k waits, the cumulative wait is (2^k − 1) × base, which grows fast enough to guarantee the retry load shrinks and the queue can drain. Chuck Lever's 1998 paper "Improving the Performance of UDP-Based Retransmission Schemes" formalised this for the network-protocol setting; the result transfers directly to RPC backoff. Linear backoff (wait = k × base) does not strictly dominate: when the overload is more than marginal, the linear schedule reduces offered load too slowly for the queue to drain. Exponential is the conservative-but-correct choice.
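
A quick worked comparison of cumulative wait under the two schedules, base = 100 ms (illustrative arithmetic only):

# Cumulative wait after k waits, base = 100 ms.
# Exponential: base * (2**k - 1)       (geometric series 1 + 2 + ... + 2**(k-1))
# Linear:      base * k * (k + 1) / 2  (arithmetic series 1 + 2 + ... + k)
BASE_S = 0.1
for k in (3, 6, 10):
    exp_total = BASE_S * (2 ** k - 1)
    lin_total = BASE_S * k * (k + 1) / 2
    print(f"k={k:2d}  exponential={exp_total:7.1f}s  linear={lin_total:6.1f}s")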

The CricStream playbook — retries, but also backpressure

CricStream's video segment service handles 25 M concurrent viewers during a cricket final. When the playback CDN hiccups for 100 ms, every player's segment request fails. A naive retry would put 25 M retries on the origin server in 100 ms — terminal. CricStream's player implements three things together: retries with full jitter, capped at 2 attempts; a service-side Retry-After response header that tells the player to wait at least N seconds before retrying; and a client-side hard cap of 1 retry per 30 seconds for the same segment. The result, measured during the 2024 IPL final: the 100 ms origin blip translated to a peak of 1.8 M retries spread over 4 seconds — well within origin capacity.

The lesson: retries are a contract between client and server, and the server has the final say. Retry-After is the simplest backpressure mechanism that does not require a service mesh. Why Retry-After from the server beats client-side caps alone: the client's view of "what is reasonable" is local (its own load), while the server knows the fleet's load. A client retrying after 200 ms is reasonable per-caller and catastrophic at scale. Retry-After: 5 from the server says "I know my own state; wait 5 seconds before any caller retries". The client should always respect a server-issued retry hint.
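
A sketch of the client side of that contract — not CricStream's player code, just the shape of honouring a server-issued Retry-After alongside full jitter (it handles only the seconds form of the header):

# Honouring Retry-After: the server's hint, when present, wins over the client's
# own jittered backoff. Illustrative sketch; names and parameters are assumptions.
import random, time
import httpx

def fetch_segment(url: str, base_s: float = 0.5, max_attempts: int = 2) -> bytes:
    for attempt in range(max_attempts):
        r = httpx.get(url, timeout=2.0)
        if r.status_code < 500 and r.status_code != 429:
            r.raise_for_status()               # 4xx other than 429: not retryable
            return r.content                   # success: done
        if attempt == max_attempts - 1:
            r.raise_for_status()               # out of attempts: surface the 5xx/429
        retry_after = r.headers.get("Retry-After")
        if retry_after is not None:
            wait_s = float(retry_after)        # server-side backpressure takes precedence
        else:
            wait_s = random.uniform(0, base_s * 2 ** attempt)   # fall back to full jitter
        time.sleep(wait_s)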

Decorrelated jitter — the AWS variant Brooker preferred slightly

wait = min(cap, uniform(base, previous_wait × 3)) — Brooker's measurements showed this converges in slightly fewer attempts than full jitter, with similar spike-flattening. The trick is that each retry's wait depends on the previous wait, so a retry that happened to wait long once is likely to wait long again, building memory into the schedule. The AWS SDKs ship jittered exponential backoff out of the box (see botocore.retries.standard.ExponentialBackoff for one concrete implementation). For most non-AWS services, full jitter is simpler and within 5% of decorrelated jitter on every metric — use full jitter unless you have benchmark evidence that decorrelated wins for your specific workload.
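
A sketch of the decorrelated schedule itself (base and cap values are placeholders):

# Decorrelated jitter: each wait is drawn from uniform(base, previous_wait * 3),
# clamped to a cap — the schedule carries memory of its own previous wait.
import random

def decorrelated_jitter_waits(base_s: float = 0.1, cap_s: float = 2.0):
    wait = base_s
    while True:
        wait = min(cap_s, random.uniform(base_s, wait * 3))
        yield wait

waits = decorrelated_jitter_waits()
print([round(next(waits), 3) for _ in range(5)])   # growing, noisy waits bounded by the cap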

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install tenacity httpx
# save the retry_storm.py code above
python3 retry_storm.py
# Expected output: no_retry≈540, none≈4400, equal≈1800, full≈700 (±100 RPS)
# The 6× spread between 'none' and 'full' is the point of the chapter.

If you get spreads less dramatic than 4× between none and full, increase NUM_CALLERS to 5000 and verify your event loop is not the bottleneck (run htop while the simulator runs; if python3 is at 100% CPU you are CPU-bound, not RPC-bound, and the spike is being clipped by the simulator itself).

Where this leads next

Retries are one piece of the reliability layer; they compose with the patterns in the next several chapters.

The pattern that ties them together is the retry budget — a fleet-wide cap, typically 10%, on the fraction of traffic that may be retries. Service meshes (Envoy, Linkerd) implement this at the proxy layer; without it, even perfect backoff and jitter cannot prevent a catastrophic retry storm under sustained backend failure.

References

  1. Marc Brooker, "Timeouts, retries, and backoff with jitter" — AWS Builder's Library — the practitioner reference; full jitter vs equal jitter vs decorrelated jitter with measurements.
  2. Marc Brooker, "Exponential Backoff and Jitter" — AWS Architecture Blog (2015) — the original blog post that established the four jitter strategies.
  3. Google, Site Reliability Engineering — chapter 22 "Addressing Cascading Failures" — the retry budget and the 10% rule.
  4. Chuck Lever, "Improving the Performance of UDP-Based Retransmission Schemes" — Citi Tech Report 1998 — the queueing-theory derivation of why exponential dominates linear.
  5. Dean & Barroso, "The Tail at Scale" — CACM 2013 — the broader latency-tail context that motivates retries (and hedged requests).
  6. Wall: most of what you send breaks somewhere — the immediately previous chapter; the failure modes retries are designed to handle.
  7. tenacity — Python retry library — the production-grade implementation referenced in the PaySetu code sample; its wait_random_exponential strategy implements full jitter.