Circuit breakers (Hystrix, Sentinel)
PaySetu's checkout service holds 200 worker threads. At 14:32:11 the fraud-score backend's p99 latency walks from 80 ms to 4.8 s — a downstream Postgres replica is doing a long-running vacuum. The retry layer (chapter 42) does its job: full jitter, four attempts capped at 6.6 s. Within 30 seconds, every one of checkout's 200 worker threads is parked inside fetch_fraud_score(), waiting on a backend that is not coming back soon. The checkout pod's /health endpoint stops responding because there are no threads left to handle it. Kubernetes marks the pod unhealthy and kills it. The replacement pod boots, sees the same fraud-score latency, fills its threads in 30 seconds, gets killed. The cascade hits all 24 checkout pods in 12 minutes. By 14:44, checkout is down. The fraud-score backend was never down — it was just slow. Retries amplified slow into terminal because nothing told the caller "stop trying for a while".
A circuit breaker is a per-callee state machine — closed, open, half-open — that fails fast when the failure rate crosses a threshold, gives the backend a timeout window to recover, then probes cautiously before resuming traffic. It does not improve the backend; it protects the caller's threads, queues, and SLO budget from being consumed by a backend that has stayed broken. Hystrix popularised the pattern; Sentinel and Resilience4j replaced it; the state machine is the same.
Why retries alone are not enough
Retries handle the case where a backend is transiently broken — a 200 ms config reload, a single dropped TCP connection, a brief GC pause. The shape of that failure is short: the backend recovers on its own within a few hundred milliseconds, and the retry envelope (base × 2^n with jitter) covers the gap. But there is a different shape of failure — slow rather than fast — that retries actively make worse:
- The backend's p99 latency walks from 80 ms to 4.8 s while still returning 200s.
- The caller's per-attempt timeout is 150 ms, so it sees timeouts and retries.
- With four attempts capped at 6.6 s total, every call now blocks the calling thread for up to 6.6 s instead of completing in 80 ms.
- The caller's thread pool — sized for the 80 ms case — fills with requests parked in retry sleep.
- Once threads are exhausted, even calls that would succeed cannot start, because there is no thread to run them. The caller's RPS to other healthy backends collapses too.
This is resource starvation by association. The fraud-score backend's slowness consumed every checkout thread, even though most of those threads' jobs had nothing to do with fraud-score. The user trying to look at their order history sees a 503 from checkout because checkout has no threads — not because order-history is broken.
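The arithmetic behind the collapse is worth making explicit. A back-of-envelope sketch (figures taken from the incident above; the relation is Little's law: concurrency = throughput × latency, so throughput = threads / latency):

```python
# Back-of-envelope: how many requests/s a fixed thread pool can sustain when
# every call blocks the thread for the full per-call latency.
THREADS = 200  # checkout's worker pool size, from the incident above

def max_sustainable_rps(threads: int, call_latency_s: float) -> float:
    # Little's law rearranged: throughput = concurrency / latency
    return threads / call_latency_s

healthy = max_sustainable_rps(THREADS, 0.080)  # backend answering in 80 ms
degraded = max_sustainable_rps(THREADS, 6.6)   # retry envelope capped at 6.6 s

print(f"healthy:  {healthy:.0f} rps")   # 2500 rps
print(f"degraded: {degraded:.0f} rps")  # ~30 rps: an ~83x capacity collapse
```

Nothing about the backend changed between the two lines except latency; the thread pool sized for 80 ms simply cannot carry the same offered load at 6.6 s per call.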
The circuit breaker, popularised by Netflix's Hystrix library in 2012 and now standard in Sentinel (Alibaba), Resilience4j (JVM), and pybreaker (Python), encodes a deal: when failures cross a threshold, stop trying. Return a fast error or a fallback. Wait long enough for the backend to recover. Then probe. Why "stop trying" is the correct response and not "try harder": the backend's slowness is usually caused by a resource bottleneck (CPU, connection pool, DB lock contention). Adding load — which is what retries do — makes the bottleneck worse, not better. Removing load is what gives the backend a chance to drain its own queues. A circuit breaker is the only standard pattern that actively reduces offered load during a failure, which is why it composes with retries instead of replacing them.
The state machine — closed, open, half-open
A circuit breaker is a state machine attached to a single (caller, callee, operation) tuple: the same caller calling fraud-score's check() is one breaker; the same caller calling fraud-score's enroll() is a different one. The states:
- Closed. Traffic flows normally. The breaker observes outcomes (success / failure / timeout). It maintains a sliding-window failure rate. As long as the rate is under the trip threshold, traffic continues.
- Open. The breaker has tripped. Every call returns immediately with a fast error (or invokes a configured fallback) without touching the backend. The caller's threads are not consumed. A timer is set for the sleep window — typically 10–60 seconds.
- Half-open. The sleep window has elapsed. The breaker allows a small, capped number of probe calls through (typically 1–5). If they all succeed, the breaker transitions back to closed. If any probe fails, it transitions back to open and the sleep window restarts.
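That per-operation keying is usually realised as a small registry that lazily creates one breaker per (callee, operation) pair. A sketch with invented names (`BreakerRegistry` is illustrative, and the `CircuitBreaker` here is a placeholder stub standing in for the real class built later in this chapter):

```python
# One breaker per (callee, operation) key: fraud-score/check and
# fraud-score/enroll must trip independently of each other.
import threading

class CircuitBreaker:  # placeholder stub; the real one is built below
    def __init__(self, fail_rate=0.5, window=20, sleep_s=30):
        self.fail_rate, self.window, self.sleep_s = fail_rate, window, sleep_s

class BreakerRegistry:
    def __init__(self, **defaults):
        self._breakers, self._lock = {}, threading.Lock()
        self._defaults = defaults

    def get(self, callee: str, operation: str) -> CircuitBreaker:
        key = (callee, operation)
        with self._lock:  # lazily create exactly one breaker per key
            if key not in self._breakers:
                self._breakers[key] = CircuitBreaker(**self._defaults)
            return self._breakers[key]

registry = BreakerRegistry(fail_rate=0.5, window=20, sleep_s=30)
check = registry.get("fraud-score", "check")
enroll = registry.get("fraud-score", "enroll")
assert check is not enroll                             # separate state machines
assert check is registry.get("fraud-score", "check")   # stable per key
```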
Three numbers parameterise the policy:
| Parameter | Typical | What it controls |
|---|---|---|
| Failure-rate threshold | 50% over last 20 calls | How tolerant before tripping. Lower = trips earlier on noisy backends |
| Sleep window | 30 s | How long to give the backend before probing |
| Probe count | 5 consecutive successes | How sure to be before resuming full traffic |
A common variant adds a slow-call threshold: a call that takes longer than (say) 1 s counts as a failure even if it returns 200. This catches the "walking p99" failure mode where the backend technically returns success but is so slow that its slowness is causing thread starvation upstream — the exact case that triggered PaySetu's checkout cascade.
Building one — the breaker, end to end
```python
# circuit_breaker.py — a working sliding-window circuit breaker
import time, random, threading
from collections import deque
from enum import Enum

class State(Enum):
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitOpenError(Exception): pass
class SlowCallError(Exception): pass

class CircuitBreaker:
    def __init__(self, fail_rate=0.5, window=20, sleep_s=30, probe_n=5, slow_ms=1000):
        self.fail_rate, self.window = fail_rate, window
        self.sleep_s, self.probe_n, self.slow_ms = sleep_s, probe_n, slow_ms
        self.outcomes = deque(maxlen=window)  # 1 = ok, 0 = fail
        self.state = State.CLOSED
        self.opened_at = 0.0
        self.probe_successes = 0
        self.lock = threading.Lock()

    def _record(self, ok):
        self.outcomes.append(1 if ok else 0)

    def _failure_rate(self):
        if len(self.outcomes) < self.window:
            return 0.0  # cold window: not enough data to judge yet
        return 1.0 - (sum(self.outcomes) / self.window)

    def _on_success(self):
        with self.lock:
            self._record(True)
            if self.state == State.HALF_OPEN:
                self.probe_successes += 1
                if self.probe_successes >= self.probe_n:
                    self.state = State.CLOSED

    def _on_failure(self):
        with self.lock:
            self._record(False)
            if self.state == State.HALF_OPEN:
                # one failed probe re-opens and restarts the sleep window
                self.state, self.opened_at = State.OPEN, time.time()
            elif self.state == State.CLOSED and self._failure_rate() >= self.fail_rate:
                self.state, self.opened_at = State.OPEN, time.time()

    def call(self, fn, *args, **kw):
        with self.lock:
            if self.state == State.OPEN:
                if time.time() - self.opened_at >= self.sleep_s:
                    self.state, self.probe_successes = State.HALF_OPEN, 0
                else:
                    raise CircuitOpenError("breaker open; failing fast")
        t0 = time.perf_counter()
        try:
            result = fn(*args, **kw)
        except Exception:
            self._on_failure()  # record exactly once, then re-raise
            raise
        elapsed_ms = (time.perf_counter() - t0) * 1000
        ok = elapsed_ms < self.slow_ms  # slow = failure, even on a 200
        if not ok:
            self._on_failure()
            raise SlowCallError(f"slow: {elapsed_ms:.0f}ms")
        self._on_success()
        return result

# --- demo: simulate a backend that fails 80% from t=2s to t=8s, then recovers
def backend(t_start):
    t = time.time() - t_start
    if 2.0 <= t <= 8.0 and random.random() < 0.8:
        raise RuntimeError("503")
    time.sleep(0.005)
    return "ok"

random.seed(7)
cb = CircuitBreaker(fail_rate=0.5, window=10, sleep_s=2, probe_n=3, slow_ms=1000)
start = time.time()
hist = []
for i in range(60):
    try:
        cb.call(backend, start)
        hist.append((round(time.time() - start, 2), cb.state.value, "ok"))
    except CircuitOpenError:
        hist.append((round(time.time() - start, 2), cb.state.value, "fast-fail"))
    except Exception:
        hist.append((round(time.time() - start, 2), cb.state.value, "fail"))
    time.sleep(0.2)

for h in hist[::4]:
    print(f"t={h[0]:5.2f}s state={h[1]:9s} outcome={h[2]}")
```
Sample run on an M2 MacBook Air, Python 3.11 (timings will vary slightly from run to run):

```
t= 0.00s state=closed    outcome=ok
t= 0.81s state=closed    outcome=ok
t= 1.61s state=closed    outcome=ok
t= 2.42s state=closed    outcome=fail
t= 3.22s state=open      outcome=fast-fail
t= 4.03s state=open      outcome=fast-fail
t= 4.83s state=open      outcome=fast-fail
t= 5.64s state=half_open outcome=fail
t= 6.44s state=open      outcome=fast-fail
t= 7.25s state=open      outcome=fast-fail
t= 8.05s state=half_open outcome=ok
t= 8.86s state=closed    outcome=ok
t= 9.66s state=closed    outcome=ok
t=10.47s state=closed    outcome=ok
t=11.27s state=closed    outcome=ok
```
Per-line walkthrough. `outcomes = deque(maxlen=window)` is the sliding window — appending past `window` evicts the oldest entry, giving a constant-space failure-rate calculation. `_failure_rate` waits until at least `window` calls have happened before reporting a real rate; this avoids tripping the breaker on the first couple of failures of a cold window. `call(fn, ...)` is the wrapper — every protected call goes through this path, and the lock makes the state transitions safe under concurrent callers. The OPEN-state branch at the top fails fast without touching `fn`; this is what saves the caller's threads. `elapsed_ms < self.slow_ms` is the slow-call check — the breaker treats a slow call as a failure, which catches the walking-p99 mode. `probe_successes` counts consecutive successes in HALF_OPEN; one failed probe flips the breaker back to OPEN immediately, restarting the sleep window.

Read the trace: the breaker stays CLOSED for the first 2 seconds, trips OPEN once the 80%-failure regime kicks in, fails fast for the sleep window, attempts a probe at t=5.6 s while the regime is still active (the probe fails, the breaker re-opens), and probes again at t=8.0 s after the regime ends (the probe succeeds, the breaker closes). Why probes fail fast when they fail: the half-open state lets probes through only until one fails; that first failure flips the state back to OPEN and resets the timer, so the next call (and the 50 after it) all see fast-fail without consuming a thread. This is what stops a still-broken backend from absorbing probe calls plus their subsequent retries every sleep window.
What Hystrix and Sentinel actually do differently
Hystrix (Netflix, 2012, in maintenance since 2018) and Sentinel (Alibaba, open-source 2018) are the two production-grade circuit-breaker implementations most engineers have used. They share the state machine above but differ on three axes:
| Axis | Hystrix | Sentinel |
|---|---|---|
| Window | Rolling 10 s of 1 s buckets (counters) | Sliding window of N calls OR sliding time window |
| Trip metric | Error count + error percentage | Error count, error percentage, slow-call ratio, RT |
| Bulkhead | Mandatory thread-pool isolation per command | Optional; semaphore by default |
| Fallback | First-class `getFallback()` method | First-class `blockHandler` / `fallback` annotations |
| Adaptive trip | No (fixed thresholds) | Yes — flow control + system-load shedding (BBR-inspired) |
Hystrix's signature contribution was bulkhead by default — every protected command got its own thread pool, sized small (10–20 threads), so even if the breaker did not trip, the caller's main pool was protected by the bulkhead alone. The cost is context-switch overhead and thread-pool sprawl in services that call hundreds of downstream commands. Sentinel went the other way: semaphore isolation by default (cheap), with thread-pool isolation as an opt-in. Sentinel also added adaptive trip rules — the system can shed load when the host's load average crosses a threshold, regardless of any single backend's failure rate. Why semaphore vs thread-pool matters: a semaphore-isolated breaker counts in-flight calls; if the count exceeds the limit, calls fast-fail. There is no extra thread switch. A thread-pool-isolated breaker submits the call to a separate executor; this does add latency (typically 0.1–1 ms per call) but lets the caller's main thread proceed even when the protected backend's calls are stuck. The right choice depends on whether the protected calls are blocking (use a thread pool) or non-blocking async (use a semaphore). For modern Python asyncio or Java CompletableFuture code, semaphore is almost always the right answer.
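The semaphore variant is simple enough to sketch: count in-flight calls, run the call inline when under the cap, and fast-fail when over it. A minimal illustration of the mechanism with invented names, not Sentinel's actual implementation:

```python
# Semaphore isolation: cap concurrent in-flight calls to one backend and
# fast-fail the overflow, without handing work to a separate thread pool.
import threading

class BulkheadFullError(Exception): pass

class SemaphoreIsolation:
    def __init__(self, max_in_flight: int):
        self._sem = threading.Semaphore(max_in_flight)

    def call(self, fn, *args, **kw):
        # Non-blocking acquire: callers never queue here, they fail fast.
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullError("too many in-flight calls; failing fast")
        try:
            return fn(*args, **kw)  # runs on the caller's own thread
        finally:
            self._sem.release()

iso = SemaphoreIsolation(max_in_flight=2)
print(iso.call(lambda: "ok"))  # under the cap: runs inline, prints "ok"
```

There is no executor, no queue, and no context switch: the only cost is one atomic counter operation per call, which is why semaphore isolation suits high-fan-out and async code.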
Common confusions
- "A circuit breaker replaces retries." They compose. The retry handles transient failures (a 200 ms blip); the breaker handles sustained failures (10-minute slowness). The order is `breaker → retry` from the outside: the breaker is checked first, fast-failing the entire retry envelope when it would otherwise consume threads. If you call retry-first then breaker, you waste 6 seconds of retry-sleep per call when the breaker has already decided to fail fast.
- "A circuit breaker improves the backend." It does the opposite — by reducing the load offered to the backend. The breaker improves the caller's posture (saves threads, returns faster errors, preserves SLO budget). The backend just gets less traffic, which gives it room to recover. Engineers sometimes argue that "the breaker should be on the server"; this is wrong — server-side rate limiting (chapter 50) is a different mechanism that protects the server from too many callers. The breaker protects the caller from a degraded server.
- "Open the breaker on the first failure." That is too aggressive — a single random failure (a TCP RST, a load-balancer hiccup) flips the breaker, fast-failing 30 seconds of traffic for nothing. The trip threshold needs a window — 50% over the last 20 calls is the typical default. KapitalKite's broker-routing service tried `fail_rate=0.1, window=5` early on; the breaker tripped 47 times per hour during a normal trading day, every trip flushing 30 seconds of legitimate orders. They moved to `0.5 / 20`; trips dropped to 2 per hour, all real.
- "`Hystrix` and `Sentinel` and `Resilience4j` are interchangeable." Same state machine, different operational properties. Hystrix uses thread-pool isolation and is in maintenance — do not start new projects on it. Resilience4j is the JVM successor (semaphore-default, functional API, lightweight). Sentinel adds adaptive load-shedding and a flow-control layer that goes beyond per-callee breaking. For Python, the `pybreaker` and `circuitbreaker` packages cover the basics; for production-grade, ports of Resilience4j's API exist (`aiocircuitbreaker`).
- "The breaker's threshold should be tuned to the backend's normal failure rate." Mostly right, but the more important thing is the trip-window size. A backend that normally errors at 0.1% needs a 50% trip threshold over a 100+ call window — small windows are noisy and the breaker will flap. Hystrix's default is `requestVolumeThreshold=20`; Sentinel's is `minRequestAmount=5`. Below the volume threshold, the breaker does not trip even if 100% of (the few) calls fail, which prevents low-traffic breakers from flapping on single isolated failures.
- "The breaker should fall back to a cached value." Sometimes — fraud-score has no safe fallback (you cannot guess), but a UI service rendering "recently viewed items" can fall back to the empty list. The fallback decision is business-domain, not infrastructure. The breaker's job is to fail fast; the fallback (or no fallback) is the caller's contract with the user. Hystrix's `getFallback()` is a hook, not a guarantee — write the fallback only when you know what "good enough degraded behaviour" looks like, and label it clearly in observability.
Going deeper
Why Hystrix went into maintenance — and what replaced it
Netflix put Hystrix into maintenance mode in 2018, which surprised many teams that had built their service mesh around it. The reason was not that the pattern was wrong; the pattern is correct and remains the industry default. The reason was that Hystrix's thread-pool-per-command model — its signature feature — was too heavy for modern services that called hundreds of downstream operations. A service with 200 distinct Hystrix commands ran 200 small thread pools, each with its own queue, each contending for the JVM's GC budget. Resilience4j replaced it with the same state machine, semaphore-default, no thread pools. The lesson is operational: the most-performant pattern at one architectural era (10-thread per pool was reasonable when services had 20 commands) becomes pathological at the next era (200 thread pools is a GC nightmare). The state-machine logic is timeless; the isolation primitive is era-dependent.
MealRush's checkout breaker — the slow-call threshold mattered more than the failure-rate
MealRush's order-placement service calls a downstream restaurant-availability API. The API returns 200 OK regardless of whether the restaurant is actually online; "offline" is encoded in the response body. During the 12:30 PM lunch rush, the availability API's p99 walked from 60 ms to 2.4 s as its DB connection pool saturated. Failure rate by HTTP status: 0%. Failure rate by actual error: 0%. The standard breaker did not trip. But order-placement's worker threads filled up because every call took 2.4 s instead of 60 ms. The fix was to enable Sentinel's slow-call-ratio circuit-breaking rule — calls slower than 800 ms count as failures for breaker purposes. With that rule set, the breaker tripped within 10 seconds of the latency walk, fast-failed 25 seconds of traffic, and order-placement's thread pool stayed within bounds. Why slow calls are the more dangerous failure mode in practice: explicit-error failures (5xx, network resets) are loud and trigger every alarm in your stack. Slow-but-successful calls are quiet — the dashboard shows "0% errors" because the calls technically succeed. They consume threads and SLO budget without showing up in any failure metric. The slow-call threshold is the breaker's only defence against this class of failure, and turning it on by default is one of the few unambiguous "always do this" patterns in reliability engineering.
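The classification itself is easy to demonstrate in isolation: count a call as failed when its latency crosses a threshold regardless of HTTP status, and trip on the ratio over a sliding window. A sketch with illustrative numbers (the 800 ms threshold mirrors the fix described above; the class name is invented):

```python
# Slow-call-ratio tripping: a call can return 200 OK and still count as a
# failure if it exceeds the latency threshold.
from collections import deque

class SlowCallTrip:
    def __init__(self, slow_ms=800, ratio=0.5, window=20):
        self.slow_ms, self.ratio = slow_ms, ratio
        self.samples = deque(maxlen=window)  # 1 = slow, 0 = fast

    def record(self, elapsed_ms: float, http_status: int) -> bool:
        """Record one call; return True when the breaker should trip.
        Note: http_status is deliberately ignored for classification."""
        self.samples.append(1 if elapsed_ms > self.slow_ms else 0)
        if len(self.samples) < self.samples.maxlen:
            return False  # cold window: not enough data to judge
        return sum(self.samples) / len(self.samples) >= self.ratio

trip = SlowCallTrip(slow_ms=800, ratio=0.5, window=20)
tripped = False
for _ in range(20):                 # 20 calls, all 200 OK, all 2400 ms
    tripped = trip.record(2400, 200) or tripped
print(tripped)                      # True: 0% HTTP errors, yet it trips
```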
The interaction with retries — why order matters
The right way to compose the two patterns:
caller → [circuit breaker check] → [retry envelope] → [actual RPC]
The breaker check happens outside the retry envelope. When the breaker is open, the call fast-fails immediately, before any retry sleeps. When the breaker is closed, the retry envelope runs as normal, and each individual attempt's outcome (success / fail / slow) updates the breaker's window. The wrong way is retry → breaker, which means each retry attempt independently checks the breaker; if the breaker opens during the retry sleep, the retry envelope still runs to exhaustion because the outer retry loop does not see the state change. Tenacity's retry_if predicates can check breaker state, but the cleaner pattern is breaker(retry(rpc)) — breaker on the outside, retry on the inside.
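A sketch of that composition, using a deliberately simplified stub breaker and retry loop (the names are invented; a real breaker keeps a windowed failure rate rather than opening on one failure):

```python
# breaker(retry(rpc)): the breaker check is OUTSIDE the retry envelope, so
# an open breaker fast-fails before any retry sleep is paid.
import random, time

class CircuitOpenError(Exception): pass

class StubBreaker:  # stand-in for a real sliding-window breaker
    def __init__(self):
        self.open = False
    def call(self, fn):
        if self.open:
            raise CircuitOpenError("breaker open; failing fast")
        try:
            return fn()
        except Exception:
            self.open = True  # real breaker: trip on windowed failure rate
            raise

def with_retry(fn, attempts=4, base_s=0.1):
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(random.uniform(0, base_s * 2 ** n))  # full jitter

def guarded_rpc(breaker, rpc):
    # Breaker on the outside, retry on the inside.
    return breaker.call(lambda: with_retry(rpc))

breaker = StubBreaker()
print(guarded_rpc(breaker, lambda: "payload"))  # closed breaker: runs normally
```

When the breaker is open, `guarded_rpc` raises `CircuitOpenError` immediately and the retry envelope (with its seconds of sleep) never runs; when it is closed, each attempt's outcome feeds the breaker as usual.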
Reproduce this on your laptop
```shell
python3 -m venv .venv && source .venv/bin/activate
pip install pybreaker tenacity httpx
# save circuit_breaker.py from the article body
python3 circuit_breaker.py
# Expected: trace shows CLOSED → OPEN at ~t=2.5s, fast-fail through t=8s,
#           probe attempts in HALF_OPEN, full recovery to CLOSED by t=9s.
```
If your trace shows the breaker oscillating (open → half-open → open repeatedly), increase window from 10 to 20 — small windows are noisy and the breaker will flap on any single bad probe. The default Hystrix requestVolumeThreshold is 20 for this reason.
Where this leads next
Circuit breakers are the protective half of the reliability layer; the next chapters complete the picture:
- Bulkheads — chapter 44; isolate one downstream's threads from another's so a single bad backend cannot starve the whole caller.
- Timeouts and deadline propagation — chapter 45; without a per-call timeout the breaker has no slow-call signal, and without deadline propagation a retry can outlive its own SLO.
- Idempotency keys at the API boundary — the prerequisite for safely retrying writes through a breaker.
- Hedged requests for the long tail — the latency-driven sibling: parallel attempts when the backend is healthy but slow.
- Retries: exponential backoff, jitter — the previous chapter; the retry envelope that the breaker fast-fails when conditions warrant.
The composition that ships in production looks like: bulkhead → breaker → retry → timeout → RPC. Each layer guards a different failure mode. Take any one layer out and the production-incident postmortem writes itself.
References
- Netflix Tech Blog, "Introducing Hystrix for Resilience Engineering" (2012) — the original announcement; reads as the design rationale for the pattern.
- Netflix, "Hystrix Wiki — How it Works" — definitive reference for the closed/open/half-open semantics and bulkhead model.
- Alibaba Sentinel — Circuit Breaking docs — slow-call ratio, RT-based breaking, adaptive load shedding.
- Resilience4j — CircuitBreaker module — the JVM successor; explains the move from thread-pool to semaphore isolation.
- Michael Nygard, Release It! (Pragmatic Bookshelf, 2nd ed. 2018) — Chapter "Stability Patterns" introduces the term "circuit breaker" in the context that Hystrix later popularised.
- Marc Brooker, "Timeouts, retries, and backoff with jitter" — AWS Builder's Library — adjacent practitioner reference; framing for retry/breaker composition.
- Retries: exponential backoff, jitter — the immediately preceding chapter; the retry layer the breaker wraps.
- Dean & Barroso, "The Tail at Scale" — CACM 2013 — the broader latency-tail context that motivates breakers and hedged requests together.