Bulkheads

CricStream's match-page service makes 9 downstream calls to render a single screen during a final: scorecard, commentary, predictions, ads, user profile, fantasy-team, push-token, video-manifest, recommended-clips. Eight of them are healthy. The ninth — recommended-clips — has a Redis replica doing a slow snapshot. Its p99 walks from 30 ms to 3.5 s. The match-page service has a 400-thread Tomcat pool. Within 9 seconds, more than 380 of those 400 threads are blocked inside getRecommendedClips(), waiting on the slow Redis. The other 8 backends are still healthy — scorecard, commentary, ads, all responding in under 50 ms — but the match-page service has no threads left to call them. /health stops responding. The orchestrator kills the pod. 67 sibling pods cascade. The circuit breaker for recommended-clips trips eventually, but by then the service is already down. The breaker was watching the wrong layer; the threads ran out before the failure rate ever crossed 50%.

A bulkhead is a hard cap on how many in-flight calls one downstream — or one workload class — can hold at once, enforced by a per-callee semaphore or thread pool. The name comes from ship hulls: a ruptured compartment floods that compartment alone, not the whole vessel. In a service, the bulkhead caps the damage radius of a single sick backend so other calls keep flowing. It is the partner of the circuit breaker, not a substitute — the breaker decides when to stop calling, the bulkhead caps how many threads can be parked inside a slow call right now.

Why the breaker is not enough — the failure that runs out the clock

Circuit breakers (chapter 43) react to a sliding window of failure rate. They need evidence before they trip — typically 50% failure over the last 20 calls. That logic is sound when failures are loud (5xx, RST, timeout) and the call rate is high enough to fill the window quickly. But two failure modes leave the breaker with nothing to react to:

  1. The slow but successful call. Recommended-clips returned 200 OK every time during CricStream's incident — Redis was slow, not broken. The breaker's failure-rate metric stayed at 0%. Even with the slow-call threshold turned on, the breaker needed 20 calls' worth of evidence to trip; at one call per match-page render, that took longer than the time required to drain the 400-thread pool.

  2. The first burst of failure during a low-traffic window. PaisaCard's rewards-redemption service averages 4 RPS. Its breaker is configured requestVolumeThreshold=20 — sensible, because a smaller window flaps. When the loyalty-points backend goes slow, the window fills far more slowly than the arithmetic suggests: a slow call only registers as an outcome when it completes. At 6.6 s per call, the first outcome lands around t=6.6 s and the 20th around t=11.6 s. Meanwhile arrivals keep coming: 4 RPS × 6.6 s ≈ 26 threads parked at any moment, climbing toward 30 as the backlog builds. The pool is 50 threads. The breaker trips around second 12 and saves what remains, but the damage is already done — /health is failing because not enough threads remain to handle the kubelet probe.

The bulkhead solves both. It does not need evidence. It does not have a sliding window. It is a counter: "no more than N calls into this downstream, ever, regardless of why". The (N+1)th call fast-fails before it consumes a thread. Why a counter beats a failure-rate window: the breaker observes outcomes that have already completed; the bulkhead observes calls that are still in flight. A pool can drain in seconds while the breaker is still gathering its 20-call window. Counters react in O(1) time; sliding windows react in O(window-size × call-period) time. For low-traffic, slow-failure cases, the latter is asymptotically too late.

[Figure: Bulkhead isolation — caps per downstream, not per pool. One caller (match-page service, 400-thread shared pool) makes calls to three downstreams, each guarded by its own semaphore: scorecard (limit 200, in-use 47 — healthy, headroom), commentary (limit 150, in-use 22 — healthy, headroom), recommended-clips (limit 30, in-use 30 — saturated, new calls fast-fail). When clips is sick, at most 30 of 400 threads are stuck; 370 still serve scorecard, commentary, and /health.]
Illustrative — not measured. Each downstream has its own semaphore that caps in-flight calls into it. A sick downstream can saturate its own lane but cannot drain the shared 400-thread pool below the sum of caps it does not own.
[Figure: Caller pool occupancy when one downstream goes slow at t=2 s. Two stacked time-series on the same x-axis (0–30 s). Top, without bulkhead: occupancy climbs from 50 of 400 threads at t=0 to 400 of 400 by t=10 s and stays saturated; /health fails at t=15 s, the pod is killed, and the cascade begins. Bottom, with bulkhead (limit 30): occupancy plateaus at ≈90 of 400 by t=4 s (≈30 stuck + 60 healthy); /health never fails; the downstream recovers at t=20 s and the pool returns to baseline.]
Illustrative — not measured data. The bulkhead's job is visible as the gap between "saturated at 400" and "plateau at ≈90". The plateau is the price of doing business; the saturation is the cost of skipping the bulkhead.

The two flavours — semaphore bulkhead vs thread-pool bulkhead

Bulkheads come in two implementations. The choice depends on whether your protected calls block a thread or run on an event loop.

Semaphore bulkhead. A counter — usually Semaphore in Java, asyncio.Semaphore in Python, a chan struct{} of fixed length in Go. Each call acquires a permit before invoking the downstream and releases it on return. If no permit is available, the call fast-fails immediately. The acquire/release is essentially free — a pair of atomic increments and a contention check. This is the right choice when the protected calls are asynchronous — coroutines or event-loop callbacks that never park a thread — or when call volume is high enough that per-call overhead matters.
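For the async case, the cap is a few lines on top of asyncio.Semaphore. A minimal sketch, not any library's API — the decorator and fetch_clips below are illustrative:

    # async_bulkhead.py — semaphore bulkhead for event-loop code
    import asyncio

    class BulkheadFullError(Exception):
        pass

    def bulkhead(limit):
        """Wrap a coroutine so at most `limit` invocations run concurrently;
        callers beyond the limit fast-fail instead of queueing."""
        sem = asyncio.Semaphore(limit)
        def wrap(coro_fn):
            async def guarded(*args, **kwargs):
                if sem.locked():                    # all permits taken -> fast-fail
                    raise BulkheadFullError("bulkhead full")
                async with sem:                     # release on every exit path
                    return await coro_fn(*args, **kwargs)
            return guarded
        return wrap

    @bulkhead(limit=2)
    async def fetch_clips(i):
        await asyncio.sleep(0.1)                    # stand-in for a slow downstream
        return f"clips-{i}"

    async def main():
        results = await asyncio.gather(*(fetch_clips(i) for i in range(5)),
                                       return_exceptions=True)
        ok = sum(1 for r in results if isinstance(r, str))
        rejected = sum(1 for r in results if isinstance(r, BulkheadFullError))
        return ok, rejected

    ok, rejected = asyncio.run(main())
    print(f"ok={ok} fast-fail={rejected}")          # ok=2 fast-fail=3

Note that no thread is parked anywhere — the event loop keeps running, and rejected callers learn their fate in microseconds.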

Thread-pool bulkhead. A separate, fixed-size executor for each protected downstream. The caller's request thread submits a task to the executor and waits for the future (with timeout). When the executor is full and its queue is full, submission throws and the call fast-fails. The cost is real: a context switch on every call (typically 2–5 μs on Linux), and a separate scheduler queue per downstream. The benefit is real too: the caller's main thread is never parked inside the slow call, only the executor's worker is — so saturation of one executor cannot starve the request-handler pool. This is the right choice when the protected calls block a thread — synchronous JDBC, blocking HTTP clients — and you cannot afford to have request-handler threads be the ones parked inside a slow call.
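A minimal thread-pool bulkhead can be sketched with the standard library. The class and numbers below are illustrative, not a production implementation:

    # tp_bulkhead.py — thread-pool bulkhead sketch
    import concurrent.futures, threading, time

    class BulkheadFullError(Exception):
        pass

    class ThreadPoolBulkhead:
        """Per-downstream fixed executor: when every worker is busy, new calls
        fast-fail instead of parking the caller's own thread in the slow call."""
        def __init__(self, name, max_workers, call_timeout_s):
            self.name = name
            self.call_timeout_s = call_timeout_s
            self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
            self.permits = threading.BoundedSemaphore(max_workers)  # also bounds the queue

        def _run(self, fn, *args):
            try:
                return fn(*args)
            finally:
                self.permits.release()          # permit freed when the worker finishes

        def call(self, fn, *args):
            if not self.permits.acquire(blocking=False):
                raise BulkheadFullError(f"{self.name} full")
            future = self.pool.submit(self._run, fn, *args)
            # The caller waits a bounded time; on timeout its thread is freed
            # even though the worker may still be parked inside the slow call.
            return future.result(timeout=self.call_timeout_s)

    # demo: two workers, a 0.5 s downstream, a 50 ms caller budget
    bh = ThreadPoolBulkhead("clips", max_workers=2, call_timeout_s=0.05)
    slow = lambda: time.sleep(0.5) or "ok"
    outcomes = []
    for _ in range(4):
        try:
            outcomes.append(bh.call(slow))
        except BulkheadFullError:
            outcomes.append("fast-fail")
        except concurrent.futures.TimeoutError:
            outcomes.append("timeout")
    print(outcomes)   # ['timeout', 'timeout', 'fast-fail', 'fast-fail']

The two outcomes illustrate the two defences: the first two callers hit the timeout and get their threads back after 50 ms; the last two never park a thread at all.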

                                         Semaphore                      Thread pool
Per-call overhead                        ~50 ns (atomic ops)            ~2–5 μs (context switch + queue)
Memory per bulkhead                      1 counter                      N threads × stack (typ. 1 MB each)
Caller thread parked in slow call?       Yes (until permit times out)   No (only the worker thread)
Default in Hystrix                       Available, opt-in              Mandatory
Default in Sentinel / Resilience4j       Default                        Opt-in
Right for async code?                    Yes                            No (defeats async)
Right for blocking JDBC / sync HTTP?     Acceptable                     Better

Hystrix's default thread-pool-per-command is what made it heavy at scale (chapter 43 explained why Netflix retired it). Resilience4j and Sentinel both default to semaphore bulkheads now, which is the right call for modern async code where the calling thread is not pinned to the request anyway.

Building one — a working semaphore bulkhead

# bulkhead.py — semaphore bulkhead with fast-fail and timed acquire
import time, random, threading
from contextlib import contextmanager

class BulkheadFullError(Exception): pass

class Bulkhead:
    def __init__(self, name, limit, acquire_timeout_ms=0):
        self.name = name
        self.limit = limit
        self.acquire_timeout_s = acquire_timeout_ms / 1000.0
        self.sem = threading.Semaphore(limit)
        self.in_use = 0
        self.rejected = 0
        self.lock = threading.Lock()

    @contextmanager
    def acquire(self):
        ok = self.sem.acquire(timeout=self.acquire_timeout_s) if self.acquire_timeout_s > 0 \
             else self.sem.acquire(blocking=False)
        if not ok:
            with self.lock: self.rejected += 1
            raise BulkheadFullError(f"{self.name} full: {self.in_use}/{self.limit}")
        with self.lock: self.in_use += 1
        try: yield
        finally:
            with self.lock: self.in_use -= 1
            self.sem.release()

# --- demo: 200 callers, downstream becomes slow at t=1s
def downstream(t_start):
    if time.time() - t_start > 1.0: time.sleep(2.0)   # slow regime
    else: time.sleep(0.01)
    return "ok"

def caller(bh, t_start, results, idx):
    try:
        with bh.acquire():
            downstream(t_start)
            results[idx] = "ok"
    except BulkheadFullError:
        results[idx] = "fast-fail"

bh = Bulkhead("clips", limit=30, acquire_timeout_ms=0)   # zero = non-blocking acquire
start = time.time()
threads, results = [], [None] * 200
for i in range(200):
    t = threading.Thread(target=caller, args=(bh, start, results, i))
    threads.append(t); t.start()
    time.sleep(0.01)   # 100 RPS arrival rate
for t in threads: t.join()

ok = sum(1 for r in results if r == "ok")
ff = sum(1 for r in results if r == "fast-fail")
print(f"ok={ok}  fast-fail={ff}  rejected_total={bh.rejected}")

Sample run on an M2 MacBook Air, Python 3.11:

ok=104  fast-fail=96  rejected_total=96

Per-line walkthrough. Bulkhead(name, limit, acquire_timeout_ms) is the policy: limit=30 is the cap, acquire_timeout_ms=0 means non-blocking — if the semaphore is full, fail immediately rather than waiting in line. @contextmanager lets the call site say with bh.acquire(): rpc() so that release happens on every exit path including exceptions. self.sem.acquire(blocking=False) is the fast-fail path; it returns False instantly if all 30 permits are out. The in_use counter is for observability — it shows the dashboard how saturated each bulkhead is right now, which is the metric you alert on.

Read the result: in the first second, calls complete in 10 ms, the bulkhead never saturates, all callers succeed. After t=1s the downstream becomes slow; with 30 permits and 2-second calls, the bulkhead saturates at the 30th in-flight call and every subsequent caller fast-fails until permits are released. The 96 fast-fails protected the caller — those threads returned to their pool in microseconds instead of being parked for 2 seconds.

Why fast-fail is correct here even though it returns "fast-fail" for some users: the alternative is that those 96 threads pile up, the caller's pool runs out, and every other caller (calling other healthy backends) starts to fail too. The bulkhead converts "everyone times out at 2s" into "30 users get the slow service, 96 users get an instant degraded response, the rest of the system stays healthy". The total user-perceived availability is higher with the bulkhead than without.

Sizing the bulkhead — Little's Law tells you the answer

The hardest question with bulkheads is "what should the limit be?" The answer is not arbitrary; Little's Law gives you a closed form. For a downstream with average steady-state latency L seconds and offered traffic R requests per second, the average concurrency is L × R. To carry the steady-state load with no rejections, the bulkhead limit must be ≥ L × R. To carry the peak load, it must be ≥ L_peak × R_peak. Above that, the bulkhead has headroom; below that, it rejects healthy traffic.

The rule of thumb that actually works in production:

limit = ceil(p99_latency_seconds × peak_RPS × 1.5)

The 1.5 factor is the headroom for short bursts. Worked example for CricStream's recommended-clips: average latency 30 ms, p99 latency 80 ms, peak 400 RPS during a final. 0.080 × 400 × 1.5 = 48. Round to 50. Why use p99 not average: average concurrency is what the system holds in steady state, but a sized-to-average bulkhead will reject during the natural variance around the average. Little's Law applies to a stable distribution; production has spikes. p99-sized bulkheads carry the burst without rejecting healthy traffic, while still capping the damage radius if the downstream goes slow (a 10× slowdown turns a 50-permit bulkhead into a 50-permit bulkhead — the limit does not move, only the concurrency at the limit does, which is exactly what you want).
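The rule of thumb is one line of code. A sketch — the function name bulkhead_limit is illustrative:

    import math

    def bulkhead_limit(p99_latency_s, peak_rps, headroom=1.5):
        """Little's Law concurrency (latency × arrival rate), padded for bursts."""
        return math.ceil(p99_latency_s * peak_rps * headroom)

    # CricStream's recommended-clips: p99 80 ms, peak 400 RPS during a final
    print(bulkhead_limit(0.080, 400))   # 48 — round up to 50 in config

Run it against each downstream's p99 and peak from your metrics store; the output is a starting point to tune, not a final answer.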

The other consideration is the caller's total pool. The sum of all bulkhead limits should ideally not exceed (caller pool size − threads reserved for health checks and other handlers). CricStream's match-page service calls 9 downstreams; with its 400-thread pool and 30 threads reserved for health checks and admin endpoints, the 9 bulkheads share 370 threads. If every limit were sized to peak, the sum would be 600 — meaning the caller's pool could be drained even with every bulkhead operating below its cap. The fix is to size each bulkhead to a fraction of peak proportional to its criticality: scorecard gets 60, commentary 40, ads 30, ..., clips 8. The total stays under 370. The clips bulkhead is small deliberately — recommended-clips is non-essential for rendering the match page, so its damage radius is bounded tightly.
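The proportional split can be mechanised. A sketch with an illustrative budget and made-up criticality weights — not any service's real numbers:

    import math

    def allocate_caps(budget, weights):
        """Split a caller's thread budget across downstreams in proportion to
        criticality weight; flooring keeps the total at or under the budget."""
        total = sum(weights.values())
        return {name: max(1, math.floor(budget * w / total))
                for name, w in weights.items()}

    caps = allocate_caps(100, {"critical-api": 6, "secondary-api": 3, "nice-to-have": 1})
    print(caps, sum(caps.values()))   # caps sum to at most the 100-thread budget

The weights are a business decision, not a latency measurement — they encode which downstreams the page can live without.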

Common confusions

  1. "My circuit breaker already does this." The breaker reacts to outcomes that have already completed; the bulkhead caps calls still in flight. As the CricStream and PaisaCard incidents show, a pool can drain before the breaker's window ever fills.
  2. "A bulkhead is a rate limiter." A rate limiter caps calls per second; a bulkhead caps concurrent calls. A downstream taking 5 s per call at only 10 RPS holds 50 threads — invisible to a rate limit, fatal to a small pool.
  3. "Fast-fail means lost availability." The rejected calls were going to time out anyway once the pool drained. Converting them into instant, handleable errors is what keeps every other call path alive.

Going deeper

The naval origin — why "bulkhead" is the right word

A ship's bulkhead is a wall that divides the hull into watertight compartments. If the hull is breached, water floods only the breached compartment; the others stay dry, the ship stays afloat. The Titanic's design had this — she was built to stay afloat with her first four compartments flooded — but the iceberg opened at least five. The lesson encoded in the term: bulkheads do not prevent failure; they bound the volume of water that one failure admits. In a service, the bulkhead does not prevent a downstream from being slow; it bounds the number of caller threads that one slow downstream can capture. The metaphor is so apt that Michael Nygard's Release It! (2007) used it to name the pattern, and the name has stuck through three generations of resilience libraries (Hystrix, Resilience4j, Sentinel).

MealRush's promo-engine bulkhead saved Diwali — the post-incident analysis

MealRush's cart service calls 11 downstreams during checkout. On Diwali 2025 at 19:42 IST — peak order time — the promo-engine's Postgres replica failed over to a stale replica. Latency on applyPromo walked from 18 ms to 4.1 s. Without a bulkhead, the cart pool (350 threads) would have drained inside 11 seconds at 800 RPS. With a per-downstream semaphore bulkhead sized 50 for promo-engine, exactly 50 threads got stuck at any one time; the other 300 served addItem, getCart, priceItem calls normally. Users without an active promo saw no impact at all. Users with a promo saw "promo unavailable, cart total ₹{undiscounted}" instead of a 503 from cart. The promo-engine recovered in 2 minutes 14 seconds; total user-visible impact was 27,000 carts that ordered without their promo applied, mostly small discounts (₹15–₹50). MealRush's SRE incident review concluded the bulkhead "converted a P0 outage into a P3 partial degradation" — the same wording many incident reviews land on the first time they see a bulkhead do its job in production. Why the user got a degraded experience and not a hard failure: the call site for applyPromo was wrapped with try: bh.acquire(); applyPromo() except BulkheadFullError: skip_promo(). The bulkhead's fast-fail was caught and translated into "no promo this time" — a business decision the cart team had pre-agreed with marketing. The bulkhead handed control back to the caller in 50 microseconds; the fallback path did the rest. This is the pattern: bulkhead reports "no permit available", caller decides what graceful degradation looks like.
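The call-site pattern the incident review describes can be sketched end to end. Everything below is illustrative — apply_promo, skip_promo, and the ₹30 discount are stand-ins, not MealRush's code:

    # promo_fallback.py — bulkhead fast-fail translated into graceful degradation
    import threading
    from contextlib import contextmanager

    class BulkheadFullError(Exception):
        pass

    class Bulkhead:
        """Minimal inline version of the semaphore bulkhead from the article."""
        def __init__(self, limit):
            self.sem = threading.Semaphore(limit)

        @contextmanager
        def acquire(self):
            if not self.sem.acquire(blocking=False):
                raise BulkheadFullError()
            try:
                yield
            finally:
                self.sem.release()

    def apply_promo(cart):    # stand-in for the guarded downstream call
        return {"total": cart["total"] - 30, "promo": True}

    def skip_promo(cart):     # pre-agreed business fallback: undiscounted total
        return {"total": cart["total"], "promo": False}

    def price_cart(bh, cart):
        """A full bulkhead means 'no promo this time', never a 503 from cart."""
        try:
            with bh.acquire():
                return apply_promo(cart)
        except BulkheadFullError:
            return skip_promo(cart)

    bh = Bulkhead(limit=1)
    bh.sem.acquire()                          # simulate one in-flight slow promo call
    degraded = price_cart(bh, {"total": 250})
    bh.sem.release()
    normal = price_cart(bh, {"total": 250})
    print(degraded, normal)

The bulkhead only reports "no permit available"; price_cart owns the decision about what the degraded response looks like — exactly the division of labour the incident review describes.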

The relationship to backpressure and queue-depth limits

Bulkheads are a special case of the broader pattern called bounded concurrency or backpressure. Anywhere you have a producer that can outpace a consumer, you need a bounded buffer between them — the bound is what stops one producer from consuming the system's memory or threads. A bulkhead is just bounded concurrency at the caller→callee boundary. The same idea reappears as queue-depth limits at message brokers, max-in-flight at gRPC clients, and maxQueuedTasks at executor services. The reference point for this whole family of ideas is Brendan Gregg's USE method (chapter 50 in this curriculum will revisit it) — every resource needs a saturation metric and a saturation-rejection policy. Bulkheads are the in-process version of that idea applied to outbound calls.
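The same bound, one layer down, is any fixed-size buffer with a non-blocking put. A minimal sketch with the standard library:

    import queue

    # A producer that outpaces its consumer is rejected at the buffer instead
    # of growing it without limit; put_nowait is the fast-fail path.
    buf = queue.Queue(maxsize=3)
    accepted, rejected = 0, 0
    for item in range(10):            # burst of 10 items, nobody consuming
        try:
            buf.put_nowait(item)
            accepted += 1
        except queue.Full:
            rejected += 1
    print(accepted, rejected)         # 3 7 — rejections are the saturation signal

Whether the bound is a semaphore, an executor queue, or a broker's queue-depth limit, the rejection count is the metric to alert on.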

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# save bulkhead.py from the article body
python3 bulkhead.py
# Expected: ok≈100, fast-fail≈100. Increase the limit from 30 to 100 and re-run;
# the fast-fail count drops sharply because the bulkhead never saturates.
# Then reduce the slow-regime sleep from 2.0 to 0.1 and notice that fast-fails go
# to ~zero — the bulkhead only saturates when in-flight calls are slow.

Where this leads next

Bulkheads sit between the breaker and the timeout in the failure-isolation stack. Each layer guards a different mode:

  1. Timeout — guards against a single call that hangs: it bounds how long one call can hold a thread (or a bulkhead permit).
  2. Circuit breaker — guards against a callee that keeps failing: once the evidence is in, it stops sending calls at all.
  3. Bulkhead — guards against too many calls in flight at once: it caps concurrency before any evidence of failure exists.

The composition that keeps services up is bulkhead → breaker → retry → timeout → RPC. Take the bulkhead out and the breaker is too slow; take the breaker out and the bulkhead saturates forever; take the timeout out and the bulkhead's permits are never released. Three layers, three failure modes, one composed defence.
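The composition order can be made concrete. A toy sketch — the three classes below are illustrative stand-ins, not any library's API; the breaker in particular is reduced to a consecutive-failure counter:

    # composed_call.py — bulkhead first, breaker second, timeout inside
    import threading
    from contextlib import contextmanager

    class FastFail(Exception):
        pass

    class Bulkhead:
        """Layer 1 — caps in-flight calls; an O(1) check, no evidence needed."""
        def __init__(self, limit):
            self.sem = threading.Semaphore(limit)

        @contextmanager
        def slot(self):
            if not self.sem.acquire(blocking=False):
                raise FastFail("bulkhead full")
            try:
                yield
            finally:
                self.sem.release()

    class Breaker:
        """Layer 2 — a toy stand-in that opens after N consecutive failures."""
        def __init__(self, max_failures):
            self.failures, self.max_failures = 0, max_failures

        def check(self):
            if self.failures >= self.max_failures:
                raise FastFail("breaker open")

        def record(self, ok):
            self.failures = 0 if ok else self.failures + 1

    def guarded_call(bh, br, rpc, timeout_s):
        with bh.slot():                        # bulkhead: is there a permit right now?
            br.check()                         # breaker: have we given up on this callee?
            try:
                result = rpc(timeout=timeout_s)  # timeout bounds how long the permit is held
                br.record(ok=True)
                return result
            except Exception:
                br.record(ok=False)
                raise

    bh, br = Bulkhead(limit=10), Breaker(max_failures=3)
    print(guarded_call(bh, br, lambda timeout: "ok", timeout_s=0.5))   # ok

The ordering is the point: the bulkhead check is free and evidence-less, so it runs first; the breaker consults accumulated evidence; the timeout guarantees that every permit the bulkhead hands out eventually comes back.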

References

  1. Michael Nygard, Release It! (Pragmatic Bookshelf, 2nd ed. 2018) — the book that introduced the bulkhead-as-naval-metaphor pattern in the Stability Patterns chapter.
  2. Resilience4j — Bulkhead module — the JVM successor to Hystrix; explains semaphore vs thread-pool isolation in detail.
  3. Alibaba Sentinel — Concurrency limiting / Isolation — Sentinel's concurrency-control rules, which are bulkheads under a different name.
  4. Netflix Hystrix Wiki — How it Works (Bulkhead Pattern) — original thread-pool-per-command rationale and the case for mandatory isolation.
  5. Marc Brooker, "Caution: Decreasing Returns Ahead" — AWS Builder's Library — the broader case for bounded concurrency at scale.
  6. Little's Law — John D.C. Little, "A Proof for the Queuing Formula L = λW" (1961) — the closed form behind bulkhead sizing.
  7. Circuit breakers (Hystrix, Sentinel) — the previous chapter; the breaker that the bulkhead complements.
  8. Brendan Gregg, "The USE Method" — saturation as a first-class metric; bulkhead rejections are the saturation signal for outbound calls.