Bulkheads

CricStream's match-page service makes 9 downstream calls to render a single screen during a final: scorecard, commentary, predictions, ads, user profile, fantasy-team, push-token, video-manifest, recommended-clips. Eight of them are healthy. The ninth — recommended-clips — has a Redis replica doing a slow snapshot. Its p99 walks from 30 ms to 3.5 s. The match-page service has a 400-thread Tomcat pool. Within 9 seconds, more than 380 of those 400 threads are blocked inside getRecommendedClips(), waiting on the slow Redis. The other 8 backends are still healthy — scorecard, commentary, ads, all responding in under 50 ms — but the match-page service has no threads left to call them. /health stops responding. The orchestrator kills the pod. 67 sibling pods cascade. The circuit breaker for recommended-clips trips eventually, but by then the service is already down. The breaker was watching the wrong layer; the threads ran out before the failure rate ever crossed 50%.

A bulkhead is a hard cap on how many in-flight calls one downstream — or one workload class — can hold at once, enforced by a per-callee semaphore or thread pool. The name comes from ship hulls: a ruptured compartment floods that compartment alone, not the whole vessel. In a service, the bulkhead caps the damage radius of a single sick backend so other calls keep flowing. It is the partner of the circuit breaker, not a substitute — the breaker decides when to stop calling, the bulkhead caps how many threads can be parked inside a slow call right now.

Why the breaker is not enough — the failure that runs out the clock

Circuit breakers (chapter 43) react to a sliding window of failure rate. They need evidence before they trip — typically 50% failure over the last 20 calls. That logic is sound when failures are loud (5xx, RST, timeout) and the call rate is high enough to fill the window quickly. But two failure modes leave the breaker with nothing to react to:

  1. The slow but successful call. Recommended-clips returned 200 OK every time during CricStream's incident — Redis was slow, not broken. The breaker's failure-rate metric stayed at 0%. Even with the slow-call threshold turned on, the breaker needed 20 calls' worth of evidence to trip; at one call per match-page render, that took longer than the time required to drain the 400-thread pool.

  2. The first burst of failure during a low-traffic window. PaisaCard's rewards-redemption service averages 4 RPS. Its breaker is configured requestVolumeThreshold=20 — sensible, because a smaller window flaps. When the loyalty-points backend goes slow, the window fills far more slowly than the arithmetic suggests: a slow call only registers as an outcome when it completes. At 6.6 s per call, the first outcome lands around t=6.6 s and the 20th around t=11.6 s. Meanwhile arrivals keep coming: 4 RPS × 6.6 s ≈ 26 threads parked at any moment, climbing toward 30 as the backlog builds. The pool is 50 threads. The breaker trips around second 12 and saves what remains, but the damage is already done — /health is failing because not enough threads remain to handle the kubelet probe.

The bulkhead solves both. It does not need evidence. It does not have a sliding window. It is a counter: "no more than N calls into this downstream, ever, regardless of why". The (N+1)th call fast-fails before it consumes a thread. Why a counter beats a failure-rate window: the breaker observes outcomes that have already completed; the bulkhead observes calls that are still in flight. A pool can drain in seconds while the breaker is still gathering its 20-call window. Counters react in O(1) time; sliding windows react in O(window-size × call-period) time. For low-traffic, slow-failure cases, the latter is asymptotically too late.

[Figure: Bulkhead isolation — caps per downstream, not per pool. One caller (match-page service, 400-thread shared pool) makes calls to three downstreams, each guarded by its own semaphore: scorecard (limit 200, in-use 47 — healthy, headroom), commentary (limit 150, in-use 22 — healthy, headroom), recommended-clips (limit 30, in-use 30 — saturated, new calls fast-fail). When clips is sick, at most 30 of 400 threads are stuck; 370 still serve scorecard, commentary, and /health.]
Illustrative — not measured. Each downstream has its own semaphore that caps in-flight calls into it. A sick downstream can saturate its own lane but cannot drain the shared 400-thread pool below the sum of caps it does not own.
[Figure: Caller pool occupancy when one downstream goes slow at t=2 s. Two stacked time-series on the same x-axis (0–30 s). Top, without bulkhead: occupancy climbs from 50 of 400 threads at t=0 to 400 of 400 by t=10 s and stays saturated; /health fails at t=15 s, the pod is killed, and the cascade begins. Bottom, with bulkhead (limit 30): occupancy plateaus at ≈90 of 400 by t=4 s (≈30 stuck + 60 healthy); /health never fails; the downstream recovers at t=20 s and the pool returns to baseline.]
Illustrative — not measured data. The bulkhead's job is visible as the gap between "saturated at 400" and "plateau at ≈90". The plateau is the price of doing business; the saturation is the cost of skipping the bulkhead.

The two flavours — semaphore bulkhead vs thread-pool bulkhead

Bulkheads come in two implementations. The choice depends on whether your protected calls block a thread or run on an event loop.

Semaphore bulkhead. A counter — usually Semaphore in Java, asyncio.Semaphore in Python, a chan struct{} of fixed length in Go. Each call acquires a permit before invoking the downstream and releases it on return. If no permit is available, the call fast-fails immediately. The acquire/release is essentially free — a pair of atomic increments and a contention check. This is the right choice when the protected calls are asynchronous — coroutines or event-loop callbacks that never park a thread — or when call volume is high enough that per-call overhead matters.
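For the async case, the cap is a few lines on top of asyncio.Semaphore. A minimal sketch, not any library's API — the decorator and fetch_clips below are illustrative:

    # async_bulkhead.py — semaphore bulkhead for event-loop code
    import asyncio

    class BulkheadFullError(Exception):
        pass

    def bulkhead(limit):
        """Wrap a coroutine so at most `limit` invocations run concurrently;
        callers beyond the limit fast-fail instead of queueing."""
        sem = asyncio.Semaphore(limit)
        def wrap(coro_fn):
            async def guarded(*args, **kwargs):
                if sem.locked():                    # all permits taken -> fast-fail
                    raise BulkheadFullError("bulkhead full")
                async with sem:                     # release on every exit path
                    return await coro_fn(*args, **kwargs)
            return guarded
        return wrap

    @bulkhead(limit=2)
    async def fetch_clips(i):
        await asyncio.sleep(0.1)                    # stand-in for a slow downstream
        return f"clips-{i}"

    async def main():
        results = await asyncio.gather(*(fetch_clips(i) for i in range(5)),
                                       return_exceptions=True)
        ok = sum(1 for r in results if isinstance(r, str))
        rejected = sum(1 for r in results if isinstance(r, BulkheadFullError))
        return ok, rejected

    ok, rejected = asyncio.run(main())
    print(f"ok={ok} fast-fail={rejected}")          # ok=2 fast-fail=3

Note that no thread is parked anywhere — the event loop keeps running, and rejected callers learn their fate in microseconds.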

Thread-pool bulkhead. A separate, fixed-size executor for each protected downstream. The caller's request thread submits a task to the executor and waits for the future (with timeout). When the executor is full and its queue is full, submission throws and the call fast-fails. The cost is real: a context switch on every call (typically 2–5 μs on Linux), and a separate scheduler queue per downstream. The benefit is real too: the caller's main thread is never parked inside the slow call, only the executor's worker is — so saturation of one executor cannot starve the request-handler pool. This is the right choice when the protected calls block a thread — synchronous JDBC, blocking HTTP clients — and you cannot afford to have request-handler threads be the ones parked inside a slow call.
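A minimal thread-pool bulkhead can be sketched with the standard library. The class and numbers below are illustrative, not a production implementation:

    # tp_bulkhead.py — thread-pool bulkhead sketch
    import concurrent.futures, threading, time

    class BulkheadFullError(Exception):
        pass

    class ThreadPoolBulkhead:
        """Per-downstream fixed executor: when every worker is busy, new calls
        fast-fail instead of parking the caller's own thread in the slow call."""
        def __init__(self, name, max_workers, call_timeout_s):
            self.name = name
            self.call_timeout_s = call_timeout_s
            self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
            self.permits = threading.BoundedSemaphore(max_workers)  # also bounds the queue

        def _run(self, fn, *args):
            try:
                return fn(*args)
            finally:
                self.permits.release()          # permit freed when the worker finishes

        def call(self, fn, *args):
            if not self.permits.acquire(blocking=False):
                raise BulkheadFullError(f"{self.name} full")
            future = self.pool.submit(self._run, fn, *args)
            # The caller waits a bounded time; on timeout its thread is freed
            # even though the worker may still be parked inside the slow call.
            return future.result(timeout=self.call_timeout_s)

    # demo: two workers, a 0.5 s downstream, a 50 ms caller budget
    bh = ThreadPoolBulkhead("clips", max_workers=2, call_timeout_s=0.05)
    slow = lambda: time.sleep(0.5) or "ok"
    outcomes = []
    for _ in range(4):
        try:
            outcomes.append(bh.call(slow))
        except BulkheadFullError:
            outcomes.append("fast-fail")
        except concurrent.futures.TimeoutError:
            outcomes.append("timeout")
    print(outcomes)   # ['timeout', 'timeout', 'fast-fail', 'fast-fail']

The two outcomes illustrate the two defences: the first two callers hit the timeout and get their threads back after 50 ms; the last two never park a thread at all.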

                                         Semaphore                      Thread pool
Per-call overhead                        ~50 ns (atomic ops)            ~2–5 μs (context switch + queue)
Memory per bulkhead                      1 counter                      N threads × stack (typ. 1 MB each)
Caller thread parked in slow call?       Yes (until permit times out)   No (only the worker thread)
Default in Hystrix                       Available, opt-in              Mandatory
Default in Sentinel / Resilience4j       Default                        Opt-in
Right for async code?                    Yes                            No (defeats async)
Right for blocking JDBC / sync HTTP?     Acceptable                     Better

Hystrix's default thread-pool-per-command is what made it heavy at scale (chapter 43 explained why Netflix retired it). Resilience4j and Sentinel both default to semaphore bulkheads now, which is the right call for modern async code where the calling thread is not pinned to the request anyway.

Building one — a working semaphore bulkhead

# bulkhead.py — semaphore bulkhead with fast-fail and timed acquire
import time, random, threading
from contextlib import contextmanager

class BulkheadFullError(Exception): pass

class Bulkhead:
    def __init__(self, name, limit, acquire_timeout_ms=0):
        self.name = name
        self.limit = limit
        self.acquire_timeout_s = acquire_timeout_ms / 1000.0
        self.sem = threading.Semaphore(limit)
        self.in_use = 0
        self.rejected = 0
        self.lock = threading.Lock()

    @contextmanager
    def acquire(self):
        ok = self.sem.acquire(timeout=self.acquire_timeout_s) if self.acquire_timeout_s > 0 \
             else self.sem.acquire(blocking=False)
        if not ok:
            with self.lock: self.rejected += 1
            raise BulkheadFullError(f"{self.name} full: {self.in_use}/{self.limit}")
        with self.lock: self.in_use += 1
        try: yield
        finally:
            with self.lock: self.in_use -= 1
            self.sem.release()

# --- demo: 200 callers, downstream becomes slow at t=1s
def downstream(t_start):
    if time.time() - t_start > 1.0: time.sleep(2.0)   # slow regime
    else: time.sleep(0.01)
    return "ok"

def caller(bh, t_start, results, idx):
    try:
        with bh.acquire():
            downstream(t_start)
            results[idx] = "ok"
    except BulkheadFullError:
        results[idx] = "fast-fail"

bh = Bulkhead("clips", limit=30, acquire_timeout_ms=0)   # zero = non-blocking acquire
start = time.time()
threads, results = [], [None] * 200
for i in range(200):
    t = threading.Thread(target=caller, args=(bh, start, results, i))
    threads.append(t); t.start()
    time.sleep(0.01)   # 100 RPS arrival rate
for t in threads: t.join()

ok = sum(1 for r in results if r == "ok")
ff = sum(1 for r in results if r == "fast-fail")
print(f"ok={ok}  fast-fail={ff}  rejected_total={bh.rejected}")

Sample run on an M2 MacBook Air, Python 3.11:

ok=104  fast-fail=96  rejected_total=96

Per-line walkthrough. Bulkhead(name, limit, acquire_timeout_ms) is the policy: limit=30 is the cap, acquire_timeout_ms=0 means non-blocking — if the semaphore is full, fail immediately rather than waiting in line. @contextmanager lets the call site say with bh.acquire(): rpc() so that release happens on every exit path including exceptions. self.sem.acquire(blocking=False) is the fast-fail path; it returns False instantly if all 30 permits are out. The in_use counter is for observability — it shows the dashboard how saturated each bulkhead is right now, which is the metric you alert on.

Read the result: in the first second, calls complete in 10 ms, the bulkhead never saturates, all callers succeed. After t=1s the downstream becomes slow; with 30 permits and 2-second calls, the bulkhead saturates at the 30th in-flight call and every subsequent caller fast-fails until permits are released. The 96 fast-fails protected the caller — those threads returned to their pool in microseconds instead of being parked for 2 seconds.

Why fast-fail is correct here even though it returns "fast-fail" for some users: the alternative is that those 96 threads pile up, the caller's pool runs out, and every other caller (calling other healthy backends) starts to fail too. The bulkhead converts "everyone times out at 2s" into "30 users get the slow service, 96 users get an instant degraded response, the rest of the system stays healthy". The total user-perceived availability is higher with the bulkhead than without.

Sizing the bulkhead — Little's Law tells you the answer

The hardest question with bulkheads is "what should the limit be?" The answer is not arbitrary; Little's Law gives you a closed form. For a downstream with average steady-state latency L seconds and offered traffic R requests per second, the average concurrency is L × R. To carry the steady-state load with no rejections, the bulkhead limit must be ≥ L × R. To carry the peak load, it must be ≥ L_peak × R_peak. Above that, the bulkhead has headroom; below that, it rejects healthy traffic.

The rule of thumb that actually works in production:

limit = ceil(p99_latency_seconds × peak_RPS × 1.5)

The 1.5 factor is the headroom for short bursts. Worked example for CricStream's recommended-clips: average latency 30 ms, p99 latency 80 ms, peak 400 RPS during a final. 0.080 × 400 × 1.5 = 48. Round to 50. Why use p99 not average: average concurrency is what the system holds in steady state, but a sized-to-average bulkhead will reject during the natural variance around the average. Little's Law applies to a stable distribution; production has spikes. p99-sized bulkheads carry the burst without rejecting healthy traffic, while still capping the damage radius if the downstream goes slow (a 10× slowdown turns a 50-permit bulkhead into a 50-permit bulkhead — the limit does not move, only the concurrency at the limit does, which is exactly what you want).
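The rule of thumb is one line of code. A sketch — the function name bulkhead_limit is illustrative:

    import math

    def bulkhead_limit(p99_latency_s, peak_rps, headroom=1.5):
        """Little's Law concurrency (latency × arrival rate), padded for bursts."""
        return math.ceil(p99_latency_s * peak_rps * headroom)

    # CricStream's recommended-clips: p99 80 ms, peak 400 RPS during a final
    print(bulkhead_limit(0.080, 400))   # 48 — round up to 50 in config

Run it against each downstream's p99 and peak from your metrics store; the output is a starting point to tune, not a final answer.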

The other consideration is the caller's total pool. The sum of all bulkhead limits should ideally not exceed (caller pool size − threads reserved for health checks and other handlers). CricStream's match-page service calls 9 downstreams; with its 400-thread pool and 30 threads reserved for health checks and admin endpoints, the 9 bulkheads share 370 threads. If every limit were sized to peak, the sum would be 600 — meaning the caller's pool could be drained even with every bulkhead operating below its cap. The fix is to size each bulkhead to a fraction of peak proportional to its criticality: scorecard gets 60, commentary 40, ads 30, ..., clips 8. The total stays under 370. The clips bulkhead is small deliberately — recommended-clips is non-essential for rendering the match page, so its damage radius is bounded tightly.
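The proportional split can be mechanised. A sketch with an illustrative budget and made-up criticality weights — not any service's real numbers:

    import math

    def allocate_caps(budget, weights):
        """Split a caller's thread budget across downstreams in proportion to
        criticality weight; flooring keeps the total at or under the budget."""
        total = sum(weights.values())
        return {name: max(1, math.floor(budget * w / total))
                for name, w in weights.items()}

    caps = allocate_caps(100, {"critical-api": 6, "secondary-api": 3, "nice-to-have": 1})
    print(caps, sum(caps.values()))   # caps sum to at most the 100-thread budget

The weights are a business decision, not a latency measurement — they encode which downstreams the page can live without.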

Common confusions

  1. "My circuit breaker already does this." The breaker reacts to outcomes that have already completed; the bulkhead caps calls still in flight. As the CricStream and PaisaCard incidents show, a pool can drain before the breaker's window ever fills.
  2. "A bulkhead is a rate limiter." A rate limiter caps calls per second; a bulkhead caps concurrent calls. A downstream taking 5 s per call at only 10 RPS holds 50 threads — invisible to a rate limit, fatal to a small pool.
  3. "Fast-fail means lost availability." The rejected calls were going to time out anyway once the pool drained. Converting them into instant, handleable errors is what keeps every other call path alive.

Going deeper

The naval origin — why "bulkhead" is the right word

A ship's bulkhead is a wall that divides the hull into watertight compartments. If the hull is breached, water floods only the breached compartment; the others stay dry, the ship stays afloat. The Titanic's design had this — she was built to stay afloat with her first four compartments flooded — but the iceberg opened at least five. The lesson encoded in the term: bulkheads do not prevent failure; they bound the volume of water that one failure admits. In a service, the bulkhead does not prevent a downstream from being slow; it bounds the number of caller threads that one slow downstream can capture. The metaphor is so apt that Michael Nygard's Release It! (2007) used it to name the pattern, and the name has stuck through three generations of resilience libraries (Hystrix, Resilience4j, Sentinel).

MealRush's promo-engine bulkhead saved Diwali — the post-incident analysis

MealRush's cart service calls 11 downstreams during checkout. On Diwali 2025 at 19:42 IST — peak order time — the promo-engine's Postgres replica failed over to a stale replica. Latency on applyPromo walked from 18 ms to 4.1 s. Without a bulkhead, the cart pool (350 threads) would have drained inside 11 seconds at 800 RPS. With a per-downstream semaphore bulkhead sized 50 for promo-engine, exactly 50 threads got stuck at any one time; the other 300 served addItem, getCart, priceItem calls normally. Users without an active promo saw no impact at all. Users with a promo saw "promo unavailable, cart total ₹{undiscounted}" instead of a 503 from cart. The promo-engine recovered in 2 minutes 14 seconds; total user-visible impact was 27,000 carts that ordered without their promo applied, mostly small discounts (₹15–₹50). MealRush's SRE incident review concluded the bulkhead "converted a P0 outage into a P3 partial degradation" — the same wording many incident reviews land on the first time they see a bulkhead do its job in production. Why the user got a degraded experience and not a hard failure: the call site for applyPromo was wrapped with try: bh.acquire(); applyPromo() except BulkheadFullError: skip_promo(). The bulkhead's fast-fail was caught and translated into "no promo this time" — a business decision the cart team had pre-agreed with marketing. The bulkhead handed control back to the caller in 50 microseconds; the fallback path did the rest. This is the pattern: bulkhead reports "no permit available", caller decides what graceful degradation looks like.
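The call-site pattern the incident review describes can be sketched end to end. Everything below is illustrative — apply_promo, skip_promo, and the ₹30 discount are stand-ins, not MealRush's code:

    # promo_fallback.py — bulkhead fast-fail translated into graceful degradation
    import threading
    from contextlib import contextmanager

    class BulkheadFullError(Exception):
        pass

    class Bulkhead:
        """Minimal inline version of the semaphore bulkhead from the article."""
        def __init__(self, limit):
            self.sem = threading.Semaphore(limit)

        @contextmanager
        def acquire(self):
            if not self.sem.acquire(blocking=False):
                raise BulkheadFullError()
            try:
                yield
            finally:
                self.sem.release()

    def apply_promo(cart):    # stand-in for the guarded downstream call
        return {"total": cart["total"] - 30, "promo": True}

    def skip_promo(cart):     # pre-agreed business fallback: undiscounted total
        return {"total": cart["total"], "promo": False}

    def price_cart(bh, cart):
        """A full bulkhead means 'no promo this time', never a 503 from cart."""
        try:
            with bh.acquire():
                return apply_promo(cart)
        except BulkheadFullError:
            return skip_promo(cart)

    bh = Bulkhead(limit=1)
    bh.sem.acquire()                          # simulate one in-flight slow promo call
    degraded = price_cart(bh, {"total": 250})
    bh.sem.release()
    normal = price_cart(bh, {"total": 250})
    print(degraded, normal)

The bulkhead only reports "no permit available"; price_cart owns the decision about what the degraded response looks like — exactly the division of labour the incident review describes.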

The relationship to backpressure and queue-depth limits

Bulkheads are a special case of the broader pattern called bounded concurrency or backpressure. Anywhere you have a producer that can outpace a consumer, you need a bounded buffer between them — the bound is what stops one producer from consuming the system's memory or threads. A bulkhead is just bounded concurrency at the caller→callee boundary. The same idea reappears as queue-depth limits at message brokers, max-in-flight at gRPC clients, and maxQueuedTasks at executor services. The reference point for this whole family of ideas is Brendan Gregg's USE method (chapter 50 in this curriculum will revisit it) — every resource needs a saturation metric and a saturation-rejection policy. Bulkheads are the in-process version of that idea applied to outbound calls.
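The same bound, one layer down, is any fixed-size buffer with a non-blocking put. A minimal sketch with the standard library:

    import queue

    # A producer that outpaces its consumer is rejected at the buffer instead
    # of growing it without limit; put_nowait is the fast-fail path.
    buf = queue.Queue(maxsize=3)
    accepted, rejected = 0, 0
    for item in range(10):            # burst of 10 items, nobody consuming
        try:
            buf.put_nowait(item)
            accepted += 1
        except queue.Full:
            rejected += 1
    print(accepted, rejected)         # 3 7 — rejections are the saturation signal

Whether the bound is a semaphore, an executor queue, or a broker's queue-depth limit, the rejection count is the metric to alert on.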

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# save bulkhead.py from the article body
python3 bulkhead.py
# Expected: ok≈100, fast-fail≈100. Increase the limit from 30 to 100 and re-run;
# the fast-fail count drops sharply because the bulkhead never saturates.
# Then reduce the slow-regime sleep from 2.0 to 0.1 and notice that fast-fails go
# to ~zero — the bulkhead only saturates when in-flight calls are slow.

Where this leads next

Bulkheads sit between the breaker and the timeout in the failure-isolation stack. Each layer guards a different mode:

  1. Timeout — guards against a single call that hangs: it bounds how long one call can hold a thread (or a bulkhead permit).
  2. Circuit breaker — guards against a callee that keeps failing: once the evidence is in, it stops sending calls at all.
  3. Bulkhead — guards against too many calls in flight at once: it caps concurrency before any evidence of failure exists.

The composition that keeps services up is bulkhead → breaker → retry → timeout → RPC. Take the bulkhead out and the breaker is too slow; take the breaker out and the bulkhead saturates forever; take the timeout out and the bulkhead's permits are never released. Three layers, three failure modes, one composed defence.
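The composition order can be made concrete. A toy sketch — the three classes below are illustrative stand-ins, not any library's API; the breaker in particular is reduced to a consecutive-failure counter:

    # composed_call.py — bulkhead first, breaker second, timeout inside
    import threading
    from contextlib import contextmanager

    class FastFail(Exception):
        pass

    class Bulkhead:
        """Layer 1 — caps in-flight calls; an O(1) check, no evidence needed."""
        def __init__(self, limit):
            self.sem = threading.Semaphore(limit)

        @contextmanager
        def slot(self):
            if not self.sem.acquire(blocking=False):
                raise FastFail("bulkhead full")
            try:
                yield
            finally:
                self.sem.release()

    class Breaker:
        """Layer 2 — a toy stand-in that opens after N consecutive failures."""
        def __init__(self, max_failures):
            self.failures, self.max_failures = 0, max_failures

        def check(self):
            if self.failures >= self.max_failures:
                raise FastFail("breaker open")

        def record(self, ok):
            self.failures = 0 if ok else self.failures + 1

    def guarded_call(bh, br, rpc, timeout_s):
        with bh.slot():                        # bulkhead: is there a permit right now?
            br.check()                         # breaker: have we given up on this callee?
            try:
                result = rpc(timeout=timeout_s)  # timeout bounds how long the permit is held
                br.record(ok=True)
                return result
            except Exception:
                br.record(ok=False)
                raise

    bh, br = Bulkhead(limit=10), Breaker(max_failures=3)
    print(guarded_call(bh, br, lambda timeout: "ok", timeout_s=0.5))   # ok

The ordering is the point: the bulkhead check is free and evidence-less, so it runs first; the breaker consults accumulated evidence; the timeout guarantees that every permit the bulkhead hands out eventually comes back.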

References

  1. Michael Nygard, Release It! (Pragmatic Bookshelf, 2nd ed. 2018) — the book that introduced the bulkhead-as-naval-metaphor pattern in the Stability Patterns chapter.
  2. Resilience4j — Bulkhead module — the JVM successor to Hystrix; explains semaphore vs thread-pool isolation in detail.
  3. Alibaba Sentinel — Concurrency limiting / Isolation — Sentinel's concurrency-control rules, which are bulkheads under a different name.
  4. Netflix Hystrix Wiki — How it Works (Bulkhead Pattern) — original thread-pool-per-command rationale and the case for mandatory isolation.
  5. Marc Brooker, "Caution: Decreasing Returns Ahead" — AWS Builder's Library — the broader case for bounded concurrency at scale.
  6. Little's Law — John D.C. Little, "A Proof for the Queuing Formula L = λW" (1961) — the closed form behind bulkhead sizing.
  7. Circuit breakers (Hystrix, Sentinel) — the previous chapter; the breaker that the bulkhead complements.
  8. Brendan Gregg, "The USE Method" — saturation as a first-class metric; bulkhead rejections are the saturation signal for outbound calls.