What Netflix learned about load shedding
At 21:42 IST during the 2023 Cricket World Cup final, the Hotstar fan-out cluster received 38 million concurrent viewer-state updates per second. The provisioned capacity was 22 million. For the next 90 seconds, every API call took longer than the one before it, p99 walked from 180 ms to 14 seconds, and the on-call engineer watched the cluster do something worse than refuse work — it accepted every request, queued them, executed each slowly while clients had already given up, and produced exactly zero useful output for the eleven minutes it took to drain. The fix that shipped two weeks later was load shedding, copied almost verbatim from the playbook Netflix published in 2017. The lesson, which Netflix learned the hard way between 2012 and 2018, is that a saturated system that refuses no work is a system that produces no work, and the only honest response to overload is to drop traffic on purpose at the edge.
Load shedding is not failure handling; it is capacity management. When offered load exceeds capacity, queueing theory guarantees that latency grows without bound, and it grows quickly, because every accepted request makes the queue worse. Netflix's contribution to the discipline is the framework for which requests to drop: priority bands tied to user-visible value, adaptive thresholds tied to measured concurrency, and edge-side rejection so the work never reaches the bottleneck. Every Indian platform that has tried to scale past its capacity has either rebuilt this framework or paid the cost of not having it.
Why a saturated queue makes everything worse, not better
The standard mental model — "if traffic exceeds capacity, requests just take a bit longer" — is wrong. A saturated single-server queue is not "a bit slower"; it is unboundedly slower, and the math is in §8 of the systems-performance curriculum (the queueing chapter, /wiki/m-m-1-and-why-utilization-80-hurts). The relevant fact for this chapter is that as offered load ρ = λ/μ approaches 1, the mean response time W = 1/(μ - λ) diverges. At ρ = 0.99 the mean response time is 100× the service time. At ρ = 1.001, the queue grows without bound and never recovers — the only way to drain it is to reduce the offered load below capacity, not by adding more servers (which is impossible mid-incident) but by refusing work.
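A quick numeric check of that divergence, using nothing but the W = 1/(μ - λ) formula above; the 10 ms service time is an arbitrary illustration, not a number from any real system.
# mm1_divergence.py
# Mean response time W = 1/(mu - lambda) for an M/M/1 queue,
# printed as a multiple of the no-load service time 1/mu.
service_time_ms = 10.0                 # illustrative no-load service time (1/mu)
mu = 1.0 / service_time_ms             # service rate per millisecond
for rho in (0.50, 0.80, 0.90, 0.95, 0.99, 0.999):
    lam = rho * mu                     # arrival rate for this utilisation
    w_ms = 1.0 / (mu - lam)            # mean time in system
    print(f"rho={rho:5.3f}  W={w_ms:10.1f} ms  ({w_ms / service_time_ms:7.1f}x service time)")
At ρ = 0.99 the multiplier is 100×, matching the claim above; past 1 the formula no longer applies at all, because there is no steady state to compute.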
When a real system saturates, three pathologies stack on top of the queueing math:
Wasted work. The client that issued the request three seconds ago has already retried, given up, or shown an error to the user. The server, oblivious, is still processing the original request. Every CPU cycle spent on a request whose result will be discarded is a cycle stolen from a request that might still be useful. At 80% wasted work — which is what unmanaged overload converges to — the server's effective throughput is 20% of its capacity, even though every core is at 100%. This is the classic Netflix observation: under unmanaged overload, useful goodput collapses while CPU stays pegged.
Queue-induced amplification. Clients see slow responses and retry. Each retry adds to offered load. Offered load increases service time. Higher service time produces more retries. The system's apparent capacity drops further with every retry-amplification cycle until either (a) the operator manually sheds traffic at a load balancer, or (b) the JVM dies of an OutOfMemoryError because the queue is now 12 million entries deep. The 2023 Hotstar incident in this chapter's lead is exactly this curve.
Resource starvation cascades. A saturated thread pool blocks downstream calls. Downstream calls hold connections. Connections hold sockets. Sockets exhaust the file-descriptor table. Now requests that would have succeeded at low load fail at the connection-accept stage, and the failure mode is opaque (accept returns EMFILE) rather than informative (a clean 503). Once you are in this regime, recovery requires draining all in-flight work, which requires not accepting new work, which is exactly load shedding — but applied via crash-and-restart rather than design.
Why goodput falls instead of plateauing under unmanaged overload: the queue contains both fresh requests (results still useful) and stale requests (clients have already given up). The server cannot tell them apart from inside the work, so it processes them in arrival order, completing stale work before fresh work. As offered load grows, the queue grows, the average request's age at completion grows, and the fraction of stale work approaches 1. The math is goodput = throughput × (1 - staleness_fraction). When staleness approaches 1, goodput approaches 0 even as throughput stays at capacity. This is the production failure mode that "just add more servers" cannot fix in real time — by the time the new servers boot, the queue is bigger than the additional capacity.
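A small sketch of that arithmetic, assuming FIFO service at a fixed capacity and a hypothetical 1.5 s client timeout: the deeper the queue, the larger the fraction of served-but-stale work, and goodput collapses while throughput stays pinned at capacity.
# goodput_vs_queue_depth.py
# A FIFO request at queue position n waits roughly n/capacity seconds.
# Requests whose wait exceeds the client timeout are stale: served but useless.
capacity_qps = 100
client_timeout_s = 1.5
for queue_depth in (50, 150, 500, 2000, 10000):
    # only the positions served within the timeout window produce goodput
    fresh_positions = min(queue_depth, int(capacity_qps * client_timeout_s))
    staleness = 1 - fresh_positions / queue_depth
    goodput_qps = capacity_qps * (1 - staleness)   # goodput = throughput * (1 - staleness)
    print(f"queue={queue_depth:6d}  staleness={staleness:5.1%}  goodput={goodput_qps:6.1f} qps")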
Adaptive concurrency limits — the core Netflix mechanism
The 2017 Netflix paper on adaptive concurrency limits is the practical core of the load-shedding playbook. The mechanism is a feedback control loop that adjusts the maximum number of concurrent in-flight requests a service will accept, based on continuously measured response time. The algorithm — a variant of TCP Vegas's congestion control, adapted for application-level concurrency — works because it discovers the system's actual capacity at runtime rather than relying on a configured constant.
The control loop has three states. Probe up — when latency is below the no-load minimum plus a small margin, increase the concurrency limit by 1. Probe down — when latency exceeds the minimum by more than 2×, decrease the concurrency limit by 1. Hold — when latency is in the safe band, keep the limit. The minimum is itself measured: it is the lowest latency observed over the last few seconds, capturing the system's no-queue service time.
# adaptive_concurrency.py
# A simplified Python model in the spirit of Netflix's adaptive concurrency limit
# (Gradient2), the algorithm family that ships in the open-source `concurrency-limits` library.
# Run: python3 adaptive_concurrency.py
import time, random, threading, statistics
from collections import deque

class Gradient2Limiter:
    """Adaptive concurrency limit, variant of TCP Vegas for app-layer concurrency."""

    def __init__(self, initial=10, min_limit=1, max_limit=200):
        self.limit = initial                    # current max in-flight requests
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.in_flight = 0
        self.lock = threading.Lock()
        self.recent_min_rtt = float("inf")      # measured no-load RTT
        self.rtt_window = deque(maxlen=100)     # rolling sample
        self.shed_count = 0
        self.accepted = 0

    def try_acquire(self):
        with self.lock:
            if self.in_flight >= self.limit:
                self.shed_count += 1
                return False                    # SHED: refuse at the edge
            self.in_flight += 1
            self.accepted += 1
            return True

    def release(self, rtt_ms: float):
        with self.lock:
            self.in_flight -= 1
            self.rtt_window.append(rtt_ms)
            self.recent_min_rtt = min(self.recent_min_rtt, rtt_ms)
            if len(self.rtt_window) < 20:
                return
            current_p50 = statistics.median(self.rtt_window)
            gradient = self.recent_min_rtt / max(current_p50, 0.01)
            # gradient near 1.0 = no queueing; near 0 = saturated
            if gradient > 0.9 and self.in_flight >= self.limit * 0.9:
                self.limit = min(self.limit + 1, self.max_limit)          # probe up
            elif gradient < 0.5:
                self.limit = max(int(self.limit * 0.9), self.min_limit)   # back off

# Synthetic workload — service time grows linearly with concurrency (queueing)
def serve(concurrency_now: int) -> float:
    base_ms = 5.0                                     # no-load service time
    queueing_ms = max(0, concurrency_now - 30) * 1.8  # capacity = ~30
    jitter = random.gauss(0, 0.5)
    time.sleep((base_ms + queueing_ms + jitter) / 1000.0)
    return base_ms + queueing_ms

if __name__ == "__main__":
    lim = Gradient2Limiter(initial=10, max_limit=80)
    workers = []

    def attempt():
        for _ in range(50):
            if not lim.try_acquire():
                continue
            start = time.perf_counter()
            serve(lim.in_flight)
            rtt = (time.perf_counter() - start) * 1000
            lim.release(rtt)

    for _ in range(120):
        t = threading.Thread(target=attempt); t.start(); workers.append(t)
    for t in workers:
        t.join()

    print(f"limit settled at: {lim.limit}")
    print(f"accepted: {lim.accepted}, shed: {lim.shed_count}")
    print(f"min RTT seen: {lim.recent_min_rtt:.1f} ms")
Sample run on a laptop:
$ python3 adaptive_concurrency.py
limit settled at: 32
accepted: 1847
shed: 4153
min RTT seen: 5.1 ms
Walk through the lines that carry the design:
- `if self.in_flight >= self.limit: ... return False` — this is the shed decision, made at the edge before any work begins. The cost of shedding is the cost of incrementing a counter and returning. There is no thread allocation, no I/O, no parsing of the request body. Compare this to the cost of accepting a request that will time out: a thread, a connection, a database transaction, a memory allocation, all wasted.
- `gradient = self.recent_min_rtt / max(current_p50, 0.01)` — the heart of the control loop. The minimum RTT is the system's no-load service time; the current p50 is what requests are paying right now. Their ratio is the queue-fullness signal. Why this signal works where CPU% does not: CPU utilisation is a lagging indicator that conflates several effects (background work, GC, hyperthread contention) and saturates at 100% long before queue depth becomes pathological. The min/p50 ratio is a direct measurement of "how much extra time is being spent waiting" and it stays informative all the way to saturation. A 5 ms min and a 50 ms p50 mean 90% of the time is queueing — the limit must come down. CPU% would just say "100%" in both the healthy and saturated regimes.
- `if gradient > 0.9 and self.in_flight >= self.limit * 0.9:` — the probe-up condition. Increase the limit only when (a) latency is healthy and (b) you are actually hitting the limit. The second condition is critical: if no requests are arriving, increasing the limit is not exploring capacity, it is just changing a number that does not bind.
- `self.limit = max(int(self.limit * 0.9), self.min_limit)` — multiplicative backoff under congestion, additive growth in the healthy state. This is the AIMD pattern from TCP, applied to concurrency. Why multiplicative backoff and not additive: when latency is exploding, additive backoff (decrement by 1) is too slow — the queue grows faster than the limit shrinks. Multiplicative backoff (decrement by 10%) brings the system back to the safe operating range in a few control-loop iterations, even from deep saturation. The asymmetry between additive growth and multiplicative shrink is what gives AIMD its stability properties.
The Gradient2 algorithm in production at Netflix runs on every Zuul edge proxy instance and on every internal service-to-service call mediated by Hystrix's successor (Resilience4j). The limit each instance discovers is independent — different instances on different hosts may settle at different limits, reflecting per-host variation (noisy neighbour, NUMA placement, GC pause patterns). The aggregate behaviour is that the fleet collectively probes the cluster's capacity in real time, with no operator input, and sheds excess at the edge before it touches the bottleneck.
Priority bands — when you must shed, drop the right traffic
Adaptive concurrency tells you how much to shed; priority bands tell you which requests to shed first. The Netflix framework partitions traffic into three or four bands by user-visible value, and shedding starts from the lowest-priority band and works upward only when the lower bands are fully shed. The taxonomy that ships in the concurrency-limits library is roughly:
- Band 1 — critical path. Requests on the user's hot path: video playback URL fetch, authentication, the click-play moment. Dropping these breaks the user-facing product. Shed last.
- Band 2 — engagement features. Recommendations, watchlist updates, "continue watching" lookups. Dropping these makes the product feel hollow but not broken. Shed second.
- Band 3 — analytics and background. Telemetry, recommendation-model feedback, A/B test events. Dropping these is invisible to the user in the short term and only matters at aggregate-statistics scale. Shed first.
- Band 4 — speculative. Prefetch, warm-cache fills, cosmetic UI updates. Dropping these is invisible. Shed even before band 3 if needed.
The principle that makes priority bands work is that the user does not perceive the absence of work they did not ask for. A user who clicked play wants the video to start; if the recommendation panel for "what to watch after this" silently fails to load and the player works, the user does not notice. If the player fails and the recommendation panel renders perfectly, the user has a broken product. The total system value comes overwhelmingly from band 1; shedding bands 2-4 to protect band 1 is a near-Pareto improvement in user experience under overload.
The Indian streaming context is the same. During the IPL final, Hotstar's critical path is "tap-to-play returns a manifest URL" — the part of the architecture that sees 38 million requests per second during a wicket. Everything else — the chat sidebar, the live-stat overlay, the highlight-clip generator, the friend-watch-party invite endpoint — is band 2 or 3. The 2023 incident in this chapter's lead happened because all bands shared the same connection pool and were queued together. The 2024 fix moved chat and stats to their own service tier with their own concurrency limit, so saturation in the analytics path could not starve the play path. p99 on tap-to-play during the 2024 final was 240 ms even at 41M concurrent viewers, against an SLO of 800 ms. The chat sidebar lagged by 40 seconds at peak and recovered after the wicket; users did not notice or care.
A subtle production rule: priority bands must be enforced at the API gateway, not inside individual services. If service A serves both band 1 and band 3 traffic on the same thread pool, a band 3 request can hold a thread that a band 1 request needs — even if the gateway intends to shed bands 3 and below first, the actual contention is downstream. The Netflix Zuul gateway tags every request with its band at the edge and propagates the tag through the call graph; downstream services use the tag to choose thread pools, queue priorities, and timeout budgets. This is operationally heavy — every service must know about bands — but the alternative is the 2023-Hotstar pathology where one band's saturation spreads to all bands.
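A minimal sketch of what band-aware admission can look like, with hypothetical per-band utilisation ceilings. This is the shape of the decision, not the concurrency-limits API; thread safety and the adaptive limit itself are omitted.
# band_aware_shedding.py (illustrative sketch)
# Lower-priority bands are refused at lower utilisation, so band 1 keeps headroom.
# The ceilings are made up for illustration, not any real gateway's values.
BAND_ADMIT_BELOW = {1: 1.00, 2: 0.90, 3: 0.70, 4: 0.50}   # utilisation ceiling per band

class BandAwareLimiter:
    def __init__(self, limit: int):
        self.limit = limit          # total in-flight ceiling (static here; adaptive in practice)
        self.in_flight = 0

    def try_acquire(self, band: int) -> bool:
        if self.in_flight / self.limit >= BAND_ADMIT_BELOW[band]:
            return False            # shed: this band's share of the headroom is gone
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

lim = BandAwareLimiter(limit=100)
lim.in_flight = 75                  # pretend the service is at 75% utilisation
for band in (4, 3, 2, 1):
    print(f"band {band}: {'admit' if lim.try_acquire(band) else 'shed'}")
At 75% utilisation, bands 4 and 3 are refused while bands 2 and 1 are admitted; the shedding order falls out of the thresholds rather than from any per-request decision.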
Token buckets, FIFO queues, and the "drop oldest first" trick
Adaptive concurrency limits and priority bands handle the steady-state shape of overload. The third Netflix mechanism is for the burst case — a sudden 10× spike that exceeds even the priority-aware limits. Two tricks make the burst handling tractable.
The first is token bucket rate limiting at the edge. Each gateway instance tracks the rate of accepted requests in a sliding window and refuses new requests when the rate exceeds a configured ceiling. This is a hard limit, not adaptive — it exists to prevent the burst from reaching the adaptive limiter at all. The token-bucket capacity is sized to the service's measured no-load capacity multiplied by a safety factor (typically 1.2-1.5). The ceiling is a deliberately blunt instrument; its job is to cap arrival rate when the adaptive system has not yet had time to react.
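A minimal token-bucket sketch under those assumptions; the rate and burst numbers are illustrative, sized as measured capacity times the safety factor rather than values from any production config.
# edge_token_bucket.py (illustrative sketch)
# Tokens refill at a fixed rate; a request is admitted only if a token is
# available. This caps arrival rate before the adaptive limiter sees the burst.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s        # steady-state ceiling (capacity * safety factor)
        self.capacity = burst         # how much instantaneous burst is tolerated
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1.0:
            return False              # hard shed at the edge
        self.tokens -= 1.0
        return True

bucket = TokenBucket(rate_per_s=1000 * 1.4, burst=200)       # 1.4x over a measured 1000 qps
admitted = sum(bucket.try_acquire() for _ in range(5000))    # an instantaneous 5000-request burst
print(f"admitted {admitted} of 5000 burst requests")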
The second, more counter-intuitive trick is drop oldest first. The naive FIFO queue policy is "drop new arrivals when full". The Netflix queue policy is "drop the oldest queued entry when the queue is full". The reasoning: an entry that has been queued for 8 seconds has very likely already been retried by the client, while a freshly arrived entry has a high probability of still being useful. Dropping the oldest entry preserves the freshest, most-likely-useful work; dropping the newest discards the work whose result the client most wants. Why drop-oldest beats drop-newest in the saturated regime: the value of completing a request decays with its time-in-queue, because clients have finite timeout budgets. A request 8 seconds old is past most clients' timeout; completing it produces no goodput. A request 200 ms old is well within timeout; completing it produces useful goodput. Drop-oldest concentrates the server's work on the fresh-and-useful subset; drop-newest does the opposite. The goodput improvement from this single switch is typically 30-60% under saturation.
# drop_oldest_queue.py
# A bounded FIFO queue with drop-oldest-when-full policy.
# Simulates how this changes goodput vs the naive drop-newest policy.
import collections, time, random

class DropOldestQueue:
    def __init__(self, maxlen, ttl_ms=2000):
        self.q = collections.deque()
        self.maxlen = maxlen
        self.ttl_ms = ttl_ms

    def __len__(self):
        # so truthiness ("if ... and q") means "non-empty", the same as a plain deque
        return len(self.q)

    def offer(self, request):
        now = time.time() * 1000
        request["arrived_at_ms"] = now
        # Evict timed-out entries from the head
        while self.q and (now - self.q[0]["arrived_at_ms"]) > self.ttl_ms:
            self.q.popleft()
        if len(self.q) >= self.maxlen:
            self.q.popleft()                # DROP OLDEST, accept new
        self.q.append(request)
        return True

    def take(self):
        if not self.q:
            return None
        return self.q.popleft()

# Simulate goodput under saturation
def simulate(policy_drop_oldest: bool, offered_qps=200, capacity_qps=100, duration_s=5):
    if policy_drop_oldest:
        q = DropOldestQueue(maxlen=50)
    else:
        q = collections.deque()
    accepted, served, useful = 0, 0, 0
    interval_in = 1.0 / offered_qps
    interval_out = 1.0 / capacity_qps
    t_start = time.time()
    next_in = t_start; next_out = t_start
    client_timeout_ms = 1500
    now = t_start
    while now - t_start < duration_s:
        now = time.time()
        if now >= next_in:
            req = {"id": accepted}
            if policy_drop_oldest:
                q.offer(req); accepted += 1
            else:
                if len(q) < 50:
                    req["arrived_at_ms"] = now * 1000
                    q.append(req); accepted += 1
            next_in = now + interval_in
        if now >= next_out and q:
            req = q.popleft() if not policy_drop_oldest else q.take()
            if req:
                age_ms = now * 1000 - req["arrived_at_ms"]
                served += 1
                if age_ms < client_timeout_ms:
                    useful += 1
            next_out = now + interval_out
        time.sleep(0.0001)
    return accepted, served, useful

if __name__ == "__main__":
    for policy_name, policy in [("drop-newest (naive)", False), ("drop-oldest", True)]:
        a, s, u = simulate(policy)
        print(f"{policy_name:25s} accepted={a:5d} served={s:5d} useful={u:5d} "
              f"goodput={u/max(s,1)*100:.0f}%")
Sample run:
$ python3 drop_oldest_queue.py
drop-newest (naive) accepted= 500 served= 500 useful= 142 goodput=28%
drop-oldest accepted= 998 served= 500 useful= 486 goodput=97%
The drop-oldest policy serves the same 500 requests (capacity is the same) but 97% of them are within the client's 1500 ms timeout, versus 28% with drop-newest. The accepted count is higher with drop-oldest because it never refuses on overflow; it always evicts the stale head and admits the new tail. The goodput — useful work delivered — is 3.4× higher.
The cost of drop-oldest is that the queue occasionally does enqueue work that is immediately thrown away (the request entered the queue, sat for a moment, and was evicted before being served). This is wasted enqueue work, but it costs a few microseconds (a deque append followed by a deque popleft). The benefit — 3.4× goodput under saturation — is worth orders of magnitude more than the cost.
Token buckets and drop-oldest queues are independently useful; together they form the burst-handling layer that catches what adaptive concurrency cannot react to fast enough. Razorpay's payment-gateway API uses token buckets at the edge with a 1.4× safety factor over measured capacity, and drop-oldest queues on every internal queue with a 2-second TTL. During the 2024 Diwali sale, the API peaked at 4.2× normal QPS for 22 minutes; the system shed 38% of requests at the edge, but goodput stayed at 96% of the accepted-request count, and not a single payment-confirmation request (band 1) was dropped.
How to deploy load shedding without making things worse
Every load-shedding mechanism has a failure mode where it makes things worse than the unmanaged baseline. The Netflix postmortems between 2014 and 2018 catalogue several, and the discipline that emerged from them is mostly about avoiding these.
The instrumentation lag trap. The adaptive concurrency limiter relies on RTT measurement; if the RTT signal lags real conditions by 2-3 seconds, the limiter responds 2-3 seconds late. In a fast burst (Hotstar's wicket-spike, Razorpay's UPI peak at festival timing), the entire burst can pass through before the limiter reacts. The fix is a short measurement window (Netflix uses 250 ms) and an exponentially weighted moving average that weights recent samples heavily. The cost is more noise in the limit; the benefit is responsiveness when it matters.
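For illustration, a recency-weighted RTT signal of the kind that keeps the lag small; the alpha value and the sample stream are made up for the example.
# rtt_ewma.py (illustrative sketch)
# An EWMA that weights recent samples heavily, so a burst shows up in the
# signal within a few samples instead of a few seconds.
class RttSignal:
    def __init__(self, alpha: float = 0.3):   # weight on the newest sample (illustrative)
        self.alpha = alpha
        self.value = None

    def observe(self, rtt_ms: float) -> float:
        if self.value is None:
            self.value = rtt_ms
        else:
            self.value = self.alpha * rtt_ms + (1 - self.alpha) * self.value
        return self.value

sig = RttSignal()
for rtt in [5, 5, 5, 5, 50, 55, 60, 60]:      # a latency burst arrives mid-stream
    print(f"sample={rtt:3d} ms  ewma={sig.observe(rtt):5.1f} ms")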
The cascading-shed trap. Service A sheds traffic to service B; service B's clients see errors and retry; the retries appear at service A's edge as new offered load; service A sheds more aggressively; a self-reinforcing shed cascade reduces capacity faster than load is dropping. The fix is shed-with-context — when shedding, return a structured 429 with a Retry-After header that incorporates the current load, and have clients respect it. Naive clients that retry immediately on any 5xx are the source of the cascade.
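A sketch of what shed-with-context can look like. The handler shape and the backoff formula here are hypothetical; the point is only that the Retry-After value scales with measured overload and carries jitter so retries do not resynchronise.
# shed_with_context.py (illustrative sketch)
# Return a structured 429 whose Retry-After grows with how far past capacity
# the service currently is, plus jitter to spread out the retries.
import json, random

def shed_response(in_flight: int, limit: int) -> tuple[int, dict, str]:
    overload = max(in_flight / limit - 1.0, 0.0)        # how far past capacity we are
    retry_after_s = min(1 + int(overload * 10), 30)      # crude mapping, illustrative
    retry_after_s += random.randint(0, 2)                # jitter against synchronized retries
    headers = {"Retry-After": str(retry_after_s)}
    body = json.dumps({"error": "overloaded", "retry_after_s": retry_after_s})
    return 429, headers, body

print(shed_response(in_flight=180, limit=100))            # well past capacity: long backoff
print(shed_response(in_flight=105, limit=100))            # just past capacity: short backoff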
The hot-key-amplification trap. The shed decision is per-request, but real workloads have hot keys (a celebrity's video, a flash-sale product page) that produce disproportionate load. If the shedding is uniform across keys, the hot key consumes its entire fair share and the long tail starves. The fix is per-key shed budgets — keep a sliding-window count of how much of each key's traffic has been admitted recently, and shed when a single key dominates. This is the technique that protects Hotstar's catalogue API when one product (the IPL final) is 80% of all traffic.
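A sketch of a per-key shed budget over a sliding window of recent admissions; the window size and the 40% share are illustrative, not values from any real catalogue API.
# per_key_budget.py (illustrative sketch)
# Keep a sliding count of recent admissions per key; once any single key
# exceeds its share of the window, that key is shed first.
from collections import Counter, deque

class PerKeyBudget:
    def __init__(self, window: int = 1000, max_share: float = 0.4):
        self.recent = deque(maxlen=window)    # keys of the last `window` admissions
        self.counts = Counter()
        self.max_share = max_share

    def try_admit(self, key: str) -> bool:
        window = len(self.recent) or 1
        if window >= 100 and self.counts[key] / window >= self.max_share:
            return False                      # this key already dominates the window
        if len(self.recent) == self.recent.maxlen:
            self.counts[self.recent[0]] -= 1  # the append below evicts the oldest entry
        self.recent.append(key)
        self.counts[key] += 1
        return True

budget = PerKeyBudget()
admitted = Counter()
for i in range(2000):
    key = "hot-key" if i % 10 < 8 else f"tail-{i % 37}"   # hot key is 80% of offered traffic
    if budget.try_admit(key):
        admitted[key] += 1
tail_total = sum(v for k, v in admitted.items() if k != "hot-key")
print(f"hot key admitted: {admitted['hot-key']}, long tail admitted: {tail_total}")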
The hidden-quota trap. Load shedding at one tier can shift the bottleneck to another tier whose quota you forgot about. The classic pattern: edge sheds to protect the API service; the API service's accepted requests still saturate the database connection pool; now you have full edge throughput but database errors at the API. The fix is end-to-end quota — every tier has its own concurrency limit, and the lowest limit binds. Tracking which tier's limit is binding becomes a dashboard discipline; the limiting tier should change throughout the day as traffic shape shifts.
The graceful-degradation trap. A team builds load shedding with a fallback ("if the recommendation service is shedding, render an empty panel"). The fallback path is rarely tested. When real shedding happens, the fallback path itself fails — the empty-panel renderer crashes on the null response. The fix is fault injection in production — every week, the team deliberately sheds 5% of traffic on a Tuesday afternoon and confirms the fallbacks render correctly. Netflix's Chaos Monkey is the famous version; smaller teams can run a cron job that returns 503 for 2 minutes a day.
The discipline that ties all five together is: don't build load shedding without tracking what it lets you ship. The point of shedding is not to avoid crashes; it is to maintain user-visible value under overload. If the shed traffic is the wrong traffic, or the shed decision is too late, or the shed-and-retry loop produces a cascade, the shedding is producing the same outcome as no shedding — just slower. The metric that matters is goodput under overload, not availability. Availability says "the service responded with something"; goodput says "the user got what they asked for". Measuring goodput rather than availability is a harder cultural shift than the technical implementation itself. Razorpay's 2024 internal SRE manual makes this explicit: every overload incident's postmortem must include the goodput number, not just the availability number, and the corrective action must improve the next incident's goodput, not just its availability.
Common confusions
- "Load shedding is the same as rate limiting" Rate limiting is a configured static cap (`max 1000 req/s per API key`) that exists for billing, fairness, or protection from abusive clients. Load shedding is a dynamic response to measured saturation, applied to all traffic regardless of who sent it. Rate limiting fires before saturation; load shedding fires at the boundary of saturation. A well-designed system has both — rate limiting protects against malicious peaks, load shedding protects against legitimate overload. They share implementation primitives (token buckets) but have different control loops.
- "Adding more servers solves the load problem" It does, eventually — but provisioning takes minutes (autoscaling), tens of minutes (manual), or longer (capacity planning lag). Bursts last seconds. The latency of adding capacity is much longer than the duration of the spike that triggered the need. Load shedding is the mechanism that keeps the system useful during the gap between "we noticed we need capacity" and "the new capacity is online". A team that relies only on autoscaling is a team that drops every spike.
- "You should always return 503 when shedding" The right response code is 429 (Too Many Requests) for client-driven shedding and 503 (Service Unavailable) for server-driven shedding. The distinction matters because well-behaved clients treat 429 as "try again with backoff" and 503 as "the service is down, escalate". Returning 503 for routine shedding teaches the client to escalate every spike, which produces user-facing alerts where there should be invisible recovery.
- "Load shedding is only for stateless services" Stateful services need it more, not less. A database connection pool is a finite resource; once exhausted, every additional request waits indefinitely. The shed decision for a stateful service is "do not consume this connection unless we will complete the work in time"; the cost of getting it wrong (connection leak, deadlock, transaction rollback) is higher than for a stateless tier. Postgres's connection limits and PgBouncer's transaction-pool semantics are both load-shedding primitives, applied at the database tier.
- "Adaptive concurrency limits replace static limits" They complement, not replace. The static limit is your absolute ceiling — the value you would never let concurrency exceed regardless of measured RTT, because beyond it some other resource (memory, file descriptors, downstream-service quota) breaks. The adaptive limit floats below the static ceiling and discovers the real-time capacity. Both are needed — the static limit is the safety net; the adaptive limit is the day-to-day optimisation.
- "Drop-oldest is the same as LIFO scheduling" Drop-oldest evicts the head of a FIFO queue when full; the queue itself is still FIFO. LIFO scheduling pulls work from the tail of the queue, leaving the head to age unboundedly. They sound similar but have opposite effects: drop-oldest preserves freshness by evicting stale work; LIFO scheduling produces stale work by ignoring the head. Drop-oldest is what you want; LIFO scheduling is a different (and usually wrong) choice.
Going deeper
The control-theory framing — load shedding as PID control
The Gradient2 algorithm is a proportional-only controller; it adjusts the limit based on the current gradient measurement, with no integral or derivative term. This works because the system's response to a limit change is fast (sub-second) and the workload is high-volume, so the noise averages out. For systems with slow response (a queue that takes minutes to drain after a limit change) or low volume (a few requests per second where each measurement is noisy), a full PID controller can be worth the implementation cost. The integral term smooths over noise; the derivative term anticipates rapid changes. Netflix's library exposes a Vegas variant with explicit P and D terms; teams running it on highly bursty workloads enable both. The control-theory literature on TCP congestion control ([Jacobson & Karels 1988] and the subsequent work on Reno, Vegas, BBR) is directly applicable — the math that governs a TCP sender's congestion window also governs an application's concurrency limit, because both are cases of "discover capacity through measured response".
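For intuition, a toy PID-style limit controller driven by the gradient signal from earlier in the chapter; the gains and the gradient trace are illustrative, and this is not the library's implementation.
# pid_limit_controller.py (illustrative sketch)
# Treat (measured gradient - target) as the error and adjust the concurrency
# limit with P, I, and D terms; the P-only case is roughly the Gradient2 model above.
class PidLimiter:
    def __init__(self, limit: float, kp: float = 20.0, ki: float = 0.5, kd: float = 5.0):
        self.limit, self.kp, self.ki, self.kd = limit, kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, gradient: float, target: float = 0.9) -> int:
        error = gradient - target            # positive: headroom; negative: queueing
        self.integral += error               # smooths over measurement noise
        derivative = error - self.prev_error # anticipates rapid change
        self.prev_error = error
        self.limit += self.kp * error + self.ki * self.integral + self.kd * derivative
        self.limit = max(1.0, min(self.limit, 1000.0))
        return int(self.limit)

pid = PidLimiter(limit=50)
for g in [0.95, 0.92, 0.60, 0.45, 0.55, 0.80, 0.92]:   # gradient collapses, then recovers
    print(f"gradient={g:.2f}  limit={pid.update(g)}")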
Coordinated load shedding across a fleet
A single instance's adaptive limit is independent; in a fleet of 200 instances, each one probes capacity independently and they can disagree. This is fine in steady state, but during a coordinated event (a new feature launch that creates a global hot path) the fleet's collective shedding can be miscalibrated. Two patterns help. Push-based limit coordination: a central control plane periodically aggregates per-instance limits and recomputes a fleet-wide guidance value, which instances use as their starting point after restart. Sidecar-mediated shedding: every instance publishes its current limit and current load to a sidecar; the sidecar can shift load away from saturated peers via traffic routing. Both add complexity; the second is what Netflix runs in production. Smaller teams typically rely on per-instance independence and accept that the fleet's collective behaviour has 2-5% noise around the optimum.
Shedding under coordinated omission
A subtle bug in benchmarks of load-shedding systems is coordinated omission in the load generator. If the generator pauses when responses are slow (which plain wrk does; it is a closed-loop generator with no constant-rate mode), the burst that the system saw in production is never reproduced — the generator throttles itself in sympathy with the server. The result: the load-shedding mechanism appears to work in benchmarks because the generator is doing the operator's job. In production, with real clients that retry aggressively rather than pausing, the same shedding mechanism may be inadequate. The fix is to use wrk2 (open-loop, constant rate regardless of server response) or vegeta (rate-paced from the start). Every load-shedding benchmark this curriculum cares about must be validated with an open-loop generator and HdrHistogram-corrected percentiles. See /wiki/coordinated-omission-the-bug-in-your-benchmark for the underlying mechanism.
Why circuit breakers are not load shedders
Circuit breakers (the Hystrix-style primitive) detect downstream-service failure and stop sending requests; they are about upstream protection from downstream failures. Load shedders detect upstream-service saturation and refuse incoming requests; they are about self-protection from offered load. The two solve different problems and need both: a circuit breaker without load shedding can still be saturated by upstream traffic; a load shedder without circuit breakers will hold connections to dead downstream services until timeout. The Netflix architecture has both at every service boundary, with the circuit breaker fronting outgoing calls and the load shedder fronting incoming calls. They share data (the load shedder may use circuit-breaker open-state as an input to its limit decision) but have separate state machines and separate metrics. Teams that conflate them — "we have Hystrix, we don't need shedding" — end up with the unmanaged-overload pathology described in §1.
Reproduce this on your laptop
# Reproduce the goodput-collapse and drop-oldest experiments from this chapter.
sudo apt install python3-pip python3-venv
python3 -m venv .venv && source .venv/bin/activate
pip install simpy hdrhistogram   # optional extras; the two scripts below need only the standard library
# Adaptive concurrency limit
python3 adaptive_concurrency.py
# Expected: limit settles near 30, ~70% of requests shed at 4× offered load.
# Drop-oldest vs drop-newest goodput
python3 drop_oldest_queue.py
# Expected: drop-oldest produces 90%+ goodput; drop-newest 25-30%.
# Open-loop load test (the only honest way to validate shedding)
# wrk2 is typically built from source (github.com/giltene/wrk2), not installed from apt;
# its build produces a binary named `wrk` — rename or alias it to wrk2 to avoid confusion.
wrk2 -t4 -c200 -d30s -R 5000 --latency http://localhost:8080/api
# Expected: HdrHistogram output with p50/p99/p99.9; p99 stays bounded
# under the load shedder and explodes without it.
Run these on a laptop, then on whatever the team's actual production hardware is. The absolute numbers vary; the shape (goodput collapse without shedding, plateau with) is hardware-independent. The shape is the lesson.
Where this leads next
Part 16 — case studies — is the curriculum's synthesis section, and load shedding is one of its load-bearing themes. Every subsequent case study touches it from a different angle: how Twitter's caching migration relied on shed-aware degradation, how Stripe's idempotency layer interacts with shed-and-retry, how Cloudflare's edge shedding scales across geographies. The pattern is recurring because the underlying constraint — capacity is finite and offered load is unbounded — is universal.
The natural next reads are:
- /wiki/twitters-caching-migration — the next case study, where caching migration was the shed-mechanism for read-heavy overload.
- /wiki/m-m-1-and-why-utilization-80-hurts — the queueing-theory chapter that derives the 0.85-knee number every load shedder targets.
- /wiki/coordinated-omission-the-bug-in-your-benchmark — the benchmarking failure mode that hides shedding bugs.
- /wiki/the-single-threaded-redis-lesson — the previous case study, on synchronisation-cost trade-offs.
The chapter after this — Twitter's caching migration — picks up a complementary theme: what to do when even shedding-aware capacity is not enough, and the architecture itself must change to fit the load. The Netflix lesson is "shed gracefully when overloaded"; the Twitter lesson is "the architecture that creates the overload may be the wrong architecture". Both are about not pretending — Netflix's shedding does not pretend the cluster has more capacity than it does, and Twitter's migration did not pretend the old caching layer could survive the new traffic shape.
A final pointer for readers who want to push deeper on the discipline rather than the system: load shedding generalises beyond web services. CPU schedulers shed work via priority preemption; database query planners shed work via abort-and-retry under deadlock; storage systems shed work via backpressure on writers. Each of these is a different domain with the same arithmetic — capacity is finite, offered load is unbounded, and the only honest answer is to refuse work in priority order. The teams that internalise this principle ship systems that hold their behaviour under stress; the teams that do not ship systems that fail in surprising ways the first time real traffic shows up.
References
- Netflix Tech Blog — Performance Under Load (2017) — the original public write-up of Gradient2 and adaptive concurrency limits at Netflix.
- Netflix concurrency-limits library — the open-source implementation that codifies the algorithms in this chapter.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM, 2013) — the classic on tail-latency management, cited here for the "drop oldest" insight and hedged requests.
- Marc Brooker, "Caution: Decreased performance of EBS volumes under high load" (AWS re:Invent 2017) — the AWS view of load shedding at the storage layer; complementary to the API-layer view in this chapter.
- Gil Tene, "How NOT to Measure Latency" — coordinated omission in benchmarks of load-shedding systems.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Ch. 12 (Capacity Planning) — the queueing-theory and capacity-planning foundations the Netflix mechanisms operate on.
- /wiki/m-m-1-and-why-utilization-80-hurts — the queueing-theory chapter that derives why ρ approaching 1 is unrecoverable, the mathematical justification for shedding before saturation.
- /wiki/coordinated-omission-the-bug-in-your-benchmark — the benchmarking failure mode that masks load-shedding bugs in pre-production testing.