Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Load shedding strategies

At 22:47 IST on a Sunday in May, SetuStream's recommendation service crossed an offered load of 1.4M RPS — the final over of the IPL final, RCB needing 14 off the last over, every viewer's app simultaneously refreshing the related-matches strip. The provisioned capacity is 1.0M RPS at p99 = 200 ms. The autoscaler can add replicas, but pod cold-start is 95 seconds and the spike will be over in 40. Two paths exist. Path A: queue the 0.4M extra requests, watch every queue depth climb, watch p99 cross 8 seconds, watch every client retry, watch retry storms double the offered load to 2.8M, watch the entire fleet melt. Path B: refuse 0.4M requests immediately with HTTP 503, return cached fallback content client-side, keep the 1.0M served requests at p99 = 240 ms. Path A is the default of every system that has never thought about overload. Path B requires deciding, in advance, exactly which 0.4M to refuse and how — that decision is load shedding.

Load shedding is the deliberate act of refusing some requests during overload so that the remaining requests keep meeting their SLO. The alternative — accepting every request and queueing the excess — pushes utilisation past the queueing knee at ρ ≈ 0.85, where latency grows without bound and triggers retry storms that amplify the original overload. Effective shedding picks a priority signal (admission control by user tier, by request type, by deadline, by resource cost), enforces it at a load-aware enforcement point (CPU > 80%, queue depth > N, p99 > target), and signals refusal in a way clients respect (HTTP 503 + Retry-After, gRPC RESOURCE_EXHAUSTED, explicit backoff hints) so the shed traffic does not return as a retry storm. A system without load shedding is one autoscale lag away from cascading failure.

Why queueing the excess is the worst option

The intuition that says "just queue the extra requests, the spike will pass" is wrong in a specific, mathematically inevitable way. From queueing theory (/wiki/m-m-1-and-why-utilization-80-hurts), an M/M/1 queue's mean response time is T = 1 / (μ − λ): as arrival rate λ approaches service rate μ, response time goes to infinity. In terms of utilisation ρ = λ/μ, that is 1 / (1 − ρ) times the unloaded service time 1/μ. The knee of the curve is around ρ = 0.85. Past that point, every halving of the remaining headroom doubles the mean response time. At ρ = 0.95, the mean response time is 20× the unloaded value. At ρ = 0.99, it is 100×. Once ρ reaches 1.0 and beyond, queue depth keeps growing with time and never recovers without dropping load.
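The multipliers quoted above can be checked directly from the M/M/1 formula. This is a minimal sketch that assumes the M/M/1 model holds, which real services only approximate; the function name `mm1_slowdown` is illustrative.

```python
def mm1_slowdown(rho: float) -> float:
    """Mean response time relative to the unloaded service time 1/mu (M/M/1).

    T = 1 / (mu - lambda) = (1 / mu) * 1 / (1 - rho), so the multiplier is 1 / (1 - rho).
    """
    if not 0 <= rho < 1:
        raise ValueError("an M/M/1 queue is only stable for 0 <= rho < 1")
    return 1.0 / (1.0 - rho)


for rho in (0.50, 0.85, 0.95, 0.99):
    print(f"rho={rho:.2f}  mean response time = {mm1_slowdown(rho):6.1f}x unloaded")
```

The multiplier depends only on the remaining headroom 1 − ρ, which is why the last few percent of utilisation are so expensive.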

The real systems version is worse than the textbook version, for three reasons.

Retries amplify overload. When p99 climbs from 200 ms to 4 s during the spike, every client whose request takes more than its 800 ms timeout retries. A 3-attempt retry policy converts 1 user request into up to 3 backend requests. If each attempt has a 30% chance of timing out and being retried, the offered load multiplies by 1 + 0.3 + 0.09 = 1.39×. The amplified load makes the queue grow faster, which makes more requests time out, which generates more retries. This is the metastable failure pattern: a small overload becomes a self-sustaining one through retry feedback. The system that survives a 40-second spike with shedding fails for hours without it because the retry storm outlasts the original demand.
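The 1.39× figure is a truncated geometric series. A small sketch, assuming every attempt times out independently with the same probability (a simplification, since timeouts are correlated during real overload); `retry_amplification` is an illustrative name.

```python
def retry_amplification(timeout_fraction: float, max_attempts: int) -> float:
    """Expected backend requests generated per user request.

    Attempt k+1 is only made if the first k attempts all timed out,
    which happens with probability timeout_fraction ** k.
    """
    return sum(timeout_fraction ** k for k in range(max_attempts))


print(round(retry_amplification(0.3, 3), 2))  # 1 + 0.3 + 0.09
print(round(retry_amplification(1.0, 3), 2))  # worst case: every attempt times out
```

As the timeout fraction rises towards 1, the multiplier approaches the attempt limit, which is how a 3-attempt policy can triple the load on a service that is already saturated.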

Queueing burns the tail latency budget for no gain. A request that sits in the queue for 3.8 seconds before being processed is, from the user's perspective, indistinguishable from a request that was refused — the user has long since closed the app. But the backend spent CPU and connection slots on the request anyway. Queueing converts a refusal that costs zero into a "successful" response that costs full service time and is still useless. Shedding the request immediately frees that CPU and that connection slot to serve a request that the user will actually still be waiting for.

Memory pressure becomes the real failure. A queue is a buffer in RAM. At 200 KB per pending request (HTTP headers, parsed JSON body, response builder), a queue of 100K requests is 20 GB of RAM. The pod's memory limit is 16 GB. Long before latency becomes the visible symptom, the OOM killer reaps the pod, all queued requests fail, and the autoscaler schedules a replacement that takes 95 seconds to be ready. The seconds the queue bought are paid back by minutes of capacity gap.
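The same arithmetic can be run in reverse to find the deepest queue a pod can hold before it hits its memory limit. A minimal sketch using the decimal units from the paragraph above; the function name and the `baseline_bytes` parameter are assumptions for illustration.

```python
def max_queue_depth(mem_limit_bytes: int, bytes_per_request: int,
                    baseline_bytes: int = 0) -> int:
    """Largest number of pending requests that fits under the memory limit.

    baseline_bytes is whatever the process needs before any request is queued.
    """
    usable = mem_limit_bytes - baseline_bytes
    return max(0, usable // bytes_per_request)


GB, KB = 10**9, 10**3
print(max_queue_depth(16 * GB, 200 * KB))                         # whole limit available
print(max_queue_depth(16 * GB, 200 * KB, baseline_bytes=4 * GB))  # 4 GB already in use
```

Any queue-depth limit used for shedding has to sit well below this number, otherwise the OOM killer enforces the limit first.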

[Figure: Two responses to overload, queue everything vs shed early. What happens during a 40-second offered-load spike to 1.4× capacity.
Path A, queue everything (no shedding): t=0 spike, p99 = 200 ms; t=15 queue grows, p99 = 2 s; t=30 retry storm, load = 2.8×; t=60 p99 = 8 s, memory full; t=90 OOM, capacity = 0. Outcome: 0% served, fleet melts, recovery takes 8 minutes.
Path B, shed 30% at the door (load shedding): t=0 spike, shed = 0%; t=5 CPU > 80%, shed = 30%; t=15 stable, p99 = 240 ms; t=40 spike ends, shed = 0%; t=45 normal, p99 = 200 ms. Outcome: 70% served at SLO, 30% get fast 503 + cached fallback.]
Without shedding, a brief spike triggers a self-sustaining retry storm and ends in OOM. With shedding, the spike is absorbed by refusing 30% of requests fast — those clients show cached content, the served 70% stay at SLO, and the system recovers in seconds when the spike passes. Illustrative timeline, modelled on SetuStream IPL-final shape; exact numbers vary by service.

Why metastable failure is the real risk and not just bad latency: a system in normal operation has a stable equilibrium at low queue depth. A brief spike pushes it to a higher equilibrium with longer queues, but if retry traffic generated by the elevated latency exceeds the original headroom, the higher equilibrium is self-sustaining. The system stays at the bad equilibrium even after the original spike ends, because the retries it generates keep it loaded. The metastable state is broken only by external intervention (an operator cuts the offered load, for example by forcing shedding, or restarts the fleet). Load shedding prevents the system from ever entering the metastable state by refusing the excess load before queue depths climb.
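A toy discrete-time model makes the difference between the two paths concrete. It is a sketch under strong assumptions: one tick is one second, every unserved request is retried on the next tick with no attempt limit, and capacity does not degrade under load. It therefore shows only that the retry backlog outlasts the spike; true metastability also needs goodput to fall under overload, which this model leaves out. All numbers are illustrative.

```python
def overloaded_ticks(capacity, base, spike, spike_start, spike_len, ticks,
                     retry_unserved):
    """Count 1-second ticks in which some offered requests go unserved."""
    backlog = 0   # unserved requests carried into the next tick as retries
    count = 0
    for t in range(ticks):
        in_spike = spike_start <= t < spike_start + spike_len
        offered = (spike if in_spike else base) + backlog
        unserved = max(0, offered - capacity)
        if unserved:
            count += 1
        # Path A: every unserved request comes back on the next tick.
        # Path B: it is refused and the client falls back to cached content.
        backlog = unserved if retry_unserved else 0
    return count


args = dict(capacity=1000, base=900, spike=1400,
            spike_start=10, spike_len=40, ticks=300)
print("retry everything:", overloaded_ticks(retry_unserved=True, **args))
print("shed, no retry:  ", overloaded_ticks(retry_unserved=False, **args))
```

With these inputs the shed path is overloaded only for the 40 spike ticks, while the retry path stays overloaded long after the spike ends, because the backlog drains at only 100 requests per tick of spare capacity.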

A runnable load-shedding harness

The simplest correct shedding implementation refuses requests when an enforcement signal exceeds a threshold, tracks per-priority refusal rates so ops can see what is being shed, and emits proper HTTP signals so well-behaved clients back off rather than start a retry storm. Below is a runnable Python aiohttp server that demonstrates CPU-based shedding with priority tiers, simulates a 40-second IPL-shaped spike, and prints the served-vs-shed timeline.

# shed_demo.py — load-shedding harness with priority tiers and CPU enforcement
# Run: python3 shed_demo.py  (then in another shell: python3 shed_demo.py drive)
import asyncio, time, random, sys, json
from aiohttp import web, ClientSession

# ---------- shedding policy ----------
CPU_LIMIT_PCT      = 80      # shed when cpu_pct > this
QUEUE_LIMIT        = 200     # also shed when in-flight > this
TIER_PRIORITIES    = {"premium": 100, "standard": 50, "free": 10}
SHED_THRESH_BY_CPU = {70: 10, 80: 30, 90: 60, 95: 90}  # cpu_pct -> drop% for free tier

# ---------- live state, updated by a 1-Hz sampler ----------
state = {"cpu_pct": 30, "inflight": 0, "served": 0,
         "shed_by_tier": {"premium": 0, "standard": 0, "free": 0},
         "served_by_tier": {"premium": 0, "standard": 0, "free": 0},
         "lat_ms": []}

def shed_threshold_for(cpu_pct):
    """Return drop fraction to apply at the FREE tier; higher tiers scale linearly."""
    pct = 0
    for k in sorted(SHED_THRESH_BY_CPU):
        if cpu_pct >= k: pct = SHED_THRESH_BY_CPU[k]
    return pct / 100.0

def should_shed(tier):
    """Decide if this incoming request should be shed."""
    if state["inflight"] > QUEUE_LIMIT:
        return True   # queue-depth backstop, regardless of tier
    drop_at_free = shed_threshold_for(state["cpu_pct"])
    if drop_at_free <= 0:
        return False
    # scale by priority — premium rarely shed, free shed first
    pri = TIER_PRIORITIES[tier]
    pri_factor = TIER_PRIORITIES["free"] / pri    # 1.0 for free, 0.1 for premium
    return random.random() < (drop_at_free * pri_factor)

async def handler(request):
    tier = request.headers.get("x-tier", "standard")
    if should_shed(tier):
        state["shed_by_tier"][tier] += 1
        return web.Response(status=503, headers={"Retry-After": "2"},
                            text=json.dumps({"shed": True, "tier": tier}))
    state["inflight"] += 1
    t0 = time.perf_counter_ns()
    try:
        # simulate work: a non-blocking 30-50 ms sleep stands in for CPU and I/O time
        await asyncio.sleep(0.03 + random.random() * 0.02)
        state["served"] += 1
        state["served_by_tier"][tier] += 1
        state["lat_ms"].append((time.perf_counter_ns() - t0) / 1e6)
        if len(state["lat_ms"]) > 5000:
            state["lat_ms"] = state["lat_ms"][-5000:]
        return web.Response(status=200, text=json.dumps({"tier": tier, "served": True}))
    finally:
        state["inflight"] -= 1

async def cpu_sampler():
    """Update cpu_pct from inflight count — proxy for real /proc/stat sampling."""
    while True:
        # in production: read /proc/stat or psutil.cpu_percent()
        # here we model cpu as a function of inflight requests
        state["cpu_pct"] = min(99, 25 + state["inflight"] * 0.6)
        await asyncio.sleep(1)

async def reporter():
    """Every 2 s, print per-interval served/shed counts (deltas, not running totals)."""
    last = {"served": 0, "shed": dict(state["shed_by_tier"])}
    while True:
        await asyncio.sleep(2)
        shed = dict(state["shed_by_tier"])
        d = {t: shed[t] - last["shed"][t] for t in shed}   # sheds per tier this interval
        d_serv = state["served"] - last["served"]; d_shed = sum(d.values())
        p99 = (sorted(state["lat_ms"])[int(len(state["lat_ms"]) * 0.99)]
               if len(state["lat_ms"]) > 100 else 0)
        print(f"cpu={state['cpu_pct']:5.1f}%  inflight={state['inflight']:4d}  "
              f"served+={d_serv:5d}  shed+={d_shed:5d}  p99={p99:6.1f}ms  "
              f"shed[free/std/prem]={d['free']:5d}/"
              f"{d['standard']:5d}/{d['premium']:4d}")
        last = {"served": state["served"], "shed": shed}

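_bg_tasks = []   # keep references so the background tasks are not garbage-collected

async def start_background_tasks(app):
    # aiohttp awaits on_startup handlers, so they must be coroutine functions
    _bg_tasks.append(asyncio.create_task(cpu_sampler()))
    _bg_tasks.append(asyncio.create_task(reporter()))

async def init_app():
    app = web.Application()
    app.router.add_get("/api", handler)
    app.on_startup.append(start_background_tasks)
    return app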

# ---------- driver: simulate IPL spike — 600 RPS baseline, spike to 2200 RPS at t=10 ----------
async def drive():
    async with ClientSession() as sess:
        async def fire(tier):
            try:
                async with sess.get("http://localhost:8080/api",
                                    headers={"x-tier": tier}, timeout=2) as r:
                    return r.status
            except Exception:
                return 0
        t_start = time.time()
        while time.time() - t_start < 60:
            elapsed = time.time() - t_start
            rps = 2200 if 10 < elapsed < 50 else 600
            tier = random.choices(["premium", "standard", "free"],
                                  weights=[5, 35, 60])[0]
            asyncio.create_task(fire(tier))
            await asyncio.sleep(1.0 / rps)
        await asyncio.sleep(2)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "drive":
        asyncio.run(drive())
    else:
        web.run_app(init_app(), port=8080)

Sample run output during the simulated spike — server window:

$ python3 shed_demo.py &
$ python3 shed_demo.py drive
cpu= 28.6%  inflight=    6  served+= 1198  shed+=    0  p99=  46.2ms  shed[free/std/prem]=    0/    0/   0
cpu= 31.4%  inflight=   11  served+= 1204  shed+=    0  p99=  47.1ms  shed[free/std/prem]=    0/    0/   0
cpu= 78.4%  inflight=   89  served+= 2840  shed+=  321  p99=  68.8ms  shed[free/std/prem]=  301/   18/   2
cpu= 89.2%  inflight=  107  served+= 3104  shed+= 1198  p99=  72.4ms  shed[free/std/prem]= 1051/  138/   9
cpu= 91.6%  inflight=  111  served+= 3122  shed+= 1216  p99=  74.1ms  shed[free/std/prem]= 1064/  142/  10
cpu= 88.8%  inflight=  106  served+= 3098  shed+= 1184  p99=  73.2ms  shed[free/std/prem]= 1041/  133/  10
cpu= 32.1%  inflight=   12  served+= 1206  shed+=    0  p99=  47.4ms  shed[free/std/prem]=    0/    0/   0

Walking the load-bearing lines:

  • SHED_THRESH_BY_CPU = {70: 10, 80: 30, 90: 60, 95: 90} is the policy as a table: at 70% CPU, drop 10% of free-tier traffic; at 90%, drop 60%; at 95%, drop 90%. The schedule is intentionally non-linear and aggressive past 80% because the queueing knee sits around 80–85% utilisation; sliding past 80% is when latency starts climbing fast, so the drop fraction climbs fast too.
  • pri_factor = TIER_PRIORITIES["free"] / pri scales the drop fraction by tier: premium gets 10/100 = 0.1× the free-tier drop rate, standard gets 10/50 = 0.2×. At 90% CPU, free is dropped 60%, standard 12%, premium 6%. Even at 95% CPU, standard loses only 18% and premium 9%.
  • if state["inflight"] > QUEUE_LIMIT: return True is the queue-depth backstop: even premium traffic gets shed when in-flight requests exceed the queue limit. This is the safety net for when CPU is misleading (e.g. blocked on I/O, low CPU but huge queue).
  • return web.Response(status=503, headers={"Retry-After": "2"}, ...) is the client signal: HTTP 503 with a Retry-After header tells well-behaved clients to wait 2 s before retrying. A bare 503 without Retry-After invites instant retries, which defeats the shedding.
  • if random.random() < (drop_at_free * pri_factor) uses random shedding rather than oldest-first or smallest-cost-first. This avoids correlated drops (every request from one client getting shed while another client's requests all get served), which would feel like an outage to the unlucky client.

Why CPU and queue-depth are both needed as enforcement signals: CPU alone misses I/O-bound saturation — a service waiting on a slow downstream may have low CPU but an enormous in-flight count, and shedding only on CPU lets the queue fill until the OOM killer arrives. Queue depth alone misses CPU-bound saturation — a service doing heavy compute may have a moderate queue but be CPU-saturated, with each request taking 4× normal time. Either signal alone has a blind spot the other covers. Production shedding policies always combine at least these two; sophisticated ones add downstream-latency sampling and explicit deadline-propagation. The "any of these signals trip" pattern is more robust than weighted formulas because one tripped signal is enough to indicate trouble — there is no safe combination of "high CPU plus low queue" or "low CPU plus huge queue" worth waiting on.

Picking the priority signal — what to shed first

The single highest-leverage decision in load-shedding design is what priority signal to shed by. Get this right and the shed traffic is the traffic the user least cares about; get it wrong and the shed traffic is the traffic the user most cares about. The four signals every production system should consider:

By user tier. Premium / business / verified users are protected at the cost of free / anonymous / guest users. This is the simplest and most common signal. PaisaBridge's payment API protects merchants on the Enterprise plan (₹50K+/month MRR) at the cost of merchants on the Self-Serve plan during overload — the policy is published in the SLA. SetuStream protects logged-in subscribers at the cost of anonymous landing-page hits. ParakhTrade protects active intraday traders at the cost of users opening their dashboard. The dimension is whatever your billing system cares about; the policy follows.

By request type. Reads can usually be shed before writes (lost reads are recoverable, lost writes may not be). Idempotent operations can be shed before non-idempotent ones (the client retry is safe). Background jobs / batch reports / analytics queries can be shed before interactive requests. BharatBazaar's catalogue API sheds product-detail enrichment calls (related-items, recently-viewed strips) long before it sheds the product-detail base call — the user can read about a phone without seeing related accessories, but a missing base response is a broken page. The strict ordering is analytics < background < cacheable-read < uncacheable-read < idempotent-write < non-idempotent-write — shed left to right.
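The ordering above can be encoded as an ordered enum plus a cutoff. This is a sketch only: the class names mirror the list in the paragraph, while the idea of an integer `overload_level` that sheds the n lowest classes is an assumption made for illustration.

```python
from enum import IntEnum


class RequestClass(IntEnum):
    # Lower value = shed earlier; order taken from the list above.
    ANALYTICS = 0
    BACKGROUND = 1
    CACHEABLE_READ = 2
    UNCACHEABLE_READ = 3
    IDEMPOTENT_WRITE = 4
    NON_IDEMPOTENT_WRITE = 5


def should_shed_by_type(req_class: RequestClass, overload_level: int) -> bool:
    """Hypothetical policy: overload level n sheds the n lowest classes (0 sheds nothing)."""
    cutoff = min(max(overload_level, 0), len(RequestClass))
    return req_class < cutoff


# Level 2 sheds analytics and background work but keeps every read and write.
for rc in RequestClass:
    print(f"{rc.name:22s} shed={should_shed_by_type(rc, 2)}")
```

How the overload level is derived (CPU bands, queue depth, p99) is a separate decision; the enum only fixes the order in which classes are given up.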

By deadline. Each request carries an explicit deadline (e.g. x-deadline-ms: 200 from the client, or the gRPC grpc-timeout header). When the server picks the next request to process from its queue, it can drop requests whose deadline has already elapsed (waste of CPU to process them — the client has timed out anyway), and shed requests whose deadline is too tight to meet (a request with 50 ms remaining when current p99 is 240 ms cannot succeed; shed it now and free the slot). Querion's internal RPC framework propagates deadlines through every call; deadline-based shedding lets the system avoid useless work at every layer.
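The two deadline checks described above fit in one small function. A minimal sketch, assuming the deadline has already been converted to a monotonic-clock timestamp on arrival and that the server tracks an estimate of its current service time (for example its recent p99); the function name and return labels are illustrative.

```python
import time
from typing import Optional


def deadline_decision(deadline_monotonic: float, expected_service_s: float,
                      now: Optional[float] = None) -> str:
    """Classify a request by how much of its deadline is left."""
    now = time.monotonic() if now is None else now
    remaining = deadline_monotonic - now
    if remaining <= 0:
        return "drop_expired"      # the client has already timed out
    if remaining < expected_service_s:
        return "shed_too_tight"    # cannot finish in time; refuse now, free the slot
    return "serve"


# 50 ms left while the current p99 is 240 ms: shed immediately.
print(deadline_decision(deadline_monotonic=100.05, expected_service_s=0.24, now=100.0))
```

Passing `now` explicitly keeps the function testable; in a server it would be left to default to the monotonic clock.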

By resource cost. Some requests cost 100× more than others — a search query for a single token costs ~1 ms of compute, a search query with 12 facet filters and a geo-radius can cost 200 ms. When CPU is constrained, shedding the expensive long-tail queries protects the cheap-query throughput. The mechanism is request-cost classification at admission time (predict cost from request shape) plus a token-bucket cost limiter (allow up to N units of cost per second). ClearJourney's flight-search API uses this — fare-cache hits are free and never shed; full origin-destination-search-with-stopovers calls are expensive and are the first to shed under load.
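The token-bucket cost limiter mentioned above might look like the following. It is a sketch under assumptions: cost is expressed in abstract units supplied by some cost predictor that is not shown, and the class name `CostBucket` and its parameters are hypothetical.

```python
import time
from typing import Optional


class CostBucket:
    """Token bucket in which each request spends tokens equal to its predicted cost."""

    def __init__(self, units_per_s: float, burst_units: float,
                 now: Optional[float] = None):
        self.rate = units_per_s
        self.capacity = burst_units
        self.tokens = burst_units
        self.last = time.monotonic() if now is None else now

    def try_admit(self, cost_units: float, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost_units <= self.tokens:
            self.tokens -= cost_units
            return True
        return False   # not enough budget: shed this request


bucket = CostBucket(units_per_s=100, burst_units=200, now=0.0)
print(bucket.try_admit(200, now=0.0))   # expensive query drains the bucket
print(bucket.try_admit(1, now=0.0))     # nothing left at the same instant
print(bucket.try_admit(1, now=0.5))     # half a second of refill admits a cheap query
```

Because admission depends on the request's own cost, cheap requests keep flowing on a nearly empty bucket while expensive ones are refused, which is the protective behaviour the paragraph describes.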

The practical reality is that production shedders combine at least two signals — typically tier + cost — into a composite priority score, and shed in score order. The design exercise is figuring out which signals matter for your service, not adopting any single canonical scheme.
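One way to combine tier and cost into a single score is sketched below. The formula (tier weight divided by predicted cost) is an assumption chosen for illustration, not a canonical scheme; the tier weights reuse the values from the harness earlier in the article.

```python
from dataclasses import dataclass

TIER_WEIGHT = {"free": 10, "standard": 50, "premium": 100}


@dataclass
class Req:
    tier: str
    cost_units: float   # predicted cost; 1.0 is a cheap request


def priority_score(req: Req) -> float:
    """Higher score = more protected. Expensive requests lose priority within a tier."""
    return TIER_WEIGHT[req.tier] / max(req.cost_units, 1.0)


def pick_to_shed(requests, n):
    """Return the n lowest-scoring requests, i.e. the first ones to refuse."""
    return sorted(requests, key=priority_score)[:n]


reqs = [Req("premium", 1), Req("free", 1), Req("standard", 100), Req("free", 50)]
for r in pick_to_shed(reqs, 2):
    print(r)
```

Under this particular formula a very expensive standard-tier request is shed before a cheap free-tier one; whether that trade is acceptable is exactly the design decision the paragraph refers to.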

[Figure: Four priority signals for load shedding, with shedding order from least- to most-protected. Pick a priority signal: shed left, protect right.
By tier (PaisaBridge payments, SetuStream): free, standard, premium, enterprise.
By request type (BharatBazaar catalogue, generally applicable): analytics, background, cache-read, live-read, idem-write, non-idem write.
By deadline (gRPC deadline propagation, Querion-style): expired (drop free), tight (< p99), healthy (> p99), generous (> 5×).
By cost: heavy, moderate, light, free.]
Four production-validated priority dimensions. Most production systems combine at least two — typically tier and request type, or tier and deadline — into a composite score. The schema is not universal; the discipline of picking a schema that matches your billing and your reliability promises *is* universal.

The client side — making shed traffic stay shed

A shedding server that returns HTTP 503 to a client that instantly retries five times has turned 1 dropped request into 6: the original plus five retries. The client's response to a shed signal is half of the problem, and unlike the server side, you usually do not control all the clients. Three layers of defence:

The HTTP signal must be unambiguous. Use HTTP 503 (Service Unavailable) for shed traffic, not 500 (server error), 502 (bad gateway), or 429 (too many requests, which means you specifically are over your rate limit). 503 explicitly means "the service is overloaded, your request is fine, retry later". Always include a Retry-After: <seconds> header. HTTP client stacks (Retrofit on Android, URLSession on iOS, the Go net/http package, browser fetch) generally do not act on Retry-After by themselves; the retry layer you put on top (an interceptor, middleware, or manual handling) has to read the header and wait. The number you put there matters: too small (Retry-After: 1) and the retry hits while you are still overloaded; too large (Retry-After: 60) and the user gives up. Production systems set it dynamically, as min(60, recovery_time_estimate), based on how long the shedding policy expects to be active.
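The min(60, recovery_time_estimate) rule can be sketched as a helper that produces the header. How the recovery estimate is obtained is outside this sketch; the function names, the 1-second floor and the rounding up are assumptions. The header value is written as whole seconds, which is the delay form Retry-After accepts.

```python
import math


def retry_after_seconds(recovery_time_estimate_s: float,
                        floor_s: int = 1, cap_s: int = 60) -> int:
    """min(cap, estimate), rounded up to whole seconds and never below the floor."""
    return max(floor_s, min(cap_s, math.ceil(recovery_time_estimate_s)))


def shed_response_headers(recovery_time_estimate_s: float) -> dict:
    """Headers to attach to a 503 sent for a shed request."""
    return {"Retry-After": str(retry_after_seconds(recovery_time_estimate_s))}


print(shed_response_headers(12.2))
print(shed_response_headers(500))
```

Rounding up rather than down keeps the hint on the conservative side: a client that waits slightly too long costs less than one that returns while shedding is still active.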

Client-side jitter on retry is mandatory. Without jitter, every shed client retries at exactly now + Retry-After seconds, generating a synchronised retry spike that is just the shed wave delayed. The fix is sleep(random.uniform(0.5, 1.5) * retry_after), which spreads the retry wave across a window as wide as retry_after itself. PaisaBridge's mobile SDK applies 0.5×–2× jitter on every retry; their post-spike retry pattern is a smooth ramp rather than a wall.
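The jitter formula above, written as a helper. This is a minimal sketch: the 0.5 and 1.5 factors come from the text, while the injectable `rng` parameter is an addition that makes the behaviour reproducible in tests.

```python
import random


def jittered_delay(retry_after_s: float, low: float = 0.5, high: float = 1.5,
                   rng=random) -> float:
    """Delay before retrying: the server's hint scaled by a uniform random factor."""
    return retry_after_s * rng.uniform(low, high)


rng = random.Random(42)   # seeded only so the demo is repeatable
delays = [jittered_delay(2.0, rng=rng) for _ in range(5)]
print([round(d, 2) for d in delays])
```

One design note: a factor below 1 lets some clients retry before the server's hint has elapsed. A client that wants to treat Retry-After as a strict minimum can pass low=1.0 and keep the spread on the late side only.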

Client-side circuit breakers cap absolute retries. Even with Retry-After and jitter, a client that keeps retrying forever during a 30-minute outage is itself an attacker. Mature clients implement circuit breakers: after 3 consecutive 503 responses with elapsed time > 10 s, stop retrying, surface a "service unavailable, try later" UI, and let the user decide. The Hystrix circuit-breaker pattern (open / half-open / closed states) is the canonical model; modern equivalents in resilience4j (Java), Polly (.NET), pybreaker (Python), and gobreaker (Go) ship the same pattern. The circuit breaker on the client prevents the retry storm even when the server's signal is ignored or buggy.
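The open / half-open / closed state machine can be sketched in a few lines. This is a simplified illustration, not the API of any of the libraries named above: it opens after 3 consecutive 503s, stays open for 10 seconds, then lets a single probe through. The class name, parameters and injectable clock are assumptions.

```python
import time


class ClientBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=3, open_for_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_for_s = open_for_s
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.open_for_s:
                return False               # still cooling down: do not call the server
            self.state = self.HALF_OPEN    # cool-down over: let exactly one probe through
            return True
        if self.state == self.HALF_OPEN:
            return False                   # a probe is already outstanding
        return True

    def record(self, status: int) -> None:
        if status == 503:
            self.failures += 1
            if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = self.OPEN
                self.opened_at = self.clock()
        else:
            self.failures = 0
            self.state = self.CLOSED
```

A caller checks allow_request() before each attempt and reports the outcome with record(); while the breaker is open the client shows its "try later" UI instead of sending traffic.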

The asymmetric reality is that you control your own first-party clients but not third-party ones. For first-party (your iOS app, your Android app, your web SPA), you ship Retry-After-aware retry logic, jitter, and circuit breakers in a shared SDK and enforce it through SDK upgrade requirements. For third-party (partner integrations, public API consumers), you can only signal — and the worst-behaved third-party client becomes a denial-of-service vector during overload. The defence at that layer is per-API-key rate limiting separate from load shedding: an API key that exceeds its budget gets 429 (their fault) rather than 503 (your fault), and the per-key budget is small enough that no single client can amplify a shed event.

A subtle detail: when your service is itself a client of an upstream that sheds, your retry logic is also part of the system's overall behaviour. A backend that aggressively retries upstream 503s during an upstream's overload contributes to the upstream's metastable failure. The discipline is symmetric — every layer must respect shedding signals from every adjacent layer, or the chain becomes a retry-amplifier.

Common confusions

  • "Load shedding is the same as rate limiting." No — rate limiting enforces a per-client budget regardless of system state; an API key gets 429 once it exceeds its budget even when the system is idle. Load shedding enforces a system-state-dependent policy; the same API key gets served when the system is healthy and shed when the system is overloaded. The two layer together: rate limiting first (prevents one client from being the spike), then load shedding (handles the spike when the aggregate of all rate-limited clients is still too much).
  • "HTTP 429 and 503 mean the same thing." No — 429 ("Too Many Requests") means "this client specifically has exceeded their rate limit"; the right client response is to slow down. 503 ("Service Unavailable") means "the service is overloaded, your request is fine"; the right client response is to retry later. Mixing them confuses clients into wrong retry behaviour. Use 429 for per-client limits, 503 for capacity-driven shedding.
  • "If you autoscale, you don't need load shedding." No — autoscaling cold-start time (60–120 s for typical pods, longer for stateful services) is much longer than the duration of most production spikes (5–60 s). Load shedding handles the gap between "spike begins" and "new replicas are ready". Even with infinite cloud capacity, you need shedding for the autoscaler-lag window. SetuStream's IPL-final shedding policy fires every match because the spike ends before any autoscaler-spawned pod is warm.
  • "Shedding is a sign of bad capacity planning." No — shedding is the rational response to demand variance that is too expensive to over-provision for. The IPL final's peak demand is 10× the normal day's demand; provisioning for the peak means 90% of capacity is idle 99% of the time. Shedding lets you provision for some sensible percentile of demand and gracefully degrade for the tail above it. Capacity planning's job is choosing the percentile; shedding's job is what to do above it.
  • "Random shedding is unfair; we should shed the slowest requests." No — "shed the slowest" sounds appealing but creates a perverse incentive: a client that crafts a slow request gets shed, but a client crafting a normal request gets served, so adversarial clients learn to spam normal-shaped requests during overload. Random shedding (uniform within a tier) is harder to game and statistically fair. Shed by cost prediction (request shape) rather than by observed slowness.
  • "You can shed retroactively — drop requests that have been queued too long." This is correct but not a substitute for admission-control shedding. Dropping queue-tail requests (LIFO with a max age, or timeout-based eviction) frees CPU for fresher requests, but it does not prevent the queue from growing in the first place: the requests still consumed memory, parsing time, and connection slots before being dropped. Combine the two: admission control prevents the queue from growing past the limit, queue-tail eviction handles the moments when admission control is too slow.
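The layering of rate limiting and load shedding, and the 429 versus 503 distinction, can be summarised in a single admission function. This is a deliberately simplified sketch: a real shedder refuses a fraction of traffic by priority rather than everything, and the function name and parameters are hypothetical.

```python
def admission_status(key_requests_this_window: int, key_budget: int,
                     system_overloaded: bool) -> int:
    """Return the HTTP status an admission layer would answer with."""
    # Rate limiting runs first and ignores system state: over budget is the client's fault.
    if key_requests_this_window >= key_budget:
        return 429
    # Load shedding depends on system state: the request is fine, the service is not.
    if system_overloaded:
        return 503
    return 200


print(admission_status(5, 10, system_overloaded=False))    # within budget, healthy
print(admission_status(10, 10, system_overloaded=False))   # over budget on an idle system
print(admission_status(5, 10, system_overloaded=True))     # within budget, overloaded
```

The order of the two checks matters: a key that is over budget gets 429 even during overload, so the client learns to slow down rather than simply retry later.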

Going deeper

Adaptive concurrency limiting — Streamora's Concurrency-Limits library

Static thresholds (CPU > 80% → shed 30%) are easy to reason about but require manual tuning per service per traffic shape. Streamora's Concurrency-Limits library (open-sourced 2018, ports in Java/Go/Rust) takes a different approach: it treats the service as a black box, samples response time, and applies a TCP-style additive-increase-multiplicative-decrease (AIMD) algorithm to discover the maximum concurrency that keeps response time stable. When response time creeps up, the limit decreases; when response time is stable, the limit slowly increases. The advantage is that the limit adapts automatically to current downstream behaviour: when a downstream dependency gets slower, the limit drops without manual reconfiguration. The disadvantage is that the algorithm needs a few seconds of feedback to converge, so it lags sharp spikes. Production systems combine both: static CPU thresholds for the fast-spike case, AIMD for the slow-drift case.
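The AIMD idea itself is small enough to sketch. The class below is an illustration of additive increase and multiplicative decrease on a concurrency limit, not the library's actual API; the latency target, the 0.9 backoff factor and all names are assumptions.

```python
class AimdLimit:
    """Concurrency limit that grows by 1 on good samples and shrinks by a factor on bad ones."""

    def __init__(self, initial=20, min_limit=1, max_limit=1000,
                 latency_target_s=0.2, backoff=0.9):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target_s = latency_target_s
        self.backoff = backoff

    def on_sample(self, latency_s: float, dropped: bool = False) -> int:
        if dropped or latency_s > self.latency_target_s:
            # Multiplicative decrease: back off quickly when latency degrades.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Additive increase: probe for more headroom one slot at a time.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit


limiter = AimdLimit()
print(limiter.on_sample(0.10))   # under target: limit grows by one
print(limiter.on_sample(0.50))   # over target: limit is cut by 10%
```

A server would refuse (shed) any request arriving while in-flight count is at the current limit, and feed each completed request's latency back through on_sample.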

Cooperative shedding via deadline propagation — gRPC and the Querion RPC stack

Querion's internal RPC framework (Stubby, then gRPC) propagates a deadline header (grpc-timeout) through every call. When request A from the user calls service B which calls service C, the deadline shrinks at each hop — if the user's request has 200 ms total, and service B has spent 50 ms, the call to C carries grpc-timeout: 150m. Each service can then shed requests whose deadline is too tight to be useful — at service C, an incoming request with grpc-timeout: 30m while p99 of the operation is 120 ms is shed immediately because it cannot succeed. The propagation creates a system-wide cooperative shedding policy that protects every hop without explicit coordination. Implementing this requires deadline-aware client libraries at every layer; the up-front investment is large, but the resulting overload behaviour is qualitatively better than per-service-isolated shedding.
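The hop-by-hop arithmetic from the example above can be written out as follows. Real gRPC libraries propagate deadlines for you; this sketch only illustrates the bookkeeping, and the helper names are hypothetical. The header format (an integer followed by a unit, with `m` meaning milliseconds) follows the gRPC wire convention used in the paragraph.

```python
def remaining_budget_ms(total_ms: int, spent_ms: int) -> int:
    """Time left for downstream calls after this hop has used spent_ms."""
    return max(0, total_ms - spent_ms)


def grpc_timeout_header(remaining_ms: int) -> str:
    """Format a millisecond budget as a grpc-timeout value, e.g. 150 -> '150m'."""
    return f"{remaining_ms}m"


def shed_on_arrival(remaining_ms: int, p99_ms: int) -> bool:
    """Refuse immediately when the budget is smaller than the operation's p99."""
    return remaining_ms < p99_ms


# Service B received 200 ms, spent 50 ms, and now calls service C.
budget = remaining_budget_ms(total_ms=200, spent_ms=50)
print("grpc-timeout:", grpc_timeout_header(budget))
print("C sheds a 30 ms request at p99 = 120 ms:", shed_on_arrival(30, 120))
```

Comparing against p99 is a conservative choice taken from the paragraph; a service willing to gamble on the median could compare against p50 instead and shed less.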

Aperture — the per-service rate-limiter mesh

FluxNinja's Aperture (open-source, 2023+) and Cabline's Envoy global rate limiting are sidecar/control-plane systems that put rate-limiting and shedding decisions in a separate component, with policies defined declaratively (YAML / OPA / Rego). The advantage over in-process shedding is policy uniformity across a polyglot fleet — a Java service and a Go service share the same shedding policy enforced by their Envoy sidecar, with no per-language reimplementation. The control plane also enables global rate limiting (rate limit by API key across the entire fleet, not per-replica), which prevents the failure mode where a client routed to 100 replicas gets 100× their per-replica limit. The trade-off is operational complexity — the control plane is now a critical-path dependency, and its failure mode (shedding-policy not reachable) must itself be designed for.

Sample reproduction — measuring the metastable failure mode

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install aiohttp

# Terminal 1 — start the shedding server
python3 shed_demo.py

# Terminal 2 — drive the IPL-shaped spike
python3 shed_demo.py drive

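# Now disable shedding and re-run. CPU_LIMIT_PCT is not read by should_shed(),
# so empty the drop schedule and lift the in-flight backstop instead.
# Stop the Terminal 1 server first so port 8080 is free, then compare p99 and
# in-flight against the first run.
sed -i -e 's/^SHED_THRESH_BY_CPU = .*/SHED_THRESH_BY_CPU = {}/' \
       -e 's/^QUEUE_LIMIT .*/QUEUE_LIMIT = 10**9/' shed_demo.py
python3 shed_demo.py & sleep 1; python3 shed_demo.py drive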

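The contrast between the two runs is the entire pedagogical content: with shedding, p99 stays under 100 ms even at 2200 RPS offered load; without it, nothing stops latency and in-flight count from climbing until the event loop, the driver's 2-second timeout, or memory becomes the limit. How far they climb depends on the machine, because the handler's simulated work is a non-blocking sleep rather than real CPU load. The same Python script demonstrates both the disease and the cure.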

What "graceful degradation" actually means at the product layer

Load shedding at the infrastructure layer pairs with graceful degradation at the product layer — the UI's response to a 503 should not be a blank error page. SetuStream's home screen, when the recommendation API sheds, falls back to a static editorial carousel served from a CDN edge cache. BharatBazaar's product detail page, when the related-items strip sheds, hides the strip without showing a broken section. PaisaBridge's checkout, when the secondary fraud-check API sheds, silently skips the secondary check (the primary one is mandatory). The product-side graceful degradation is what makes shedding invisible to most users — they get cached or partial content instead of an error page. Without product-side fallbacks, every shed request becomes a user-visible error, and the perceived availability drops in lock-step with the shed rate. With product-side fallbacks, perceived availability stays close to 100% even when actual served-request rate drops to 70%.
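On the client, the fallback logic described above reduces to "use live content on success, otherwise use what you already have". A minimal sketch; `render_section` and the (status, body) return shape of `fetch` are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def render_section(fetch: Callable[[], Tuple[int, Optional[str]]],
                   fallback: Optional[str]) -> Optional[str]:
    """Return live content on HTTP 200, otherwise the fallback.

    A fallback of None means "hide the section" rather than show an error.
    """
    try:
        status, body = fetch()
    except Exception:
        return fallback            # network error or timeout: degrade, do not crash
    if status == 200 and body is not None:
        return body
    return fallback                # 503 (shed) or any other failure


print(render_section(lambda: (200, "personalised strip"), "editorial carousel"))
print(render_section(lambda: (503, None), "editorial carousel"))
print(render_section(lambda: (503, None), None))   # optional strip is simply hidden
```

This only works for sections that are safe to degrade; mandatory steps, such as the primary fraud check in the example above, need a different treatment.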

Where this leads next

Load shedding is one half of the overload-handling pair; backpressure (the downstream-to-upstream signal that flow-controls earlier in the chain) is the other.

The closing rule: a system without explicit load shedding has implicit load shedding via OOM kills, autoscaler exhaustion, and retry storms — and the implicit shedding is always worse than the explicit kind. Pick the priority signal (tier / type / deadline / cost), pick the enforcement signal (CPU / queue depth / p99), pick the client signal (HTTP 503 + Retry-After + jitter + circuit breaker), measure what fraction of demand you shed during the worst hour of the worst day, and own that number as a service-level SLI. The systems that survive Mega Bargain Days, IPL finals, and Tatkal hour are the ones that decided in advance which 30% to refuse.

References