Degradation modes
At 20:14 IST on a Saturday, MealRush's recommendation backend — the service that decides which 12 restaurants land on the home screen — loses the read replica it leans on for personalisation features. The replica did not crash; it stopped accepting connections at 9 800 active sockets because the connection-pool ceiling was 10 000 and the application kept opening new ones to chase tail-latency. Personalisation queries start timing out at 600 ms each. The recommendation service has two choices in that moment. Choice A: surface the timeout, return 503 Service Unavailable, and let the home screen show a "We are having trouble" banner — true to the failure, useless to the user, and the App-Store rating drops six basis points by Sunday. Choice B: detect the personalisation timeout in 50 ms, fall back to a cached feature vector that is 11 minutes stale, blend it with a non-personalised popularity ranker, label the response degraded=true in the trace header, and ship 12 restaurants that are 70% as good as a fresh personalised list. The user sees a working app. The on-call engineer sees a single dashboard widget turn amber. MealRush ships ₹4.2 crore of orders that evening that they would have lost on Choice A. Degradation modes are the engineering of Choice B — deciding in advance which features can be sacrificed, in what order, with what fallback, and how the service signals that it is operating in a degraded state.
A degraded service returns a worse-but-still-useful response when it cannot meet its full-quality SLO. Degradation is encoded as a ladder of operating modes — from fresh+personalised (mode 0) to cached (mode 1) to default-bucket (mode 2) to "service unavailable" (mode 3) — with explicit triggers, fallback responses, and observable signals. Done well, p99 stays inside SLO under 5× overload; done badly, the degradation paths rot because they only run at 03:00 IST during incidents and break the next time you need them.
What degradation actually is — a contract negotiation
A non-degraded service makes one promise: "I return the right answer in time X with probability ≥ p". A degraded service makes a family of promises ranked by quality: "I return the full answer in time X if I can; if I cannot, I return a worse answer in the same time X; if I cannot do that either, I return 503 in 1 ms." The contract is preserved in latency and availability, sacrificed in quality. This is the opposite of load shedding, which sacrifices availability (some requests get 503) to preserve quality (admitted requests get the full answer). The two are complementary — most production services run both, with degradation tried first and shedding as the floor when even the cheapest degradation path is overloaded.
Why "degradation" is a contract, not a hack: a fallback that is not in the SLO is not a degradation — it is a bug. If you return a stale cache "sometimes" without declaring that the cache is part of the contract, callers cannot reason about what they get. Production-grade degradation requires three things in writing: (1) an explicit mode ladder with named modes, (2) a trigger for moving between modes that is reproducible and testable, (3) a signal in the response (header, trace tag, metric label) that says which mode served this request. Without all three, you have a fallback path that nobody exercises and nobody trusts.
A useful mental model is the CAP-style trade-off applied within a single service:
- Under normal conditions, you want fresh + personalised + complete (the full quality answer).
- Under partial failure, you choose two of: fresh, personalised, complete.
- Under more failure, you choose one.
- Under total failure, you give up and shed.
Each step down the ladder is a deliberate loss. The art is deciding what to lose first.
Triggers — how a service decides to step down
A trigger is the predicate that flips the operating mode. Picking the wrong trigger is the most common failure: by the time CPU% has crossed a threshold, the queues are already full and the latency damage is done. Like the load-shedder signals from the previous chapter, triggers should react in milliseconds and measure symptoms (latency, error rate, queueing delay) rather than causes (CPU, memory).
Trigger taxonomy, ranked by how production-fit they are:
- Per-dependency latency or error rate — measure each downstream call's p99 and error rate over a 10-second sliding window. When the feature store's p99 crosses 200 ms or its error rate crosses 5%, step down to the cache. When the cache's error rate crosses 5%, step down to the popularity ranker. This is per-dependency because partial failures are the norm — losing one feature should not collapse the whole response.
- Saturation thresholds — CPU PSI > 80%, in-flight concurrency > 0.9 of the configured limit, queue depth > target. Saturation triggers fire when the service itself is the bottleneck. Pair them with degradation paths that do less work (skip the personalisation step, return fewer items, drop the ranker rerank stage) rather than fallback paths that hit a different system.
- Deadline-budget triggers — every incoming request carries an x-deadline-ms header (chapter 45). Before each downstream call, check remaining_budget < dependency_p99 * 1.5. If yes, skip the dependency and use its fallback. This is the most surgical trigger because it is per-request, not service-wide — only requests that genuinely cannot afford the dependency take the fallback (see the sketch after this list).
- Manual feature flags — every degradation mode has a kill switch (feature_flag.recommendations.skip_personalisation = true). The on-call engineer flips it during an incident, the change propagates to all instances in 5–10 seconds via the config service, and the service is degraded by fiat. Used during incidents that the automatic triggers do not detect — for example a downstream service is fast but returning wrong answers.
- Time-of-day or load-bucket triggers — degrade preemptively during known-hot windows. CricStream's recommendation service has reduced personalisation depth between 19:30 and 22:30 IST during cricket finals because the alternative is to provision 4× the fleet for two hours a day. This is degradation as capacity planning, not as failure handling.
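A minimal sketch of the deadline-budget check referenced in the list above. It assumes the deadline arrives as an absolute epoch-milliseconds value and that a recent p99 for the dependency is at hand; the helper names and the stub dependency are invented for illustration.
# Per-request deadline-budget trigger: skip a dependency when the remaining budget
# cannot plausibly cover it. Only this request takes the fallback; the fleet stays in mode 0.
import asyncio, time

def cached_features() -> dict:
    return {"vec": None, "stale": True}            # stand-in fallback

async def call_feature_store() -> dict:
    await asyncio.sleep(0.08)                      # stand-in for the real dependency
    return {"vec": [0.42] * 64, "stale": False}

def should_skip(deadline_epoch_ms: float, dependency_p99_ms: float, safety: float = 1.5) -> bool:
    """True when the remaining per-request budget cannot plausibly cover the call."""
    remaining_ms = deadline_epoch_ms - time.time() * 1000
    return remaining_ms < dependency_p99_ms * safety

async def fetch_features(deadline_epoch_ms: float, feat_store_p99_ms: float) -> dict:
    if should_skip(deadline_epoch_ms, feat_store_p99_ms):
        return cached_features()                   # per-request fallback
    return await call_feature_store()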
Why per-dependency triggers beat service-wide triggers: a service-wide trigger ("if anything is bad, degrade everything") gives up too much under partial failure. A microservice that depends on six downstreams might find the feature store slow, the user-profile service healthy, and the inventory service intermittent — a per-dependency trigger lets each downstream's fallback fire independently, so the response is degraded only on the dimensions where the dependencies fail. Most fan-out responses can survive losing 1–2 dimensions and still be 80% as useful.
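Per-dependency fallbacks on a fan-out can be expressed compactly. Here is one sketch using asyncio; the three dependencies, their latencies, and their fallback values are invented, and this is not any specific framework's API.
# Fan out to several dependencies; each gets its own timeout and its own fallback,
# so losing one dimension degrades only that dimension.
import asyncio

async def with_fallback(coro, timeout_s: float, fallback):
    try:
        return await asyncio.wait_for(coro, timeout_s)
    except Exception:
        return fallback

async def fake_dep(name: str, latency_s: float) -> dict:
    await asyncio.sleep(latency_s)
    return {name: "fresh"}

async def build_response() -> dict:
    profile, features, inventory = await asyncio.gather(
        with_fallback(fake_dep("profile", 0.02), 0.10, {"profile": "default"}),
        with_fallback(fake_dep("features", 0.50), 0.10, {"features": "cached"}),   # too slow, falls back
        with_fallback(fake_dep("inventory", 0.03), 0.10, {"inventory": "assume-in-stock"}),
    )
    return {**profile, **features, **inventory}

print(asyncio.run(build_response()))
# prints {'profile': 'fresh', 'features': 'cached', 'inventory': 'fresh'}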
A working degradation harness — Python with three modes and signals
# degradation_demo.py — three-mode degradation harness with per-dep triggers
import asyncio, time, random
from dataclasses import dataclass, field
from collections import deque
@dataclass
class HealthWindow:
"""Sliding-window p99 and error-rate estimator (10s, EWMA-style for simplicity)."""
samples: deque = field(default_factory=lambda: deque(maxlen=200))
errors: deque = field(default_factory=lambda: deque(maxlen=200))
def record(self, latency_ms: float, error: bool):
self.samples.append(latency_ms)
self.errors.append(1 if error else 0)
def p99_ms(self) -> float:
if not self.samples: return 0.0
s = sorted(self.samples)
return s[int(0.99 * (len(s) - 1))]
def error_rate(self) -> float:
if not self.errors: return 0.0
return sum(self.errors) / len(self.errors)
@dataclass
class DegradationController:
feat_store_p99_target: float = 200.0
feat_store_err_target: float = 0.05
cache_err_target: float = 0.05
feat_store: HealthWindow = field(default_factory=HealthWindow)
cache: HealthWindow = field(default_factory=HealthWindow)
def current_mode(self) -> int:
if self.cache.error_rate() > self.cache_err_target:
return 2 # cache also failing → popularity
if self.feat_store.p99_ms() > self.feat_store_p99_target \
or self.feat_store.error_rate() > self.feat_store_err_target:
return 1 # feature store unhealthy → cache
return 0 # all healthy → fresh + personalised
async def call_feature_store(latency_ms: float, error_rate: float) -> dict:
await asyncio.sleep(latency_ms / 1000)
if random.random() < error_rate:
raise RuntimeError("feature_store_5xx")
return {"vec": [0.42] * 64, "fresh": True}
async def call_cache() -> dict:
await asyncio.sleep(0.020) # 20ms
return {"vec": [0.39] * 64, "fresh": False, "stale_minutes": 11}
def popularity_ranker(city: str) -> dict:
return {"vec": None, "items": ["paneer-tikka", "biryani", "dosa"], "fresh": False}
async def serve(controller, city: str, fs_lat: float, fs_err: float):
mode = controller.current_mode()
started = time.monotonic()
if mode == 0:
try:
t0 = time.monotonic()
features = await call_feature_store(fs_lat, fs_err)
controller.feat_store.record((time.monotonic() - t0) * 1000, False)
return {"mode": 0, "items": [f"r{i}" for i in range(12)],
"x_degraded": "0", "latency_ms": (time.monotonic() - started) * 1000}
except Exception:
controller.feat_store.record((time.monotonic() - t0) * 1000, True)
mode = 1 # this request falls through to mode 1
if mode == 1:
try:
features = await call_cache()
controller.cache.record(20, False)
return {"mode": 1, "items": [f"r{i}" for i in range(12)],
"x_degraded": "1", "latency_ms": (time.monotonic() - started) * 1000}
except Exception:
controller.cache.record(20, True)
mode = 2
return {"mode": 2, "items": popularity_ranker(city)["items"],
"x_degraded": "2", "latency_ms": (time.monotonic() - started) * 1000}
async def main():
ctrl = DegradationController()
print("PHASE 1 — feature store healthy (p99=80ms, err=1%)")
results = await asyncio.gather(*[serve(ctrl, "blr", 80, 0.01) for _ in range(200)])
by_mode = {m: 0 for m in range(3)}
for r in results: by_mode[r["mode"]] += 1
print(f" mode counts: {by_mode} fs_p99={ctrl.feat_store.p99_ms():.0f}ms")
print("PHASE 2 — feature store degraded (p99=400ms, err=8%)")
for _ in range(50): ctrl.feat_store.record(400, random.random() < 0.08)
results = await asyncio.gather(*[serve(ctrl, "blr", 400, 0.08) for _ in range(200)])
by_mode = {m: 0 for m in range(3)}
for r in results: by_mode[r["mode"]] += 1
print(f" mode counts: {by_mode} fs_p99={ctrl.feat_store.p99_ms():.0f}ms")
asyncio.run(main())
Sample run on Python 3.11:
PHASE 1 — feature store healthy (p99=80ms, err=1%)
mode counts: {0: 198, 1: 2, 2: 0} fs_p99=84ms
PHASE 2 — feature store degraded (p99=400ms, err=8%)
 mode counts: {0: 0, 1: 200, 2: 0} fs_p99=400ms
Per-line walkthrough. HealthWindow keeps the last 200 samples (latency and error indicator) per dependency — small enough to react fast, large enough that a single outlier does not flip the mode. current_mode() is the trigger: it inspects the windows and decides which mode the service is in right now. Note the cascading checks — the cache's error rate is checked first because if both the feature store and the cache are dying, mode 2 is the only option.
serve() has the fall-through pattern: if the controller says mode 0 but the feature store call fails for this request, the request itself falls through to mode 1 even though the controller has not flipped yet. Why per-request fall-through plus controller-level mode: the controller-level mode debounces the trigger across many requests (the window has to fill with bad samples before it flips), and the per-request fall-through handles the case where one request hits a transient error inside an otherwise-healthy mode. Together they give you both stability and instantaneous responsiveness — most production degradation harnesses I have read encode this two-layer pattern.
The output shows the mechanism: in phase 1, ~99% of requests serve mode 0, with the few feature-store failures falling through to mode 1 for that request only. Before phase 2 starts, the 50 seeded 400 ms samples push the feature store's windowed p99 past the 200 ms target, so the controller reports mode 1 from the first request and every phase-2 request serves from the cache; mode 2 would appear only if the cache itself started failing, which this demo's call_cache never does. One gap worth noticing: once the controller is in mode 1, serve() never calls the feature store again, so the window can never observe recovery — the probe sketch below is the usual fix.
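A sketch of that probe traffic, meant to sit alongside serve() and call_feature_store() in degradation_demo.py; the 2% probe rate is an arbitrary illustration.
# Recovery probing: while degraded, send a small fraction of requests down the
# primary path so the health window keeps seeing fresh samples and can flip back.
import random, time

PROBE_RATE = 0.02

async def serve_with_probe(controller, city, fs_lat, fs_err):
    mode = controller.current_mode()
    if mode >= 1 and random.random() < PROBE_RATE:
        # Probe the feature store even though the controller says "degraded",
        # recording the outcome so the window can observe recovery.
        t0 = time.monotonic()
        try:
            await call_feature_store(fs_lat, fs_err)
            controller.feat_store.record((time.monotonic() - t0) * 1000, False)
        except Exception:
            controller.feat_store.record((time.monotonic() - t0) * 1000, True)
    return await serve(controller, city, fs_lat, fs_err)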
Brownouts vs blackouts — the operational distinction
A blackout is total unavailability — the service returns 5xx for every request, the dashboard goes red, the on-call gets paged. Mean time to detect: seconds. Mean time to repair: minutes to hours. Customer impact: total during the window.
A brownout is partial degradation — the service is up, returning some answers in degraded modes, the dashboard turns amber, the on-call gets a warning rather than a page. Mean time to detect: requires the right metric to be plotted. Mean time to repair: variable, often "monitor and let auto-recovery handle it". Customer impact: subtle, usually invisible to most users.
The hard truth: brownouts cause more total customer harm than blackouts at large services, because blackouts are detected and fixed in 30 minutes while brownouts can run for hours unnoticed. KapitalKite's portfolio-page service ran in a degraded mode for 11 days in March 2026 because the personalised "what to invest next" rail had silently fallen back to a city-wide popularity ranker — a deploy three weeks earlier had broken the personalisation feature store's circuit breaker, the breaker was permanently open, the popularity ranker was serving every request, and no metric existed that flagged "we are running in mode 2 100% of the time". The trade was 0.7% lower click-through across 11 days × 4.1M daily-active users × ₹3 800 average trade size — meaningful revenue loss, invisible to monitoring. The fix was not a code change; it was adding degradation_mode as a label on the requests-served metric so the dashboard would have shown a flat amber line for the whole period.
The operational implication: every degradation mode must produce a metric, every metric must be on a dashboard, every dashboard must have an alert that fires when the mode-distribution shifts. A reasonable alert is "more than 10% of traffic in mode ≥ 1 for more than 5 minutes" — that thresholds out healthy fall-throughs and alerts on sustained degradation. The metric label that makes this work is one extra dimension on requests_served_total{mode="0|1|2|3"}. KapitalKite's 11-day brownout is the textbook reason every engineer reads this paragraph and adds the label by Friday.
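A sketch of that label, assuming the prometheus_client library; the metric name follows the paragraph above, and the PromQL expression in the trailing comment is one plausible encoding of the 10%-for-5-minutes rule.
# Mode-labelled request counter. One extra label dimension is the entire fix for the
# KapitalKite-style invisible brownout described above.
from prometheus_client import Counter

REQUESTS = Counter(
    "requests_served_total",
    "Requests served, labelled by degradation mode",
    ["mode"],
)

def record_request(mode: int) -> None:
    REQUESTS.labels(mode=str(mode)).inc()

# A possible alerting rule (PromQL), shown here as a comment:
#   sum(rate(requests_served_total{mode!="0"}[5m]))
#     / sum(rate(requests_served_total[5m])) > 0.10
# fires when more than 10% of traffic has been served degraded for 5 minutes.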
Common confusions
- "Degradation is the same as load shedding." Shedding rejects requests with 503 to preserve the quality of accepted requests. Degradation accepts every request and returns a worse-quality response. Both are admission-control patterns, but they sit on opposite sides of a knob: shedding chooses which requests to serve, degradation chooses how to serve them. Production services run both — degradation first to extract quality from a constrained service, shedding last when even the cheapest degradation is overloaded.
- "A fallback is a degradation mode." A fallback is the mechanism — try the cache if the feature store throws. A degradation mode is the contract — "mode 1 is documented as 70% of mode-0 quality, signalled to the client, and alerted on if it dominates traffic". A fallback without a contract rots: nobody monitors it, nobody tests it, the team that owns the feature store does not know its consumers depend on a cache, and the next deploy that breaks the cache surfaces an outage that should have been a brownout.
- "Degradation can be invisible to clients." It usually should be visible, even if subtly. The x-degraded header (or trace tag, or custom metric label) is the cheapest contract — clients that care can react (refresh more aggressively, show a small "live results unavailable" banner, etc.); clients that do not care simply ignore the header and continue (see the client-side sketch after this list). Pure invisibility is a brownout — the service is degraded and nobody can tell. Pure visibility is overkill — every degradation pops a banner. The middle is correct.
- "Cached fallback is always safer than refusing." Stale data can be worse than no data. A KapitalKite portfolio dashboard that shows 11-minute-stale prices is dangerous — a user might place a buy at the cached price and discover the live price has moved 3% against them. A MealRush "popular near you" rail with 11-minute staleness is fine — restaurant rankings move slowly. The rule: cached fallback is only safe when the staleness is bounded and the bound is acceptable to the use case. For trading, ML inference on financial data, and inventory reservation, the bound is often "≤ 1 second" — almost no useful cache fits.
- "You can add degradation modes after launch." You can, but you will not. The team's mental model of the service hardens around the original mode-0 contract; downstream callers build assumptions on always-fresh always-personalised always-complete responses; adding a degraded mode later breaks those assumptions silently. Design the mode ladder before launch. Even if mode 1 is "return mode 0" (a no-op), the interface is in place for the day mode 1 needs to do real work.
- "Brownouts are easier to recover from than blackouts." They are easier on the on-call engineer (no page) and harder on the customer (no signal). The recovery is also harder because brownouts can become permanent — a degraded mode that ships from a postmortem is rarely removed, even after the original cause is fixed, because removing it is risky and nobody notices it is still on. Audit your degradation modes quarterly; remove the ones that no longer correspond to a known failure case.
Going deeper
The Hystrix mental model — and why Netflix retired it
Netflix's Hystrix library (open-sourced in 2012, retired in 2018) was the canonical degradation framework: every dependency was wrapped in a HystrixCommand with a getFallback() method, a circuit breaker, a timeout, and a thread-pool isolation boundary. The fallback was the degradation mechanism — when the primary call timed out or the circuit was open, the fallback ran. Hystrix's contribution was the first-class concept of a fallback as part of the API, not a hack: every team that wrote a HystrixCommand was forced to think about getFallback() at design time. Netflix retired Hystrix in favour of concurrency-limits and resilience4j because the thread-pool-per-dependency model added 200–800 µs of overhead per call, a meaningful fraction of the budget in a tight RPC chain. The idea survived: every modern resilience library (resilience4j, Polly, Sentinel) has fallback as a primary primitive, and the Hystrix wiki remains the best 30-minute introduction to the production discipline of declaring fallbacks alongside primaries.
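Hystrix itself is JVM-only, but the shape of the idea translates directly. A Python sketch of "fallback declared alongside the primary" (this is an analogue, not Hystrix's actual API; the timeout, fallback value, and feature-store stub are invented):
# Primary, timeout, and fallback are declared together, so the fallback is part of
# the API rather than an afterthought.
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

def command(timeout_s: float, fallback: Callable[[], T]):
    def wrap(primary: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        async def run(*args, **kwargs) -> T:
            try:
                return await asyncio.wait_for(primary(*args, **kwargs), timeout_s)
            except Exception:
                return fallback()          # a circuit breaker would also hook in here
        return run
    return wrap

@command(timeout_s=0.2, fallback=lambda: {"vec": None, "stale": True})
async def get_features(user_id: str) -> dict:
    await asyncio.sleep(0.5)               # simulated slow feature store
    return {"vec": [0.42] * 64, "stale": False}

print(asyncio.run(get_features("u1")))      # prints {'vec': None, 'stale': True}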
Google SRE on graceful degradation — error budgets and reliability tiers
Google's Site Reliability Engineering book (chapters 3 and 22 in particular) reframes degradation as an error budget question. The service has an SLO — say, 99.9% of requests succeed within 200 ms. The remaining 0.1% is the error budget. Degradation modes spend the budget with intent: instead of letting the budget be consumed by random 5xx failures, you spend it on intentional fallbacks (mode 1, mode 2) that the team has tested, monitored, and accepted as part of the contract. The advantage is that degradation paths get exercised — Google's load-balancer health-checks regularly route synthetic traffic into the degraded modes to verify they still work. This is the only way to keep the paths warm. Without this, the bug-bash phenomenon takes over: degradation paths only run during incidents, are full of bugs that nobody finds because they only fire under stress, and break the next time you need them.
"Reliable random" and shadow degradation
Some teams cannot easily synthesise traffic into degraded modes (e.g. the only way to trigger mode 1 is for the feature store to genuinely fail). The workaround is shadow degradation: 1% of production traffic is randomly routed into the degraded mode as a control group, the responses are computed, but the user still receives the mode-0 response. The shadow run validates that the degraded mode would have produced a valid answer; the user sees no impact. Discord has written publicly about this for their voice-server fallbacks. The cost is roughly 1% extra compute per dependency per shadow path; the benefit is that mode 1 is exercised every minute of every day in production, and a regression in the fallback path shows up as an immediate divergence between shadow and primary responses.
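A sketch of the shadow pattern; the 1% sampling rate, the overlap-based divergence test, and the stand-in response functions are all illustrative, and in production the shadow computation would usually run off the critical path.
# Shadow degradation: on a small sample of traffic, also compute the degraded answer,
# compare it to the primary, and emit a divergence signal. The user only ever sees
# the primary (mode-0) response.
import asyncio, random

SHADOW_RATE = 0.01

async def primary_response(user_id: str) -> list[str]:
    return [f"r{i}" for i in range(12)]            # stand-in mode-0 path

async def degraded_response(user_id: str) -> list[str]:
    return [f"r{i}" for i in range(12)]            # stand-in mode-1 path

def report_divergence(user_id: str, overlap: float) -> None:
    print(f"shadow divergence user={user_id} overlap={overlap:.2f}")

async def serve_with_shadow(user_id: str) -> list[str]:
    items = await primary_response(user_id)
    if random.random() < SHADOW_RATE:
        shadow = await degraded_response(user_id)  # keeps the fallback path warm daily
        overlap = len(set(items) & set(shadow)) / len(items)
        if overlap < 0.5:                          # likely regression in the fallback path
            report_divergence(user_id, overlap)
    return items                                   # user always gets the mode-0 answer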
When degradation is the wrong answer — fail fast and let the layer above decide
Some services should not degrade — they should 503 and let the caller decide. A payment authorisation service has no useful degraded mode: a fallback that returns "auth approved without checking the issuing bank" is fraud, not degradation. PaySetu's payment-status RPC degrades in latency (longer p99 acceptable under load) but never in correctness — every status returned must come from the issuer's authoritative system, not a cache. The principle: degradation is for services where partial information is more useful than no information; for services where partial information is dangerous (auth, settlement, inventory reservation, medical records), 503 is the correct answer and the caller's higher-level logic should decide what to do.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip  # no third-party dependencies; asyncio is in the standard library
# save degradation_demo.py from the article body
python3 degradation_demo.py
# Expected: phase 1 ~99% mode 0; phase 2 100% mode 1 (the seeded 400 ms samples flip the controller before the batch runs).
# The fall-through pattern is visible in the few mode-1 results during phase 1.
Where this leads next
Degradation modes compose with the rest of the reliability stack:
- Load shedding strategies — chapter 47; degradation tries to give some answer; shedding gives no answer when even the cheapest degradation cannot be afforded. Together they cover the full quality–availability axis.
- Circuit breakers (Hystrix, Sentinel) — chapter 43; circuit breakers are the trigger for many degradation modes — when the breaker for a dependency opens, the service automatically falls back to the cache or the static default.
- Timeouts and deadline propagation — chapter 45; deadline propagation tells each layer how much budget it has, which feeds the deadline-budget triggers from §3 of this article.
- Bulkheads — chapter 44; bulkheads isolate failure domains so that one degraded class does not pull down others. A bulkhead-isolated mode-1 cache pool can absorb fall-through traffic without starving mode-0 callers.
- Hedged requests for the long tail — chapter 49 (next); hedging is the dual of degradation — instead of a worse answer when you cannot afford the dependency, send two requests to the dependency and take whichever finishes first. Both pay for tail latency in different currencies.
The order at the service: request enters → deadline check → mode selection → primary call (or fallback per mode) → response with x-degraded signal. Mode selection runs in front of the primary call so that the request never starts work it cannot finish. The signal goes back to the caller so the caller can make its own decision about what to do with a degraded response. Both ends of the contract are explicit, both ends are tested, both ends are observable.
References
- Brown et al., "Hystrix: Latency and Fault Tolerance for Distributed Systems" (Netflix Tech Blog 2012) — the canonical fallback-as-API library; archived but the wiki is still the best introduction.
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (Google / O'Reilly 2016) — chapters 3, 22, and 25 on error budgets, graceful degradation, and how to spend the budget deliberately.
- Dean & Barroso, "The Tail at Scale" (CACM 2013) — completion deadlines, the budget-aware degradation pattern, and why p99 forces these decisions.
- resilience4j documentation — fallback module — modern JVM-side fallback library; the Hystrix successor.
- Discord engineering blog on shadow degradation testing — production patterns for keeping degradation paths exercised under normal load.
- Cloudflare blog: "Why we still build with Lua" (2017) — describes how Cloudflare's edge degrades when origin servers go away, the cache-as-truth pattern at scale.
- Hayashibara et al., "The φ Accrual Failure Detector" (SRDS 2004) — the failure-detector trigger that production degradation harnesses use; reading this changes how you set thresholds.
- Load shedding strategies — chapter 47; the partner pattern that handles the case where even mode-2 cannot keep up.