Timeouts and deadline propagation
PaySetu's payment-status RPC times out after 800 ms at the API gateway. The user has already closed the app. Inside the data centre, the request has hopped from gateway → router → payment-service → fraud-check → card-network-adapter → bank-rail. Each of those services has its own timeout — 750 ms, 700 ms, 600 ms, 500 ms, 400 ms — added "with a safety margin" by whoever wrote that service. None of them knows how much time the request has actually spent in the queue ahead of them. At the moment the gateway gives up and returns 504 to the user, the bank-rail call is at 280 ms of its 400 ms budget and still going. The fraud-check is doing a database lookup against a stale read replica that will return in another 90 ms, then the card-network-adapter will compute its response, then the response will travel back up the call chain — landing 220 ms after the gateway has already given up. Six hundred CPU-milliseconds across five services were spent producing a result that nobody is waiting for. Multiply by 14 000 RPS during a Diwali peak: even if only one request in a thousand is doomed, that is 8.4 CPU-seconds wasted per wall-second on work nobody will consume — enough to push every service in the chain over its scaling threshold and trigger a cascade.
A timeout is the maximum local wait; a deadline is a wall-clock instant beyond which the work is useless and should be abandoned. Without deadline propagation, every hop budgets independently and downstream services keep computing answers nobody is waiting for. With deadline propagation, the outermost caller's deadline travels with the request through every hop; each hop subtracts the elapsed time, refuses calls that cannot finish in time, and short-circuits doomed work. gRPC, Tonic, and modern HTTP frameworks all expose this primitive — the work is in using it correctly, not in adding it.
Why a per-hop timeout is the wrong abstraction
A timeout is a local property: "I will wait at most 600 ms for this RPC to complete." That is sound for a single hop. The trouble starts the moment the call traverses more than one hop, because each subsequent hop has no way to know how much of the original time budget remains. The classic mistake is the timeout-cascade — each layer adds a "safety margin" so that the outer call times out before the inner one does:
gateway timeout 800 ms
└── payment-service timeout 750 ms
└── fraud-check timeout 700 ms
└── card-net timeout 600 ms
└── bank-rail timeout 500 ms
This pattern is well-meant — it is trying to ensure the outer caller does not see an inner timeout as a 500 error — but it is wrong in three ways at once:
- Wasted budget on each hop. The outermost timeout is 800 ms; the inner-most service has only 500 ms to respond. The 300 ms gap is not "safety margin" — it is unused budget the inner service could have spent on retries or fallbacks. If anything, the inner-most service is the one that knows the most about its own latency profile and should be allowed to use the most of the budget, not the least.
- The queue ahead of the inner call is invisible. Suppose the request waits 200 ms in the payment-service's request queue before the handler thread even starts. By the time the inner bank-rail call begins, only 600 ms of the original 800 ms remains; the bank-rail's 500 ms timeout is now larger than the time the user is still willing to wait. The bank-rail can run for 500 ms and "succeed" — and the gateway has already returned 504 and disconnected.
- Doomed work continues after the user has given up. This is the most expensive failure. The gateway returns 504 to the user; the user closes the app; meanwhile the payment-service's handler thread is still parked inside the bank-rail RPC, the bank-rail is still talking to the bank, and the bank has still not committed. CPU and database connections are spent producing a result that has no consumer.
Why "doomed work" is the metric that matters: in a typical microservice mesh, the cost of a request is paid roughly equally across every hop. If 30% of requests are doomed (the outer caller has already given up), then 30% of every backend's CPU, every database connection, every thread is doing work whose result will be discarded. This is exactly the work that pushes services into the queueing-saturation regime — the regime where every additional 1% of doomed work increases p99 latency by far more than 1%. Doomed work compounds.
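The doomed-work arithmetic is worth making explicit; a two-line sketch (the function name is ours, the CPU-ms and RPS figures come from the scenario above):

```python
def doomed_cpu_seconds_per_second(rps, doomed_fraction, cpu_ms_per_request):
    """CPU-seconds per wall-clock second spent on work nobody is waiting for."""
    return rps * doomed_fraction * cpu_ms_per_request / 1000.0

# 14 000 RPS, 600 CPU-ms across the chain per doomed request
print(doomed_cpu_seconds_per_second(14_000, 0.001, 600))  # 1 in 1000 doomed -> 8.4
print(doomed_cpu_seconds_per_second(14_000, 0.30, 600))   # 30% doomed -> 2520.0
```

At a 30% doomed rate the fleet is burning 2 520 CPU-seconds per wall-second on discarded answers, which is why doomed work, not raw error rate, is the metric that predicts cascades.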
What deadline propagation actually does
A deadline is a wall-clock instant — typically encoded as deadline_unix_nanos = now() + budget_ms. The outermost caller computes it once, writes it into the request context, and every hop reads it back. At each hop:
1. Compute remaining = deadline - now(). If remaining ≤ 0, fail fast — do not even start the work.
2. If the call about to be made is known to take longer than remaining at p99, fail fast — do not start a doomed call.
3. Otherwise, set this hop's RPC timeout to min(local_max_timeout, remaining - reserve), where reserve is a small slack (10–50 ms) to allow the response to travel back.
4. Pass the same deadline (not a recomputed one) to any further downstream calls.
The critical part is step 4: the deadline does not get re-budgeted at each hop. It is a constant — the wall-clock instant the user gave up. Every hop reads it, every hop subtracts its own elapsed wait time from it, but no hop adds to it. This is what stops the cascade of safety margins.
gRPC has had this since 2016 — the grpc.WithTimeout (now grpc.WithDeadline) is encoded as a grpc-timeout header that propagates through every hop on the wire. Tonic (Rust gRPC) has it. Java's gRPC, Python's grpc.aio, Go's context.Context with WithDeadline. HTTP/2 and HTTP/3 have draft headers for it. Why this is a trace-context-shaped problem: the deadline is conceptually the same kind of thing as a trace-id — request-scoped state that has to traverse every hop, including hops that don't know about it. Anywhere your tracing system propagates a traceparent header, the same machinery should propagate a deadline header. If your service mesh propagates the trace and not the deadline, you have done half the work.
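To make the trace-context analogy concrete, here is a minimal sketch of carrying a deadline in a request header the way a traceparent travels. The x-deadline-ms header name is invented for illustration (gRPC's real wire format is the grpc-timeout header), but the shape is the same: a duration goes out on the wire, and the receiver converts it to an absolute monotonic deadline on arrival.

```python
import time

# `x-deadline-ms` is an invented header name for illustration; gRPC's real
# wire format is the `grpc-timeout` header, but the shape is identical.
DEADLINE_HEADER = "x-deadline-ms"

def inject_deadline(headers: dict, deadline_monotonic: float) -> dict:
    # outgoing side: encode the *remaining* budget as a duration, never as an
    # absolute timestamp, so clock skew between hops cannot bite
    remaining_ms = max(0, int((deadline_monotonic - time.monotonic()) * 1000))
    headers[DEADLINE_HEADER] = str(remaining_ms)
    return headers

def extract_deadline(headers: dict, default_budget_ms: int = 30_000) -> float:
    # incoming side: anchor the duration to this hop's own monotonic clock
    remaining_ms = int(headers.get(DEADLINE_HEADER, default_budget_ms))
    return time.monotonic() + remaining_ms / 1000.0

headers = inject_deadline({"traceparent": "00-abc123-def456-01"},
                          time.monotonic() + 0.8)
print(headers[DEADLINE_HEADER])  # roughly "800", shrinking as time passes
```

Note that the deadline rides in the same header dict as the trace context; any middleware that forwards one can forward the other.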
Building it — a working deadline-propagation context
# deadline_propagation.py — context-carried deadline through 4 hops, with abandonment
import time, asyncio, random

class Deadline:
    def __init__(self, budget_ms):
        self.deadline = time.monotonic() + budget_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

    def expired(self):
        return time.monotonic() >= self.deadline

    def child_timeout(self, reserve_ms=20):
        # reserve a small slack for the response to travel back up the call chain
        return max(0.0, self.remaining_ms() - reserve_ms)

class DeadlineExceeded(Exception):
    pass

async def rpc(name, p50_ms, p99_ms, dl: Deadline):
    if dl.expired():
        raise DeadlineExceeded(f"{name} not started — deadline already past")
    budget = dl.child_timeout()
    if budget < p50_ms:
        # known too slow even at p50 — abandon early instead of guessing
        raise DeadlineExceeded(f"{name} skipped — only {budget:.0f}ms left, p50 is {p50_ms}ms")
    actual = random.uniform(p50_ms * 0.7, p99_ms)
    try:
        await asyncio.wait_for(asyncio.sleep(actual / 1000.0), timeout=budget / 1000.0)
        return f"{name} ok in {actual:.0f}ms (had {budget:.0f}ms)"
    except asyncio.TimeoutError:
        raise DeadlineExceeded(f"{name} timed out at {budget:.0f}ms")

async def gateway_request(total_ms):
    dl = Deadline(total_ms)
    try:
        await rpc("payment-service", p50_ms=40, p99_ms=180, dl=dl)
        await rpc("fraud-check", p50_ms=30, p99_ms=140, dl=dl)
        await rpc("card-net", p50_ms=80, p99_ms=320, dl=dl)
        await rpc("bank-rail", p50_ms=120, p99_ms=480, dl=dl)
        return "OK"
    except DeadlineExceeded as e:
        return f"ABANDONED: {e} (remaining={dl.remaining_ms():.0f}ms)"

async def main():
    random.seed(7)
    for budget in (1500, 800, 400, 250):
        out = await gateway_request(budget)
        print(f"budget={budget}ms -> {out}")

asyncio.run(main())
Sample run on Python 3.11:
budget=1500ms -> OK
budget=800ms -> OK
budget=400ms -> ABANDONED: bank-rail skipped — only 102ms left, p50 is 120ms (remaining=102ms)
budget=250ms -> ABANDONED: card-net skipped — only 117ms left, p50 is 80ms (remaining=117ms)
Per-line walkthrough. Deadline(budget_ms) captures monotonic() + budget exactly once at the top of the request — every hop reads from this same object. remaining_ms() computes how much wall-clock budget is left right now; this is what every downstream call uses to size its own timeout. child_timeout(reserve_ms=20) returns remaining - reserve so that 20 ms is left for the response to come back — without this slack, the inner-most call uses every microsecond of the budget and the outer caller times out before the response arrives.
The if budget < p50_ms check is the most important line: it is the abandon-doomed-work heuristic. If the remaining budget is less than the p50 latency of the next call, the call is more than 50% likely to miss the deadline; refusing to start it saves the CPU and connection that would have been spent producing a doomed answer.
Read the result: at 1500 ms and 800 ms budgets, all four hops complete; at 400 ms, the bank-rail (p50 = 120 ms) is correctly skipped because only 102 ms remain; at 250 ms, the card-net is skipped earlier in the chain.
Why "abandon when remaining < p50" beats "let the call run and time out": if you let it run, the call will probably time out anyway, and meanwhile it will hold a connection-pool slot, a thread, a database lock, and a tracing context. The cost of a timed-out call is paid in full — you get the resource consumption and the failure. Abandoning at p50 turns a probably-doomed call into a definitely-skipped call, with the saved resources available for the next request.
Where the reserve goes — and why a single deadline can still drift
The deadline-propagation contract has one subtle assumption: that the wall-clock cost of propagating the deadline is small relative to the deadline itself. This holds when the network RTT between hops is in the millisecond range and the deadline is hundreds of milliseconds. It breaks in two places:
Cross-region calls. A 200 ms cross-region RTT eats half the budget on the outbound leg alone. Reserve calculations have to account for the return leg too: if the caller's deadline is 800 ms and the cross-region one-way is 200 ms, the inner hop has 800 − 200 − 200 = 400 ms to actually do the work. Sized-for-LAN reserves (20 ms) silently turn into negative-time budgets in cross-region calls. The fix is to subtract a measured RTT, not a fixed reserve.
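A sketch of that fix, assuming the RTT is tracked with a simple exponentially weighted moving average (the RttTracker name and the smoothing factor are illustrative, not from any particular library):

```python
class RttTracker:
    """Keeps an exponentially weighted moving average of observed RTTs."""
    def __init__(self, initial_rtt_ms=20.0, alpha=0.2):
        self.rtt_ms = initial_rtt_ms
        self.alpha = alpha

    def observe(self, sample_ms):
        # fold each measured round trip into the running estimate
        self.rtt_ms = (1 - self.alpha) * self.rtt_ms + self.alpha * sample_ms

    def child_timeout(self, remaining_ms):
        # subtract a full round trip: outbound leg plus return leg
        return max(0.0, remaining_ms - 2 * self.rtt_ms)

rtt = RttTracker(initial_rtt_ms=200.0)   # cross-region link
print(rtt.child_timeout(800.0))          # 800 - 2*200 = 400.0
```

A fixed 20 ms reserve on the same link would have handed the inner hop a 780 ms timeout it can never honour; the measured reserve hands it the 400 ms it actually has.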
Clock skew. If hop A and hop B have wall-clocks that disagree by 50 ms, and the deadline is encoded as a wall-clock timestamp, hop B's view of the remaining budget is off by 50 ms. The fix is to encode the deadline as a duration from now (the gRPC choice — grpc-timeout: 800m means "this many milliseconds from when you receive this header"), not as an absolute timestamp. The duration form is naturally clock-skew-immune because each hop computes its own absolute deadline using its own clock the moment the request arrives.
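A minimal encoder and parser for the duration form; the unit suffixes (H, M, S, m, u, n) follow the gRPC HTTP/2 transport spec's grpc-timeout format, while the function names are ours and the sketch skips the spec's digit-count limits:

```python
import time

# grpc-timeout values are ASCII digits plus a unit suffix, e.g. "800m"
_UNITS_TO_SECONDS = {"H": 3600.0, "M": 60.0, "S": 1.0,
                     "m": 1e-3, "u": 1e-6, "n": 1e-9}

def encode_timeout_ms(ms: int) -> str:
    # sender side: the remaining budget as a duration, e.g. 800 -> "800m"
    return f"{ms}m"

def parse_timeout(value: str) -> float:
    """Seconds represented by a grpc-timeout-style value like '800m'."""
    return int(value[:-1]) * _UNITS_TO_SECONDS[value[-1]]

def deadline_on_arrival(value: str) -> float:
    # each hop anchors the duration to its OWN clock the moment the header is
    # parsed, which is what makes the scheme clock-skew-immune
    return time.monotonic() + parse_timeout(value)
```

Because the wire carries only a duration, a 50 ms disagreement between the two hops' wall clocks changes nothing: each side measures "from now" with its own monotonic clock.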
Synchronous fan-out. The whole "subtract elapsed" model assumes a linear call chain. When a service makes 5 parallel calls, all 5 share the same remaining budget; they do not get to spend it independently. If you naively pass the same Deadline object to each, you get the right behaviour — they each see "300 ms remaining" and each cap their call at 280 ms. The wrong implementation hands each fanout a fresh 300 ms budget, which is fine until one of them calls something that takes 290 ms, and now the response window for collecting the results is gone.
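The right behaviour falls out naturally when the parallel branches literally share one deadline object. A sketch using asyncio.gather (the branch names and 10 ms simulated RPC are illustrative):

```python
import asyncio, time

class Deadline:
    # same shape as the Deadline class in the article body
    def __init__(self, budget_ms):
        self.deadline = time.monotonic() + budget_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

async def branch_call(name, dl, reserve_ms=20):
    # every branch sizes its timeout from the SAME shared deadline object
    budget_s = max(0.0, dl.remaining_ms() - reserve_ms) / 1000.0
    await asyncio.wait_for(asyncio.sleep(0.01), timeout=budget_s)  # 10 ms "RPC"
    return name

async def fan_out(budget_ms, n=5):
    dl = Deadline(budget_ms)  # one deadline shared by all parallel branches
    return await asyncio.gather(*(branch_call(f"branch-{i}", dl) for i in range(n)))

print(asyncio.run(fan_out(300)))
```

Handing each branch a fresh Deadline(300) here would compile and run just as happily, which is exactly why this bug survives review: the failure only shows up when one branch burns the window the caller needed for collecting the results.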
Async retries inside the budget. Retries (chapter 42) must run inside the deadline, not outside it. A naive retry policy — "retry up to 3 times with 100 ms backoff" — silently triples the wall-clock budget if the caller's deadline is not consulted between attempts. The right pattern is: each retry checks dl.remaining_ms() before scheduling; if the next attempt's expected p50 exceeds remaining, the retry is skipped and the failure is returned immediately. This is the "retry budget" idea — the deadline is the budget; the retry policy is a strategy for spending it.
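A hedged sketch of a deadline-aware retry loop; retry_within_deadline and its p50 check are our illustration of the retry-budget idea, not any specific library's API:

```python
import time

class Deadline:
    # same shape as the Deadline class in the article body
    def __init__(self, budget_ms):
        self.deadline = time.monotonic() + budget_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

def retry_within_deadline(attempt_fn, dl, p50_ms, max_attempts=3, backoff_ms=100):
    # every attempt, and every backoff sleep, is paid out of the shared deadline
    last_err = None
    for n in range(max_attempts):
        if dl.remaining_ms() < p50_ms:
            break  # next attempt probably cannot finish; surface the failure now
        try:
            return attempt_fn()
        except Exception as e:
            last_err = e
            sleep_ms = min(backoff_ms * (2 ** n), dl.remaining_ms())
            time.sleep(sleep_ms / 1000.0)
    raise TimeoutError(f"gave up with {dl.remaining_ms():.0f}ms of budget left") from last_err

# with a healthy budget the first successful attempt returns immediately
print(retry_within_deadline(lambda: "ok", Deadline(500), p50_ms=10))  # ok
```

With a 150 ms budget, a failing first attempt plus its 100 ms backoff leaves less than the 100 ms p50, so the second attempt is skipped rather than launched doomed.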
Common confusions
- "A timeout and a deadline are the same thing." They are not. A timeout is a duration ("wait at most 800 ms from now"); a deadline is a wall-clock instant ("give up at 16:42:01.847 IST"). The deadline is propagation-friendly — every hop reads the same value and computes its own remaining budget. The timeout is not — every hop has to recompute it locally and the recomputation has no shared anchor. Production systems use timeouts internally (because that is what select and wait_for accept) but propagate deadlines on the wire (because that is what survives hop boundaries).
- "gRPC's WithTimeout already does deadline propagation." Yes — gRPC's WithDeadline (and the older WithTimeout, which calls WithDeadline internally) writes a grpc-timeout header that the receiving server reads, converts to its own absolute deadline, and propagates further down. The work for an application engineer is using it correctly: starting the deadline at the outer boundary (gateway, not first internal service), and not overwriting it at intermediate hops. The pattern that breaks deadline propagation is ctx, cancel := context.WithTimeout(parentCtx, 500*time.Millisecond) inside an internal service — that replaces the deadline with a fresh 500 ms instead of inheriting the parent's.
- "Set every internal service's timeout to be a little less than the caller's." This is the timeout-cascade anti-pattern. It tries to prevent inner timeouts from propagating to the outer caller, but it cannot account for queue-wait time and produces wasted budgets at every hop. The correct pattern is: set local timeouts to the upper bound of what the service can ever do (e.g. "no RPC may exceed 5 seconds, ever"), and let deadline propagation pick the actual operative timeout from the deadline.
- "Deadline propagation is a service-mesh concern." A service mesh can carry the deadline header for you (Istio, Linkerd, Consul Connect all do), but the abandonment logic — checking dl.expired() before doing work, refusing to start calls whose p50 exceeds the remaining budget — is application code. The mesh propagates; the application enforces. Mesh-only deadline propagation will still let a service compute a 480 ms answer that the user has already given up on, because the mesh cannot decide whether your business logic is worth running for the remaining 100 ms.
- "Cancellation and deadline are the same thing." Closely related, but not the same. A deadline is time-based abandonment; a cancellation is external-event abandonment ("the user pressed Cancel", "the parent request failed and there is no point continuing"). Both produce the same effect — abandon the in-flight work — and both should plumb through the same context.Context / CancellationToken. But a request can be cancelled before the deadline (parent failed) or run past the deadline without explicit cancellation (deadline-only). The reliable pattern is: every long-running operation listens for both signals from the same context.
- "The deadline header should be an absolute timestamp." This is the trap that makes deadline propagation clock-skew-fragile. Encode it as a duration (grpc-timeout: 800m — "800 milliseconds from when you receive this") and let each hop convert it to an absolute monotonic deadline using its own clock the moment it parses the header. Absolute-timestamp encodings break silently on the day NTP drifts by 50 ms, and they break catastrophically when virtual machines pause and resume across time-sync boundaries.
Going deeper
The Google Stubby legacy — where deadline propagation came from
Google's internal RPC system Stubby (the predecessor of gRPC) had deadline propagation from its earliest deployments because Google's production environment hit the doomed-work problem at a scale nobody else had: a single Search query at peak fanned out to ~1000 backend servers, and even a 1% rate of doomed work at every hop cumulatively wasted megawatts of compute. The deadline-propagation primitive was originally designed by Mike Burrows's group around the same time as Chubby (2006) for similar reasons — at fan-out scale, "every hop budgets independently" becomes catastrophically wasteful. When gRPC was open-sourced in 2015, deadline propagation was not a new feature — it was the most-load-bearing, battle-tested primitive being externalised. Jeff Dean and Luiz André Barroso's "The Tail at Scale" (CACM 2013) makes the case explicitly: at high fan-out, overall latency is dominated by the slowest responders, so systems must be able to cancel or shed work the upstream caller is no longer waiting for.
The right place to start the deadline — the user-facing edge, not the first internal service
A common mistake is to start the deadline at the first internal service rather than at the user-facing edge. KapitalKite had this for two years: the API gateway was treated as a pass-through (no deadline), and the first internal service (order-service) created the deadline. The result was that any time spent in the gateway — TLS handshake, auth check, rate-limit check, request body parsing — was invisible to the deadline. During market open, the gateway's own queue depth grew to 200 ms; the order-service started its 800 ms deadline only after the gateway's 200 ms was already gone, but the user had been waiting 1000 ms by then. Fix: the gateway started the deadline immediately on accept, which exposed the gateway's own queueing as ≈25% of the request budget at peak. The CEO-facing dashboard moved from "order-service p99 = 850 ms" to "gateway-to-bank p99 = 1100 ms" — the same wall-clock latency, now correctly attributed.
context.Context in Go is not magic — it works because every hop opts in
Go's context.Context has become almost a religious convention in the ecosystem: every function takes ctx as the first argument, every blocking call respects ctx.Done(). It looks like the language is doing the work, but it isn't — the language has no special knowledge of contexts. The whole pattern works because every standard library function and every reasonable third-party library opts in. database/sql checks ctx.Done() between query execution and result fetching. net/http cancels in-flight requests when ctx.Done() fires. grpc-go checks the deadline before sending the request. The pattern only works if you opt in too — os.ReadFile does not take a context, and an os.ReadFile of a 200 GB file in your hot path will run to completion regardless of the deadline because it has no way to listen. The deadline is only as good as the operation that listens for it.
Why this matters operationally: any time you write a function that does I/O without taking a ctx (or Deadline), you have inserted a non-cancellable region into the call chain. The deadline that propagates around your function is not propagating through your function. The first sign of this in production is a long tail of requests that complete after the deadline because they were stuck in such a region; the fix is always "thread the context through".
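The same non-cancellable-region failure is easy to demonstrate in Python: a blocking call pushed to a worker thread keeps running after the awaiting side's timeout fires (an illustrative sketch, not a fix):

```python
import asyncio, time

progress = {"finished": False}

def blocking_read():
    # stands in for os.ReadFile-style code that never checks a deadline
    time.sleep(0.2)
    progress["finished"] = True
    return "200 GB of data"

async def handler():
    loop = asyncio.get_running_loop()
    fut = loop.run_in_executor(None, blocking_read)
    try:
        # the caller's 50 ms budget expires long before the read returns
        return await asyncio.wait_for(fut, timeout=0.05)
    except asyncio.TimeoutError:
        return "gave up"

async def main():
    result = await handler()
    await asyncio.sleep(0.3)  # wait long enough to observe the leaked work
    # the deadline fired, yet the blocking work ran to completion anyway
    return result, progress["finished"]

print(asyncio.run(main()))  # ('gave up', True)
```

The awaiting coroutine is cancelled cleanly at 50 ms, but the thread inside run_in_executor has no way to hear about it; the work completes and its result is discarded, which is precisely the doomed-work pattern the chapter opened with.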
CricStream's promo-engine outage — the "missing 80 ms" post-mortem
CricStream's match-page service had deadline propagation enabled in 2025 but missed one hop: the legacy ad-decision service still wrote to MySQL via a JDBC driver that did not respect ctx.Done(). During a final at 19:42 IST, the MySQL primary had a brief 1.4 s replication-lag spike. The ad-decision service's deadline-aware code saw 200 ms of remaining budget and called the JDBC executeQuery — which then ran for 1.4 s, ignoring the deadline entirely because the driver lacked cancellation. Every other hop did abandon at the deadline; only the JDBC layer kept running. The match-page service's threads piled up inside the JDBC call until the bulkhead saturated; the cascade reached the gateway in 11 seconds. The post-mortem identified the root cause as "deadline propagation works only as well as the slowest non-cancellable operation in the chain". The fix was the MariaDB Connector/J upgrade (which respects Statement.setQueryTimeout() and the calling thread's interrupt) plus a per-query upper-bound timeout that the deadline could clamp downward. Total user-visible impact: 4.6 minutes of degraded match-page rendering for ~3.1 M concurrent viewers; ad revenue loss was estimated at ₹84 lakh. The lesson the SRE team distilled: audit every I/O library in the call chain for cancellation support; one non-cancellable operation defeats the whole pattern.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# no pip installs needed — asyncio, random and time are all in the standard library
# save deadline_propagation.py from the article body
python3 deadline_propagation.py
# Expected: 1500ms and 800ms budgets succeed; 400ms and 250ms abandon early.
# Try setting the seed to other values; the abandonment point shifts because
# upstream hops sometimes take longer, leaving less for the bank-rail.
Where this leads next
Deadline propagation is the spine of the failure-isolation stack. Every other reliability pattern in this section depends on it:
- Retries with exponential backoff and jitter — chapter 42; without a deadline, retries multiply doomed work. Retries should run inside the deadline, not outside it.
- Circuit breakers (Hystrix, Sentinel) — chapter 43; the breaker says "stop calling sick backends", the deadline says "stop computing answers nobody wants".
- Bulkheads — chapter 44; bulkheads cap the in-flight count, deadlines bound the time each in-flight call can hold a permit.
- Hedged requests for the long tail — chapter 47; hedged requests share the deadline with the original call, not double it.
- Idempotency, exactly-once myth, idempotent receivers — chapter 46; abandoning doomed work only works if the receiver is idempotent under partial completion.
- Distributed tracing — spans, baggage, sampling — chapter 132; the same propagation machinery carries both the trace context and the deadline.
The composition is deadline → bulkhead → breaker → retry → timeout → RPC. The deadline is outermost because every other layer needs to see how much budget is left. A bulkhead permit released early because of an expired deadline; a circuit breaker that does not even attempt the half-open probe call when remaining < p50; a retry that skips its second attempt because the budget cannot fit it — these are all small wins on a single request, and large wins at fleet scale during peak load.
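That composition can be sketched as a nest of checks, outermost first; the breaker and bulkhead below are trivial stand-ins for the real patterns of chapters 43 and 44, and the retry layer is omitted for brevity:

```python
import time

class Deadline:
    # same shape as the Deadline class in the article body
    def __init__(self, budget_ms):
        self.deadline = time.monotonic() + budget_ms / 1000.0

    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

LOCAL_MAX_MS = 5000  # "no RPC may exceed 5 seconds, ever"

def call(dl, breaker_closed, bulkhead_permits, p50_ms, rpc_fn):
    # deadline -> bulkhead -> breaker -> timeout -> RPC, outermost check first
    if dl.remaining_ms() < p50_ms:
        return "abandoned: budget < p50"    # deadline: never start doomed work
    if bulkhead_permits <= 0:
        return "rejected: bulkhead full"    # bulkhead: cap the in-flight count
    if not breaker_closed:
        return "rejected: breaker open"     # breaker: don't call a sick backend
    timeout_ms = min(LOCAL_MAX_MS, dl.remaining_ms() - 20)  # deadline clamps local max
    return rpc_fn(timeout_ms)

print(call(Deadline(300), True, 1, 120, lambda t: f"rpc sent with {t:.0f}ms timeout"))
```

The ordering is the point: the cheapest check (is this work already doomed?) runs before any resource is taken, and the operative RPC timeout is whichever is smaller, the local upper bound or the remaining budget.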
A senior PaySetu SRE summarised the discipline this way: every request enters the data centre carrying a clock; every service it touches reads the clock and decides whether the work it is about to do will be useful when finished. The clock is the deadline. Every operation in the call chain is a vote on whether to keep spending against it.
References
- Jeff Dean & Luiz André Barroso, "The Tail at Scale" (CACM 2013) — the canonical case for deadline-propagation at fanout scale; argues that aggregate latency at high fan-out is dominated by the slowest backend unless the system can shed doomed work.
- gRPC documentation — Deadlines — the operative guide to using WithDeadline correctly across languages; covers the grpc-timeout header wire format.
- Go context package documentation — the canonical implementation pattern for deadline-bearing request contexts.
- Dave Cheney, "Context isn't for cancellation" (2017) — the often-cited push-back that clarifies what context should and should not carry.
- Cindy Sridharan, "Distributed Systems Observability" (O'Reilly, 2018) — covers context propagation; the deadline travels with the trace.
- AWS Builders' Library — "Timeouts, retries and backoff with jitter" — production-side write-up; argues for deadline-as-duration on the wire to avoid clock-skew issues.
- Bulkheads — chapter 44; the partner pattern that caps the in-flight count.
- Circuit breakers (Hystrix, Sentinel) — chapter 43; deadlines abandon doomed work, breakers stop sending calls to sick backends.