Timeouts and deadline propagation

PaySetu's payment-status RPC times out after 800 ms at the API gateway. The user has already closed the app. Inside the data centre, the request has hopped from gateway → payment-service → fraud-check → card-network-adapter → bank-rail. Each of those services has its own timeout — 750 ms, 700 ms, 600 ms, 500 ms — added "with a safety margin" by whoever wrote that service. None of them knows how much time the request has actually spent in the queue ahead of it. At the moment the gateway gives up and returns 504 to the user, the bank-rail call is at 280 ms of its 500 ms budget and still going. When the bank-rail finally answers, the card-network-adapter will compute its response, the fraud-check will finish a database lookup against a stale read replica that takes another 90 ms, and the response will travel back up the call chain — landing 220 ms after the gateway has already given up. Six hundred CPU-milliseconds across five services were spent producing a result that nobody is waiting for. Multiply by 14 000 RPS during a Diwali peak and you are wasting 8 400 CPU-seconds per wall-second on doomed work — enough to push every service in the chain over its scaling threshold and trigger a cascade.

A timeout is the maximum local wait; a deadline is a wall-clock instant beyond which the work is useless and should be abandoned. Without deadline propagation, every hop budgets independently and downstream services keep computing answers nobody is waiting for. With deadline propagation, the outermost caller's deadline travels with the request through every hop; each hop subtracts the elapsed time, refuses calls that cannot finish in time, and short-circuits doomed work. gRPC, Tonic, and modern HTTP frameworks all expose this primitive — the work is in using it correctly, not in adding it.

Why a per-hop timeout is the wrong abstraction

A timeout is a local property: "I will wait at most 600 ms for this RPC to complete." That is sound for a single hop. The trouble starts the moment the call traverses more than one hop, because each subsequent hop has no way to know how much of the original time budget remains. The classic mistake is the timeout-cascade — each layer adds a "safety margin" so that the outer call times out before the inner one does:

gateway     timeout 800 ms
 └── payment-service     timeout 750 ms
      └── fraud-check    timeout 700 ms
           └── card-net  timeout 600 ms
                └── bank-rail   timeout 500 ms

This pattern is well-meant — it is trying to ensure the outer caller does not see an inner timeout as a 500 error — but it is wrong in three ways at once:

  1. Wasted budget on each hop. The outermost timeout is 800 ms; the innermost service has only 500 ms to respond. The 300 ms gap is not "safety margin" — it is unused budget the inner service could have spent on retries or fallbacks. If anything, the innermost service is the one that knows the most about its own latency profile and should be allowed to use most of the budget, not the least.

  2. The queue ahead of the inner call is invisible. Suppose the request waits 200 ms in the payment-service's request queue before the handler thread even starts. Only 600 ms of the original 800 ms remains — and the intervening fraud-check and card-net hops consume more before the bank-rail call begins. The bank-rail's 500 ms timeout is now larger than the time the user is still willing to wait: it can run for its full 500 ms and "succeed" — after the gateway has already returned 504 and disconnected.

  3. Doomed work continues after the user has given up. This is the most expensive failure. The gateway returns 504 to the user; the user closes the app; meanwhile the payment-service's handler thread is still parked inside the bank-rail RPC, the bank-rail is still talking to the bank, the bank has still not committed. CPU and database connections are spent producing a result that has no consumer.

Why "doomed work" is the metric that matters: in a typical microservice mesh, the cost of a request is paid roughly equally across every hop. If 30% of requests are doomed (the outer caller has already given up), then 30% of every backend's CPU, every database connection, every thread is doing work whose result will be discarded. This is exactly the work that pushes services into the queueing-saturation regime — the regime where every additional 1% of doomed work increases p99 latency by far more than 1%. Doomed work compounds.

Figure: Per-hop timeouts vs deadline propagation — a 5-hop call (two stacked timelines, t = 0 to t = 800 ms; per-hop timeouts on top, from gateway 800 ms down to bank-rail 500 ms; a single propagated 800 ms deadline below).
Illustrative — not measured. Top: per-hop timeouts let inner hops keep running after the user has disconnected. Bottom: deadline propagation gives every hop the same wall-clock deadline; each hop computes its remaining budget and abandons doomed work.
Figure: Doomed-work fraction vs offered RPS — PaySetu's 5-hop checkout call (line chart, 0–16 000 RPS against doomed-work fraction 0–50%; without propagation the curve reaches ≈42% at 14 000 RPS, with propagation it stays under ≈2%; a vertical band marks the cascade-trigger zone where the no-propagation curve crosses 20%, and an annotation marks 12 000 RPS where the gateway p99 hits 800 ms).
Illustrative — not measured data. Without deadline propagation, the doomed-work fraction stays low until queues start to grow at every hop, then rises rapidly as upstream waits push more requests past their wall-clock budget. With propagation, hops abandon doomed work at the boundary, keeping the fraction near zero across the load range.

What deadline propagation actually does

A deadline is a wall-clock instant — typically encoded as deadline_unix_nanos = now_nanos() + budget_ms * 1_000_000. The outermost caller computes it once, writes it into the request context, and every hop reads it back. At each hop:

  1. Compute remaining = deadline - now(). If remaining ≤ 0, fail fast — do not even start the work.
  2. If the call about to be made is unlikely to finish in the time left — the implementation below uses its p50 latency as the threshold; a stricter policy could use a higher percentile — fail fast and do not start a doomed call.
  3. Otherwise, set this hop's RPC timeout to min(local_max_timeout, remaining - reserve) where reserve is a small slack (10–50 ms) to allow the response to travel back.
  4. Pass the same deadline (not a recomputed one) to any further downstream calls.

The critical part is step 4: the deadline does not get re-budgeted at each hop. It is a constant — the wall-clock instant the user gave up. Every hop reads it, every hop subtracts its own elapsed wait time from it, but no hop adds to it. This is what stops the cascade of safety margins.

gRPC has had this since it was open-sourced — a call deadline (set via context.WithDeadline in Go, withDeadlineAfter in Java, or a timeout argument in Python) is encoded as a grpc-timeout header that propagates through every hop on the wire. Tonic (Rust gRPC) has it too, as do grpc-java, Python's grpc.aio, and Go's grpc-go. HTTP/2 and HTTP/3 have draft headers for it. Why this is a trace-context-shaped problem: the deadline is conceptually the same kind of thing as a trace-id — request-scoped state that has to traverse every hop, including hops that don't know about it. Anywhere your tracing system propagates a traceparent header, the same machinery should propagate a deadline header. If your service mesh propagates the trace and not the deadline, you have done half the work.
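
For concreteness, a sketch of the server side in gRPC Python. time_remaining() and the per-call timeout= argument are the real API surface; the service class, the Score method, and fraud_stub are hypothetical generated names:

# Sketch only: payments_pb2_grpc, PaymentServicer, Score, and fraud_stub
# are hypothetical. grpc.ServicerContext.time_remaining() returns the
# seconds left on the caller's deadline, decoded from the grpc-timeout
# header, or None if the caller set no deadline.
import grpc

class PaymentService(payments_pb2_grpc.PaymentServicer):  # hypothetical base class
    def CheckPayment(self, request, context):
        remaining = context.time_remaining()            # seconds, or None
        if remaining is not None and remaining < 0.05:  # 50 ms reserve
            context.abort(grpc.StatusCode.DEADLINE_EXCEEDED,
                          "insufficient budget at payment-service")
        # timeout= re-encodes the remaining budget as a fresh grpc-timeout
        # header on the outgoing hop, so the deadline keeps travelling
        return self.fraud_stub.Score(request, timeout=remaining)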

Building it — a working deadline-propagation context

# deadline_propagation.py — context-carried deadline through 4 hops, with abandonment
import time, asyncio, random

class Deadline:
    def __init__(self, budget_ms):
        self.deadline = time.monotonic() + budget_ms / 1000.0
    def remaining_ms(self):
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)
    def expired(self):
        return time.monotonic() >= self.deadline
    def child_timeout(self, reserve_ms=20):
        # reserve a small slack for response travel back up the call chain
        return max(0.0, self.remaining_ms() - reserve_ms)

class DeadlineExceeded(Exception): pass

async def rpc(name, p50_ms, p99_ms, dl: Deadline):
    if dl.expired():
        raise DeadlineExceeded(f"{name} not started — deadline already past")
    budget = dl.child_timeout()
    if budget < p50_ms:
        # known too slow even at p50 — abandon early instead of guessing
        raise DeadlineExceeded(f"{name} skipped — only {budget:.0f}ms left, p50 is {p50_ms}ms")
    actual = random.uniform(p50_ms * 0.7, p99_ms)
    try:
        await asyncio.wait_for(asyncio.sleep(actual / 1000.0), timeout=budget / 1000.0)
        return f"{name} ok in {actual:.0f}ms (had {budget:.0f}ms)"
    except asyncio.TimeoutError:
        raise DeadlineExceeded(f"{name} timed out at {budget:.0f}ms")

async def gateway_request(total_ms):
    dl = Deadline(total_ms)
    try:
        await rpc("payment-service", p50_ms=40,  p99_ms=180, dl=dl)
        await rpc("fraud-check",     p50_ms=30,  p99_ms=140, dl=dl)
        await rpc("card-net",        p50_ms=80,  p99_ms=320, dl=dl)
        await rpc("bank-rail",       p50_ms=120, p99_ms=480, dl=dl)
        return "OK"
    except DeadlineExceeded as e:
        return f"ABANDONED: {e}  (remaining={dl.remaining_ms():.0f}ms)"

async def main():
    random.seed(7)
    for budget in (1500, 800, 400, 250):
        out = await gateway_request(budget)
        print(f"budget={budget}ms -> {out}")

asyncio.run(main())

Sample run on Python 3.11 (illustrative — exact timings depend on the seed and Python version):

budget=1500ms -> OK
budget=800ms  -> OK
budget=400ms  -> ABANDONED: bank-rail skipped — only 102ms left, p50 is 120ms  (remaining=122ms)
budget=250ms  -> ABANDONED: card-net skipped — only 63ms left, p50 is 80ms  (remaining=83ms)

Per-line walkthrough. Deadline(budget_ms) captures monotonic() + budget exactly once at the top of the request — every hop reads from this same object. remaining_ms() computes how much wall-clock budget is left right now; this is what every downstream call uses to size its own timeout. child_timeout(reserve_ms=20) returns remaining - reserve so 20 ms is left for the response to come back — without this slack, the innermost call uses every microsecond of the budget and the outer caller times out before the response arrives.

The if budget < p50_ms check is the most important line: it is the abandon-doomed-work heuristic. If the remaining budget is less than the p50 latency of the next call, the call is more than 50% likely to miss the deadline; refusing to start it saves the CPU and connection that would have been spent producing a doomed answer.

Read the result: at 1500 ms and 800 ms budgets, all four hops complete; at 400 ms, the bank-rail (p50 = 120 ms) is correctly skipped because only 102 ms of child budget remain; at 250 ms, the card-net is skipped earlier in the chain.

Why "abandon when remaining < p50" beats "let the call run and time out": if you let it run, the call will probably time out anyway, and meanwhile it will hold a connection-pool slot, a thread, a database lock, and a tracing context. The cost of a timed-out call is paid in full — you get the resource consumption and the failure. Abandoning at p50 turns a probably-doomed call into a definitely-skipped call, with the saved resources available for the next request.

Where the reserve goes — and why a single deadline can still drift

The deadline-propagation contract has one subtle assumption: that the wall-clock cost of propagating the deadline is small relative to the deadline itself. This holds when the network RTT between hops is in the millisecond range and the deadline is hundreds of milliseconds. It breaks in two places:

Cross-region calls. A 200 ms one-way cross-region latency — a 400 ms round trip — eats half of an 800 ms budget in transit alone. Reserve calculations have to account for the return leg too: if the caller's deadline is 800 ms and the cross-region one-way is 200 ms, the inner hop has 800 − 200 − 200 = 400 ms to actually do the work. A reserve sized for a LAN (20 ms) silently leaves the 200 ms return leg uncovered: the inner hop finishes "within budget" and its response still lands long after the caller gave up. The fix is to subtract a measured RTT, not a fixed reserve; a sketch follows below.
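
A sketch of that fix, extending the Deadline class from the listing above; measured_rtt_ms is assumed to come from your own per-peer RTT tracking:

def child_timeout_cross_region(dl: Deadline, measured_rtt_ms: float,
                               floor_ms: float = 20.0) -> float:
    # subtract a measured round-trip instead of a fixed LAN-sized reserve;
    # never drop below the LAN floor so local calls keep their slack
    reserve = max(measured_rtt_ms, floor_ms)
    return max(0.0, dl.remaining_ms() - reserve)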

Clock skew. If hop A and hop B have wall-clocks that disagree by 50 ms, and the deadline is encoded as a wall-clock timestamp, hop B's view of the remaining budget is off by 50 ms. The fix is to encode the deadline as a duration from now (the gRPC choice — grpc-timeout: 800m means "this many milliseconds from when you receive this header"), not as an absolute timestamp. The duration form is naturally clock-skew-immune because each hop computes its own absolute deadline using its own clock the moment the request arrives.
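
A sketch of the duration form on the wire, reusing the Deadline class from the listing above. The x-deadline-ms header name is hypothetical (gRPC's real header is grpc-timeout); the re-anchoring logic is the point:

DEADLINE_HEADER = "x-deadline-ms"   # hypothetical name; gRPC uses grpc-timeout

def inject_deadline(headers: dict, dl: Deadline) -> None:
    # send the remaining *duration*, never an absolute timestamp
    headers[DEADLINE_HEADER] = str(int(dl.remaining_ms()))

def extract_deadline(headers: dict) -> Deadline:
    # the receiver re-anchors on its own clock the moment the request
    # arrives, so wall-clock skew between hops never enters the math
    budget_ms = float(headers.get(DEADLINE_HEADER, "0"))
    return Deadline(budget_ms)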

Synchronous fan-out. The whole "subtract elapsed" model assumes a linear call chain. When a service makes 5 parallel calls, all 5 share the same remaining budget; they do not get to spend it independently. Passing the same Deadline object to each branch gives the right behaviour — each sees "300 ms remaining" and each caps its call at 280 ms. The wrong implementation hands each branch a fresh 300 ms budget, which looks fine until one branch calls something that takes 290 ms — and the window for collecting the results is gone.
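
The correct form, reusing Deadline and rpc from the listing above: one shared object, no re-budgeting.

async def fan_out(dl: Deadline):
    # every branch reads the same Deadline, the same wall-clock instant;
    # each caps its own timeout at the shared remaining budget
    branches = [rpc(f"risk-{i}", p50_ms=30, p99_ms=120, dl=dl) for i in range(5)]
    # return_exceptions=True so one abandoned branch does not cancel the rest
    return await asyncio.gather(*branches, return_exceptions=True)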

Async retries inside the budget. Retries (chapter 42) must run inside the deadline, not outside it. A naive retry policy — "retry up to 3 times with 100 ms backoff" — silently triples the wall-clock budget if the caller's deadline is not consulted between attempts. The right pattern is: each retry checks dl.remaining_ms() before scheduling; if the next attempt's expected p50 exceeds remaining, the retry is skipped and the failure is returned immediately. This is the "retry budget" idea — the deadline is the budget; the retry policy is a strategy for spending it.
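
A sketch of that pattern, reusing rpc, Deadline, and DeadlineExceeded from the listing above; the attempt count and backoff are placeholders:

async def rpc_with_retries(name, p50_ms, p99_ms, dl: Deadline,
                           attempts=3, backoff_ms=100):
    # retries spend the deadline; they never extend it
    last_err = None
    for _ in range(attempts):
        if dl.child_timeout() < p50_ms:
            break                        # stop: the next attempt cannot fit
        try:
            return await rpc(name, p50_ms, p99_ms, dl)
        except DeadlineExceeded as err:
            last_err = err
            # back off, but never sleep past the deadline itself
            await asyncio.sleep(min(backoff_ms, dl.remaining_ms()) / 1000.0)
    raise last_err or DeadlineExceeded(f"{name}: no budget for any attempt")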

Going deeper

The Google Stubby legacy — where deadline propagation came from

Google's internal RPC system Stubby (the predecessor of gRPC) had deadline propagation from its earliest deployments because Google's production environment hit the doomed-work problem at a scale nobody else had: a single Search query at peak fanned out to ~1000 backend servers, and even a 1% rate of doomed work at every hop cumulatively wasted megawatts of compute. The deadline-propagation primitive was originally designed by Mike Burrows's group around the same time as Chubby (2006) for similar reasons — at fan-out scale, "every hop budgets independently" becomes catastrophically wasteful. When gRPC was open-sourced in 2015, deadline propagation was not a new feature — it was the most-load-bearing battle-tested primitive being externalised. Jeff Dean's Tail at Scale paper (CACM 2013) argues for it explicitly: "even when each component meets its own latency SLO, services must aggressively shed load when the upstream caller has indicated they will not wait for the response."

The right place to start the deadline — the user-facing edge, not the first internal service

A common mistake is to start the deadline at the first internal service rather than at the user-facing edge. KapitalKite had this for two years: the API gateway was treated as a pass-through (no deadline), and the first internal service (order-service) created the deadline. The result was that any time spent in the gateway — TLS handshake, auth check, rate-limit check, request body parsing — was invisible to the deadline. During market open, the gateway's own queue depth grew to 200 ms; the order-service started its 800 ms deadline only after the gateway's 200 ms was already gone, but the user had been waiting 1000 ms by then. Fix: the gateway started the deadline immediately on accept, which exposed the gateway's own queueing as ≈25% of the request budget at peak. The CEO-facing dashboard moved from "order-service p99 = 850 ms" to "gateway-to-bank p99 = 1100 ms" — the same wall-clock latency, now correctly attributed.

context.Context in Go is not magic — it works because every hop opts in

Go's context.Context has become almost a religious convention in the ecosystem: every function takes ctx as the first argument, every blocking call respects ctx.Done(). It looks like the language is doing the work, but it isn't — the language has no special knowledge of contexts. The whole pattern works because every standard library function and every reasonable third-party library opts in. database/sql checks ctx.Done() between query execution and result fetching. net/http cancels in-flight requests when ctx.Done() fires. grpc-go checks the deadline before sending the request. The pattern only works if you opt in too — os.ReadFile does not take a context, and an os.ReadFile of a 200 GB file in your hot path will run to completion regardless of the deadline because it has no way to listen. The deadline is only as good as the operation that listens for it. Why this matters operationally: any time you write a function that does I/O without taking a ctx (or Deadline), you have inserted a non-cancellable region into the call chain. The deadline that propagates around your function is not propagating through your function. The first sign of this in production is a long-tail of requests that complete after the deadline because they were stuck in such a region; the fix is always "thread the context through".

CricStream's promo-engine outage — the "missing 80 ms" post-mortem

CricStream's match-page service had deadline propagation enabled in 2025 but missed one hop: the legacy ad-decision service still wrote to MySQL via a JDBC driver that did not respect ctx.Done(). During a final at 19:42 IST, the MySQL primary had a brief 1.4 s replication-lag spike. The ad-decision service's deadline-aware code saw 200 ms of remaining budget and called the JDBC executeQuery — which then ran for 1.4 s, ignoring the deadline entirely because the driver lacked cancellation. Every other hop did abandon at the deadline; only the JDBC layer kept running. The match-page service's threads piled up inside the JDBC call until the bulkhead saturated; the cascade reached the gateway in 11 seconds. The post-mortem identified the root cause as "deadline propagation works only as well as the slowest non-cancellable operation in the chain". The fix was the MariaDB Connector/J upgrade (which respects Statement.setQueryTimeout() and the calling thread's interrupt) plus a per-query upper-bound timeout that the deadline could clamp downward. Total user-visible impact: 4.6 minutes of degraded match-page rendering for ~3.1 M concurrent viewers; ad revenue loss was estimated at ₹84 lakh. The lesson the SRE team distilled: audit every I/O library in the call chain for cancellation support; one non-cancellable operation defeats the whole pattern.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# no installation needed — asyncio and random ship with the standard library
# save deadline_propagation.py from the article body
python3 deadline_propagation.py
# Expected: 1500ms and 800ms budgets succeed; 400ms and 250ms abandon early.
# Try setting the seed to other values; the abandonment point shifts because
# upstream hops sometimes take longer, leaving less for the bank-rail.

Where this leads next

Deadline propagation is the spine of the failure-isolation stack. Every other reliability pattern in this section depends on it:

The composition is deadline → bulkhead → breaker → retry → timeout → RPC. The deadline is outermost because every other layer needs to see how much budget is left. A bulkhead permit released early because of an expired deadline; a circuit breaker that does not even attempt the half-open probe call when remaining < p50; a retry that skips its second attempt because the budget cannot fit it — these are all small wins on a single request, and large wins at fleet scale during peak load.
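
A sketch of that composition, reusing Deadline, DeadlineExceeded, and the rpc_with_retries sketch from earlier. acquire_bulkhead and breaker_allows are hypothetical stand-ins for the chapter 44 and chapter 43 patterns:

async def guarded_call(name, p50_ms, p99_ms, dl: Deadline):
    # the deadline is the outermost gate: no inner layer runs without budget
    if dl.expired():
        raise DeadlineExceeded(f"{name}: budget exhausted before any layer ran")
    async with acquire_bulkhead(name):       # hypothetical: caps in-flight count (ch. 44)
        if not breaker_allows(name, dl):     # hypothetical: skips probes when remaining < p50 (ch. 43)
            raise DeadlineExceeded(f"{name}: breaker open, probe not worth the budget")
        return await rpc_with_retries(name, p50_ms, p99_ms, dl)  # retries spend the same budget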

A senior PaySetu SRE summarised the discipline this way: every request enters the data centre carrying a clock; every service it touches reads the clock and decides whether the work it is about to do will be useful when finished. The clock is the deadline. Every operation in the call chain is a vote on whether to keep spending against it.

References

  1. Jeff Dean & Luiz André Barroso, "The Tail at Scale" (CACM 2013) — the canonical case for deadline-propagation at fanout scale; argues that aggregate latency at high fan-out is dominated by the slowest backend unless the system can shed doomed work.
  2. gRPC documentation — Deadlines — the operative guide to using WithDeadline correctly across languages; covers the grpc-timeout header wire format.
  3. Go context package documentation — the canonical implementation pattern for deadline-bearing request contexts.
  4. Dave Cheney, "Context isn't for cancellation" (2017) — the often-cited push-back that clarifies what context should and should not carry.
  5. Cindy Sridharan, "Distributed Systems Observability" (O'Reilly, 2018) — covers context propagation alongside tracing; the deadline travels with the trace.
  6. AWS Builders' Library — "Timeouts, retries and backoff with jitter" — production-side write-up; argues for deadline-as-duration on the wire to avoid clock-skew issues.
  7. Bulkheads — chapter 44; the partner pattern that caps the in-flight count.
  8. Circuit breakers (Hystrix, Sentinel) — chapter 43; deadlines abandon doomed work, breakers stop sending calls to sick backends.