Deadlines and deadline propagation

It is 8:47 pm at PaySetu and the dashboard for the Checkout.Pay RPC shows two numbers that should not both be true: p99 latency is 180 ms — well inside the 200 ms front-door budget — and the database CPU is pegged at 94%. Aditi pulls a flame graph from the payments-DB primary and sees what is happening: a long tail of risk-scoring queries are running for 6, 11, 17 seconds after the front-door RPC has already returned a DEADLINE_EXCEEDED error to the merchant. The user gave up. The merchant SDK retried. The original query kept burning the database. Multiply that by 4,200 retries per second during a Diwali sale and the database is doing the same expensive work three times — once for the failed request, once for the retry, and once more, gratuitously, for the abandoned tail.

A deadline is a single absolute timestamp (8:47:12.430 PM UTC) that travels with a request through every hop, telling each callee how long it has left to finish before the caller has already given up. A timeout is a duration the local hop applies; a deadline is the global budget. Without propagation, deep call trees waste compute on work nobody is waiting for; with propagation, every hop knows when to stop. The discipline is: pick an absolute time at the front door, send it on every outbound RPC, refuse to start work that cannot finish in the time remaining.

Timeout vs deadline — the same number from a different reference frame

A timeout is a duration: "this RPC may take up to 200 ms". It is local — each hop sets its own. A deadline is a wall-clock instant: "this RPC must complete before 20:47:12.430 UTC". It is global — set once at the front door, every hop reads the same number. The difference looks pedantic until you trace one request through five services and discover that "200 ms" at hop 1 and "200 ms" at hop 4 add up to a worst case of 1 second, even though no single hop ever violated its own contract.

[Figure: Why per-hop timeouts add up while a deadline does not. Two stacked timelines over the same five-hop call tree. Top, per-hop timeouts of 200 ms each (A→B, B→C, C→D, D→E, E→F): worst-case wall-clock is 5 × 200 = 1000 ms, even though the caller already returned DEADLINE_EXCEEDED at 200 ms. Bottom, a single deadline set at A (now + 200 ms) and inherited at every hop (A: 200 ms left, B: 178 ms, C: 142 ms, D: 99 ms, E: 41 ms): worst-case wall-clock is bounded by 200 ms, because every hop sees the global budget shrinking under it. Illustrative: actual remaining-time decay depends on per-hop work; the point is that the absolute boundary stays fixed.]
Same call tree, two reference frames. Per-hop timeouts are independent durations and compose additively in the worst case. A deadline is a single absolute timestamp — every hop reads the same wall-clock target and adjusts.

The reason this matters operationally: under load, a 200 ms hop can take 195 ms (slow tail) and still succeed. The next hop, given a 200 ms timeout, might take another 195 ms and also succeed — but the caller, who started its 200 ms clock at hop 1, has already returned an error to the user. Every byte the deeper hops produce is being computed for an audience that has left the room. Deadlines refuse to start work that cannot finish before the caller gives up; per-hop timeouts cannot — they have no idea what the caller's clock said.

Three things the deadline must answer

For deadlines to work in production, every hop has to be able to answer three questions in O(1) time, every time it is about to make a downstream call.

The first question is what is the deadline? Stored as an absolute monotonic instant — never as "200 ms from now" — so that the value does not drift as it travels through queues, retries, and serialization. gRPC encodes it on the wire in the grpc-timeout header as a duration ("178m" for 178 milliseconds), but the server reconstructs an absolute deadline locally the moment the header arrives, and never reuses the duration after that.

The second question is how much time is left? remaining = deadline − now(). If remaining ≤ 0, fail fast with DEADLINE_EXCEEDED and do not make the downstream call at all. Why "fail fast before calling" matters more than "fail when the call comes back": every doomed downstream call still costs CPU and connection-pool slots on the server side, even if the caller will discard the result. Under load, refusing to start work that has already missed its deadline is the single largest source of recovered capacity. Google's gRPC team measured this on Borg in 2014: ~30% of capacity in deeply-nested services was being spent on RPCs that had already exceeded their deadlines — work whose result was thrown away.

The third question is what budget do I send downstream? outbound_deadline = inbound_deadline − local_budget, where local_budget is the time you reserve for your own work after the downstream call returns (response serialization, error handling, the ack). If you naively forward the inbound deadline, your downstream callee may take exactly that long and leave you zero time to formulate a response — your own deadline has expired by the time you have a result to send back. A typical reservation is 5–20 ms, more if the local work involves CPU-heavy serialization.
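To make the three answers concrete, here is a minimal sketch of one hop's check using gRPC's Python API. context.time_remaining(), context.abort(), and the per-call timeout= argument are real gRPC Python APIs; the stub, its Score method, and the 200 ms fallback budget are illustrative assumptions.

# deadline_hop.py — sketch: answering the three questions inside one gRPC handler
import grpc

LOCAL_RESERVE_S = 0.010   # budget kept for our own work after the downstream returns (~10 ms)

def call_downstream(context: grpc.ServicerContext, stub, request):
    # Q1 is already answered: the gRPC runtime stored the absolute deadline when the
    # grpc-timeout header arrived; time_remaining() is computed against that instant.
    left = context.time_remaining()          # Q2: seconds left on the caller's deadline (None if absent)
    if left is None:
        left = 0.200                         # caller sent no deadline: apply a local default budget
    if left <= LOCAL_RESERVE_S:
        # Q2 says no: refuse before touching a connection pool or acquiring a lock.
        context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, "not enough budget to start")
    # Q3: forward the remainder minus our reserve; gRPC re-encodes it as a fresh grpc-timeout header.
    return stub.Score(request, timeout=left - LOCAL_RESERVE_S)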

[Figure: How a deadline propagates and shrinks across hops. Five services in a call chain, each showing the inbound remaining budget, the local reserve, and the outbound budget: A (edge) sets deadline = now + 200 ms; after 12 ms of network, B (gateway) sees 178 ms left, reserves 5 ms, sends 173 ms; C (payments) sees 160 ms left, reserves 18 ms, sends 142 ms; D (risk) sees 132 ms left, reserves 33 ms, sends 99 ms; E (scoring) sees 87 ms left and starts the database query. If the worst-case query is 120 ms but only 87 ms is left, E refuses at the connection layer and returns DEADLINE_EXCEEDED locally: the database row lock is never acquired, the connection slot is never held, and C, B, and A still get back DEADLINE_EXCEEDED, but the system has done no useless work. Illustrative: real reserves at PaySetu range 5–20 ms per hop, derived from the p95 of local processing time.]
The remaining-time field shrinks at every hop, by network delay plus local-budget reservation. A hop that sees `left ≤ worst-case-local-work` refuses to start instead of starting and timing out mid-flight.

Three failure modes deadline propagation prevents

Wasted database queries on abandoned requests

This is Aditi's case from the lead. The merchant SDK had a 200 ms client deadline. The gateway forwarded it. Payments forwarded it. Risk forwarded it. But the scoring service ran a Postgres query against an index that had bloated overnight, and the query took 6 seconds. The risk service did not plumb the deadline into its psycopg2 connection — the connection had a 30-second statement_timeout set globally — so the query ran to completion. Multiply by the retry storm. The fix was small: issue SET LOCAL statement_timeout = remaining_ms inside each scoring query's transaction, where remaining_ms was read from the gRPC context's deadline (a sketch follows). After the deploy, scoring CPU dropped 41% during peak; the front-door p99 stayed flat.
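A minimal sketch of that fix, assuming psycopg2 and a remaining_ms value read from the request's deadline; run_scoring_query is an illustrative name, while SET LOCAL statement_timeout is the real Postgres mechanism.

# scoring_query.py — sketch: cap each query at the propagated remaining budget
def run_scoring_query(conn, sql, params, remaining_ms):
    budget_ms = max(1, int(remaining_ms))                 # statement_timeout takes milliseconds
    with conn.cursor() as cur:
        # SET LOCAL lasts only until the end of the current transaction, so the pooled
        # connection keeps its global 30 s default for every other caller.
        cur.execute("SET LOCAL statement_timeout = %s", (budget_ms,))
        cur.execute(sql, params)                          # cancelled server-side if it overruns
        return cur.fetchall()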

The retry-amplification cascade

If hop A retries hop B with a 100 ms timeout, and hop B's call to hop C has the same 100 ms timeout, you get the canonical retry cascade: A's first attempt times out at 100 ms; A retries, sending another full 100 ms timeout; B starts a fresh call to C; and now C sees two concurrent requests for the same logical operation, both with full budgets, both about to time out and trigger their own retries. With a propagated deadline, A's retry sends remaining_after_first_attempt — perhaps 30 ms — and C immediately sees that the deadline is too close to start work, refuses fast, and the cascade dies in one round trip instead of three.
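A sketch of that difference, reusing the Deadline and DeadlineExceeded helpers defined in the code section below; attempt() and MIN_USEFUL_MS are illustrative.

# retry_with_budget.py — sketch: a retry spends what is left, never a fresh budget
MIN_USEFUL_MS = 20.0    # below this, no attempt can possibly finish: skip the retry entirely

async def call_with_retry(attempt, dl, max_attempts=3):
    """attempt(budget_ms) is an async callable; dl is a Deadline as defined below."""
    last_err = None
    for i in range(max_attempts):
        left = dl.remaining_ms()
        if left <= MIN_USEFUL_MS:                 # the cascade dies here, in one round trip
            raise DeadlineExceeded(f"retry {i}: only {left:.0f}ms left")
        try:
            return await attempt(budget_ms=left)  # e.g. 30 ms on the retry, not another full 100 ms
        except DeadlineExceeded as e:
            last_err = e                          # try again with whatever budget remains
    raise last_err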

The infinite-loop work-graph

Some service graphs have cycles you didn't realise existed. Search → ProductCatalog → Recommendations → Search is a real example from BharatBazaar's mid-2024 architecture review — a cyclic call where Recommendations occasionally needed catalog metadata that triggered the search path again. Without deadline propagation, a request entering the cycle could loop until some unrelated resource limit (a thread pool, a connection pool, memory) finally broke it. With deadlines, every hop subtracts from the budget; the cycle terminates when the budget reaches zero, regardless of whether the loop was supposed to exist. The deadline acts as a safety fence even for bugs the architects didn't anticipate, as the toy version below shows.
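A self-contained toy version of the fence; the service names mirror the BharatBazaar cycle, and the 3 ms per-hop cost is arbitrary.

# cycle_fence.py — sketch: a shrinking budget terminates a cycle no hop knows it is in
import time

def hop(name, next_hop, deadline_abs, hop_cost_ms=3.0):
    left_ms = (deadline_abs - time.monotonic()) * 1000
    if left_ms <= hop_cost_ms:
        return f"{name}: DEADLINE_EXCEEDED after budget exhausted"   # the fence
    time.sleep(hop_cost_ms / 1000)                                   # simulate local work
    return next_hop(deadline_abs)                                    # ...and call the next hop

def search(dl):          return hop("search", catalog, dl)
def catalog(dl):         return hop("catalog", recommendations, dl)
def recommendations(dl): return hop("recommendations", search, dl)

print(search(time.monotonic() + 0.050))   # 50 ms budget: the loop ends after roughly 16 hops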

Code: a deadline-aware RPC client with budget reservation

This is the smallest faithful implementation of deadline propagation: an asyncio-style client that carries an absolute deadline through a chain of calls, reserves local budget at each hop, and refuses to start downstream work that cannot finish in the time remaining.

# deadlines.py — propagating an absolute deadline through a call chain
import asyncio, time
from dataclasses import dataclass

@dataclass(frozen=True)
class Deadline:
    abs_ts: float                   # monotonic seconds, set once at the edge
    @staticmethod
    def in_(ms: float) -> "Deadline":
        return Deadline(time.monotonic() + ms / 1000)
    def remaining_ms(self) -> float:
        return max(0.0, (self.abs_ts - time.monotonic()) * 1000)
    def expired(self) -> bool:
        return self.remaining_ms() <= 0

class DeadlineExceeded(Exception): pass

async def call(name: str, dl: Deadline, work_ms: float, reserve_ms: float = 8.0):
    """Refuse work that cannot finish; otherwise simulate it under wait_for(remaining)."""
    left = dl.remaining_ms()
    if left <= work_ms + reserve_ms:
        print(f"  [{name}] REFUSE: left={left:.1f}ms, need {work_ms:.0f}+{reserve_ms:.0f}ms")
        raise DeadlineExceeded(name)
    print(f"  [{name}] start: left={left:.1f}ms, will work {work_ms:.0f}ms")
    try:
        await asyncio.wait_for(asyncio.sleep(work_ms / 1000), timeout=left / 1000)
    except asyncio.TimeoutError:
        raise DeadlineExceeded(name)
    print(f"  [{name}] done:  left={dl.remaining_ms():.1f}ms")

async def edge_request(total_budget_ms: float, db_work_ms: float):
    dl = Deadline.in_(total_budget_ms)              # set once at the front door
    print(f"\nedge: deadline={total_budget_ms:.0f}ms, db_work={db_work_ms:.0f}ms")
    try:
        await call("gateway",  dl, work_ms=12)
        await call("payments", dl, work_ms=18)
        await call("risk",     dl, work_ms=14)
        await call("scoring",  dl, work_ms=db_work_ms, reserve_ms=2)  # leaf, less reserve
    except DeadlineExceeded as e:
        print(f"  -> DEADLINE_EXCEEDED at {e}, total elapsed = "
              f"{(total_budget_ms - dl.remaining_ms()):.1f}ms")

async def main():
    await edge_request(total_budget_ms=200, db_work_ms=40)   # comfortably fits
    await edge_request(total_budget_ms=200, db_work_ms=160)  # leaf refuses

asyncio.run(main())

Sample run:

edge: deadline=200ms, db_work=40ms
  [gateway] start: left=200.0ms, will work 12ms
  [gateway] done:  left=188.0ms
  [payments] start: left=188.0ms, will work 18ms
  [payments] done:  left=170.0ms
  [risk] start: left=170.0ms, will work 14ms
  [risk] done:  left=156.0ms
  [scoring] start: left=156.0ms, will work 40ms
  [scoring] done:  left=116.0ms

edge: deadline=200ms, db_work=160ms
  [gateway] start: left=200.0ms, will work 12ms
  [gateway] done:  left=188.0ms
  [payments] start: left=188.0ms, will work 18ms
  [payments] done:  left=170.0ms
  [risk] start: left=170.0ms, will work 14ms
  [risk] done:  left=156.0ms
  [scoring] REFUSE: left=156.0ms, need 160+2ms
  -> DEADLINE_EXCEEDED at scoring, total elapsed = 44.0ms

Walkthrough. The line dl = Deadline.in_(total_budget_ms) is the only place a relative duration becomes an absolute monotonic instant — every subsequent hop reads dl.remaining_ms() and never has to know what the original budget was. The line if left <= work_ms + reserve_ms: raise DeadlineExceeded(name) is the fail-fast check: the leaf hop knows its own worst-case work (here, 160 ms for the slow database query) and refuses before doing anything that would acquire a connection or a row lock.

Why fail-fast at the leaf is more valuable than at the edge: the edge can return DEADLINE_EXCEEDED to the caller in either case — the user gets the same error. What changes is what the system did before the error. With fail-fast at the leaf, the database connection is returned to the pool unused; the row lock is never taken; the query plan cache is not polluted. This is the difference between burning 44 ms of wall time (case 2) and burning 156 ms plus a held DB lock plus a poisoned plan cache.

The reserve_ms=8.0 on intermediate hops and reserve_ms=2 on the leaf is the local-budget reservation pattern: at each hop, the caller subtracts what it expects to need post-downstream-return (response serialization, error formatting, the gRPC trailer write). Without that reservation, a hop could call downstream with the full remaining budget, the downstream could take exactly that long, and the caller's own deadline would have expired by the time it had a result to return.

Why monotonic clocks (time.monotonic()) and not wall-clock (time.time()): wall-clock is subject to NTP step changes, leap seconds, and (on bad VMs) jumps backwards by minutes. A deadline computed against wall-clock can travel forward in time and cause a hop to either start work it should have refused, or refuse work it should have done. Monotonic clocks are guaranteed to advance — never jump backwards — so remaining = abs_ts − monotonic_now() is always meaningful. gRPC, Envoy, Linkerd all use monotonic for deadlines internally; the wire-format duration in grpc-timeout is reconstructed to a monotonic absolute on receipt.

Common confusions

"A deadline is just a longer timeout." A timeout is a per-hop duration; a deadline is one absolute instant every hop reads. Setting "200 ms" at each hop independently recreates exactly the additive worst case the deadline exists to prevent.

"Propagate the remaining duration." Propagate the absolute instant, or reconstruct it the moment a duration arrives, as gRPC does. A duration that travels through queues, retries, and serialization drifts; an absolute instant does not.

"Forward the full inbound deadline downstream." Reserve local budget first. If the callee uses everything you sent, you have zero time left to serialize your own response.

"Compute deadlines against wall-clock time." Use monotonic clocks. Wall-clock can step under NTP and make remaining = deadline − now() lie in either direction.

"Fail-fast doesn't matter because the caller errors out either way." The user sees the same DEADLINE_EXCEEDED; what changes is whether the system spent CPU, connection slots, and row locks computing an answer nobody will read.

Going deeper

gRPC's grpc-timeout header and the monotonic-clock conversion

gRPC encodes the deadline on the wire as the grpc-timeout HTTP/2 header — a duration plus a unit, like 200m for milliseconds, 5S for seconds, or 1H for hours. The format is intentionally compact (1–8 ASCII bytes) because every gRPC call carries it. On the server side, the moment the header arrives in the HEADERS frame, the gRPC runtime computes abs_deadline = now_monotonic() + parsed_duration and stores the absolute instant on the request context. Every subsequent ctx.Deadline() call returns that absolute. When the server makes a downstream gRPC call, it serialises remaining = abs_deadline − now_monotonic() back into a grpc-timeout header — a new duration, computed on each outbound request. This round-trip preserves the absolute boundary even though the wire format is duration-based. The behaviour is specified in the gRPC core spec.
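A sketch of the two conversions; the unit letters (H, M, S, m, u, n) are from the gRPC over HTTP/2 spec, while error handling and the eight-digit limit check are omitted.

# grpc_timeout.py — sketch: grpc-timeout header <-> absolute monotonic deadline
import time

_UNIT_SECONDS = {"H": 3600.0, "M": 60.0, "S": 1.0, "m": 1e-3, "u": 1e-6, "n": 1e-9}

def inbound_to_absolute(header: str) -> float:
    """'178m' arriving in the HEADERS frame -> absolute monotonic deadline."""
    value, unit = int(header[:-1]), header[-1]
    return time.monotonic() + value * _UNIT_SECONDS[unit]

def outbound_from_absolute(abs_deadline: float) -> str:
    """Absolute monotonic deadline -> a fresh duration for the next hop's grpc-timeout."""
    remaining_ms = max(0, int((abs_deadline - time.monotonic()) * 1000))
    return f"{remaining_ms}m"

print(outbound_from_absolute(inbound_to_absolute("178m")))   # '178m', or a hair less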

Why per-call deadlines need a hedged-request budget on top

Tail-latency mitigation (hedged requests, see Dean & Barroso, "The Tail at Scale") sends a duplicate request to a second replica after the first has been outstanding for more than the p95 latency. The hedged second request inherits the same deadline as the original — which means as the deadline approaches, the hedge has progressively less time to complete. A naïve hedger sends the duplicate even when only 10 ms remain, and the duplicate has zero chance of beating the original. Production hedged-request systems (Google's own internal RPC stack, Cloudflare's hedge implementation) skip the hedge when remaining < hedge_minimum_useful_budget — typically the median latency of the downstream — because below that threshold the hedge is pure overhead.
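A sketch of a deadline-aware hedger along those lines; attempt, hedge_after_ms, and min_useful_ms are illustrative names, with the latter two taken from the downstream's latency distribution (the p95 and the median, respectively).

# hedge.py — sketch: hedge only while the duplicate still has a useful share of the budget
import asyncio, time

async def hedged_call(attempt, deadline_abs, hedge_after_ms, min_useful_ms):
    """attempt() is an async callable hitting one replica; both copies share the deadline."""
    first = asyncio.ensure_future(attempt())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_ms / 1000)
    if done:
        return first.result()                              # fast path: the hedge never fires
    left_ms = (deadline_abs - time.monotonic()) * 1000
    if left_ms < min_useful_ms:                            # too late: a hedge is pure overhead
        return await asyncio.wait_for(first, timeout=left_ms / 1000)
    second = asyncio.ensure_future(attempt())              # the duplicate inherits the same deadline
    done, pending = await asyncio.wait({first, second}, timeout=left_ms / 1000,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                                      # the loser is cancelled, not awaited
    if not done:
        raise asyncio.TimeoutError("deadline exceeded")
    return done.pop().result()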

Deadline-aware databases: PostgreSQL statement_timeout, MySQL MAX_EXECUTION_TIME

Most relational databases let you set a per-statement deadline at execution time, not just per-connection. Postgres's SET LOCAL statement_timeout = '178ms' inside a transaction caps how long the database will let that query run; if the query exceeds the timeout, the database cancels it server-side and returns an error, and the row locks the query held are released as part of the abort. Plumbing the propagated remaining-deadline into this per-statement timeout is the difference between Aditi's database burning 6 seconds on an abandoned scoring query and the same query being cancelled at 156 ms with the row locks released. The same pattern applies to MySQL (the MAX_EXECUTION_TIME(N) optimizer hint, in milliseconds), MongoDB (maxTimeMS), and Cassandra (per-request timeouts set in the driver).
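The same plumbing in MongoDB terms, as a sketch: fetch_features and remaining_ms are illustrative names, while Cursor.max_time_ms() is the real pymongo API behind maxTimeMS.

# mongo_budget.py — sketch: pass the propagated budget as MongoDB's per-operation maxTimeMS
def fetch_features(collection, merchant_id, remaining_ms):
    cursor = (collection.find({"merchant_id": merchant_id})
                        .max_time_ms(max(1, int(remaining_ms))))   # server aborts the operation past this
    return list(cursor)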

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 deadlines.py

# To watch a real grpc-timeout header on the wire:
pip install grpcio grpcio-tools
# In one terminal: python3 -m grpc_tools.protoc --python_out=. --grpc_python_out=. echo.proto
# In another:      tcpdump -i lo -A -s 0 'tcp port 50051' | grep -i grpc-timeout

Where this leads next

Deadlines are the budget; what you do with the budget is the discipline of the remaining Part 4 chapters.

Beyond Part 4, deadlines surface in Part 7 (reliability patterns — retries, hedged requests, and circuit breakers all consult remaining), in Part 14 (distributed transactions — 2PC's prepare-phase has a deadline; coordinators that miss it must abort), and in Part 17 (geo-distribution — cross-region RPCs eat 80–250 ms of WAN RTT before any work starts, which forces front-door deadlines to be sized as WAN_RTT + per-region budget × hop_depth).

References

  1. gRPC over HTTP/2 — protocol specification — the canonical definition of the grpc-timeout header and how clients/servers convert between durations and absolute deadlines.
  2. "The Tail at Scale" — Dean & Barroso, CACM 2013 — establishes why per-request deadlines and hedged requests are mandatory in deeply-nested service graphs, with measurements from Google production.
  3. "Cancellation, Context, and Plumbing" — Sameer Ajmani, Go Blog, 2014 — the original write-up of Go's context.Context design, including why deadlines propagate explicitly through every function signature.
  4. "Timeouts, retries and backoff with jitter" — Marc Brooker, AWS Architecture Blog — operational guidance on picking timeout values from latency distributions, with Amazon production data.
  5. Envoy proxy — timeouts and deadlines documentation — how Envoy enforces per-route deadlines and surfaces gRPC-Status: 4 (DEADLINE_EXCEEDED) to clients.
  6. "The Hyrum's Law of Retry Budgets" — Marc Brooker — analysis of how retries and deadlines interact, including the budget arithmetic for retry-amplification cascades.
  7. RPC semantics: at-most-once, at-least-once, exactly-once — internal companion. Why a retry that misses its deadline must be skipped, not attempted.
  8. The fallacies of distributed computing revisited — internal companion. Deadlines are the operational refutation of fallacy #2 ("latency is zero") and fallacy #7 ("transport cost is zero").