Deadlines and deadline propagation

It is 8:47 pm at PaySetu and the dashboard for the Checkout.Pay RPC shows two numbers that should not both be true: p99 latency is 180 ms — well inside the 200 ms front-door budget — and the database CPU is pegged at 94%. Aditi pulls a flame graph from the payments-DB primary and sees what is happening: a long tail of risk-scoring queries are running for 6, 11, 17 seconds after the front-door RPC has already returned a DEADLINE_EXCEEDED error to the merchant. The user gave up. The merchant SDK retried. The original query kept burning the database. Multiply that by 4,200 retries per second during a Diwali sale and the database is doing the same expensive work three times — once for the failed request, once for the retry, and once more, gratuitously, for the abandoned tail.

A deadline is a single absolute timestamp (8:47:12.430 PM UTC) that travels with a request through every hop, telling each callee how long it has left to finish before the caller has already given up. A timeout is a duration the local hop applies; a deadline is the global budget. Without propagation, deep call trees waste compute on work nobody is waiting for; with propagation, every hop knows when to stop. The discipline is: pick an absolute time at the front door, send it on every outbound RPC, refuse to start work that cannot finish in the time remaining.

Timeout vs deadline — the same number from a different reference frame

A timeout is a duration: "this RPC may take up to 200 ms". It is local — each hop sets its own. A deadline is a wall-clock instant: "this RPC must complete before 20:47:12.430 UTC". It is global — set once at the front door, every hop reads the same number. The difference looks pedantic until you trace one request through five services and discover that "200 ms" at hop 1 and "200 ms" at hop 4 add up to a worst case of 1 second, even though no single hop ever violated its own contract.

[Figure: Why per-hop timeouts add up while a deadline does not. Two stacked timelines over the same five-hop call tree. Top, per-hop timeouts of 200 ms each (A→B, B→C, C→D, D→E, E→F): worst-case wall-clock is 5 × 200 = 1000 ms, even though the caller already returned DEADLINE_EXCEEDED at 200 ms. Bottom, a single deadline set at A (now + 200 ms) and inherited at every hop (A: 200 ms left, B: 178 ms, C: 142 ms, D: 99 ms, E: 41 ms): worst-case wall-clock is bounded by 200 ms, because every hop sees the global budget shrinking under it. Illustrative: actual remaining-time decay depends on per-hop work; the point is that the absolute boundary stays fixed.]
Same call tree, two reference frames. Per-hop timeouts are independent durations and compose additively in the worst case. A deadline is a single absolute timestamp — every hop reads the same wall-clock target and adjusts.

The reason this matters operationally: under load, a 200 ms hop can take 195 ms (slow tail) and still succeed. The next hop, given a 200 ms timeout, might take another 195 ms and also succeed — but the caller, who started its 200 ms clock at hop 1, has already returned an error to the user. Every byte the deeper hops produce is being computed for an audience that has left the room. Deadlines refuse to start work that cannot finish before the caller gives up; per-hop timeouts cannot — they have no idea what the caller's clock said.

Three things the deadline must answer

For deadlines to work in production, every hop has to be able to answer three questions in O(1) time, every time it is about to make a downstream call.

The first question is what is the deadline? Stored as an absolute monotonic instant — never as "200 ms from now" — so that the value does not drift as it travels through queues, retries, and serialization. gRPC encodes it on the wire in the grpc-timeout header as a duration ("178m" for 178 milliseconds), but the server reconstructs an absolute deadline locally the moment the header arrives, and never reuses the duration after that.

The second question is how much time is left? remaining = deadline − now(). If remaining ≤ 0, fail fast with DEADLINE_EXCEEDED and do not make the downstream call at all. Why "fail fast before calling" matters more than "fail when the call comes back": every doomed downstream call still costs CPU and connection-pool slots on the server side, even if the caller will discard the result. Under load, refusing to start work that has already missed its deadline is the single largest source of recovered capacity. Google's gRPC team measured this on Borg in 2014: ~30% of capacity in deeply-nested services was being spent on RPCs that had already exceeded their deadlines — work whose result was thrown away.

The third question is what budget do I send downstream? outbound_deadline = inbound_deadline − local_budget, where local_budget is the time you reserve for your own work after the downstream call returns (response serialization, error handling, the ack). If you naively forward the inbound deadline, your downstream callee may take exactly that long and leave you zero time to formulate a response — your own deadline has expired by the time you have a result to send back. A typical reservation is 5–20 ms, more if the local work involves CPU-heavy serialization.
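To make the three answers concrete, here is a minimal sketch of one hop's check using gRPC's Python API. context.time_remaining(), context.abort(), and the per-call timeout= argument are real gRPC Python APIs; the stub, its Score method, and the 200 ms fallback budget are illustrative assumptions.

# deadline_hop.py — sketch: answering the three questions inside one gRPC handler
import grpc

LOCAL_RESERVE_S = 0.010   # budget kept for our own work after the downstream returns (~10 ms)

def call_downstream(context: grpc.ServicerContext, stub, request):
    # Q1 is already answered: the gRPC runtime stored the absolute deadline when the
    # grpc-timeout header arrived; time_remaining() is computed against that instant.
    left = context.time_remaining()          # Q2: seconds left on the caller's deadline (None if absent)
    if left is None:
        left = 0.200                         # caller sent no deadline: apply a local default budget
    if left <= LOCAL_RESERVE_S:
        # Q2 says no: refuse before touching a connection pool or acquiring a lock.
        context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, "not enough budget to start")
    # Q3: forward the remainder minus our reserve; gRPC re-encodes it as a fresh grpc-timeout header.
    return stub.Score(request, timeout=left - LOCAL_RESERVE_S)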

[Figure: How a deadline propagates and shrinks across hops. Five services in a call chain, each showing the inbound remaining budget, the local reserve, and the outbound budget: A (edge) sets deadline = now + 200 ms; after 12 ms of network, B (gateway) sees 178 ms left, reserves 5 ms, sends 173 ms; C (payments) sees 160 ms left, reserves 18 ms, sends 142 ms; D (risk) sees 132 ms left, reserves 33 ms, sends 99 ms; E (scoring) sees 87 ms left and starts the database query. If the worst-case query is 120 ms but only 87 ms is left, E refuses at the connection layer and returns DEADLINE_EXCEEDED locally: the database row lock is never acquired, the connection slot is never held, and C, B, and A still get back DEADLINE_EXCEEDED, but the system has done no useless work. Illustrative: real reserves at PaySetu range 5–20 ms per hop, derived from the p95 of local processing time.]
The remaining-time field shrinks at every hop, by network delay plus local-budget reservation. A hop that sees `left ≤ worst-case-local-work` refuses to start instead of starting and timing out mid-flight.

Three failure modes deadline propagation prevents

Wasted database queries on abandoned requests

This is Aditi's case from the lead. The merchant SDK had a 200 ms client deadline. The gateway forwarded it. Payments forwarded it. Risk forwarded it. But the scoring service ran a Postgres query against an index that had bloated overnight, and the query took 6 seconds. The risk service did not plumb the deadline into its psycopg2 connection — the connection had a 30-second statement_timeout set globally — so the query ran to completion. Multiply by the retry storm. The fix was small: issue SET LOCAL statement_timeout = remaining_ms inside each scoring query's transaction, where remaining_ms was read from the gRPC context's deadline (a sketch follows). After the deploy, scoring CPU dropped 41% during peak; the front-door p99 stayed flat.
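A minimal sketch of that fix, assuming psycopg2 and a remaining_ms value read from the request's deadline; run_scoring_query is an illustrative name, while SET LOCAL statement_timeout is the real Postgres mechanism.

# scoring_query.py — sketch: cap each query at the propagated remaining budget
def run_scoring_query(conn, sql, params, remaining_ms):
    budget_ms = max(1, int(remaining_ms))                 # statement_timeout takes milliseconds
    with conn.cursor() as cur:
        # SET LOCAL lasts only until the end of the current transaction, so the pooled
        # connection keeps its global 30 s default for every other caller.
        cur.execute("SET LOCAL statement_timeout = %s", (budget_ms,))
        cur.execute(sql, params)                          # cancelled server-side if it overruns
        return cur.fetchall()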

The retry-amplification cascade

If hop A retries hop B with a 100 ms timeout, and hop B's call to hop C has the same 100 ms timeout, you get the canonical retry cascade: A's first attempt times out at 100 ms; A retries, sending another full 100 ms timeout; B starts a fresh call to C; and now C sees two concurrent requests for the same logical operation, both with full budgets, both about to time out and trigger their own retries. With a propagated deadline, A's retry sends remaining_after_first_attempt — perhaps 30 ms — and C immediately sees that the deadline is too close to start work, refuses fast, and the cascade dies in one round trip instead of three.
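A sketch of that difference, reusing the Deadline and DeadlineExceeded helpers defined in the code section below; attempt() and MIN_USEFUL_MS are illustrative.

# retry_with_budget.py — sketch: a retry spends what is left, never a fresh budget
MIN_USEFUL_MS = 20.0    # below this, no attempt can possibly finish: skip the retry entirely

async def call_with_retry(attempt, dl, max_attempts=3):
    """attempt(budget_ms) is an async callable; dl is a Deadline as defined below."""
    last_err = None
    for i in range(max_attempts):
        left = dl.remaining_ms()
        if left <= MIN_USEFUL_MS:                 # the cascade dies here, in one round trip
            raise DeadlineExceeded(f"retry {i}: only {left:.0f}ms left")
        try:
            return await attempt(budget_ms=left)  # e.g. 30 ms on the retry, not another full 100 ms
        except DeadlineExceeded as e:
            last_err = e                          # try again with whatever budget remains
    raise last_err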

The infinite-loop work-graph

Some service graphs have cycles you didn't realise existed. Search → ProductCatalog → Recommendations → Search is a real example from BharatBazaar's mid-2024 architecture review — a cyclic call where Recommendations occasionally needed catalog metadata that triggered the search path again. Without deadline propagation, a request entering the cycle could loop until some unrelated resource limit (a thread pool, a connection pool, memory) finally broke it. With deadlines, every hop subtracts from the budget; the cycle terminates when the budget reaches zero, regardless of whether the loop was supposed to exist. The deadline acts as a safety fence even for bugs the architects didn't anticipate, as the toy version below shows.
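A self-contained toy version of the fence; the service names mirror the BharatBazaar cycle, and the 3 ms per-hop cost is arbitrary.

# cycle_fence.py — sketch: a shrinking budget terminates a cycle no hop knows it is in
import time

def hop(name, next_hop, deadline_abs, hop_cost_ms=3.0):
    left_ms = (deadline_abs - time.monotonic()) * 1000
    if left_ms <= hop_cost_ms:
        return f"{name}: DEADLINE_EXCEEDED after budget exhausted"   # the fence
    time.sleep(hop_cost_ms / 1000)                                   # simulate local work
    return next_hop(deadline_abs)                                    # ...and call the next hop

def search(dl):          return hop("search", catalog, dl)
def catalog(dl):         return hop("catalog", recommendations, dl)
def recommendations(dl): return hop("recommendations", search, dl)

print(search(time.monotonic() + 0.050))   # 50 ms budget: the loop ends after roughly 16 hops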

Code: a deadline-aware RPC client with budget reservation

This is the smallest faithful implementation of deadline propagation: an asyncio-style client that carries an absolute deadline through a chain of calls, reserves local budget at each hop, and refuses to start downstream work that cannot finish in the time remaining.

# deadlines.py — propagating an absolute deadline through a call chain
import asyncio, time
from dataclasses import dataclass

@dataclass(frozen=True)
class Deadline:
    abs_ts: float                   # monotonic seconds, set once at the edge
    @staticmethod
    def in_(ms: float) -> "Deadline":
        return Deadline(time.monotonic() + ms / 1000)
    def remaining_ms(self) -> float:
        return max(0.0, (self.abs_ts - time.monotonic()) * 1000)
    def expired(self) -> bool:
        return self.remaining_ms() <= 0

class DeadlineExceeded(Exception): pass

async def call(name: str, dl: Deadline, work_ms: float, reserve_ms: float = 8.0):
    """Refuse work that cannot finish; otherwise simulate it under wait_for(remaining)."""
    left = dl.remaining_ms()
    if left <= work_ms + reserve_ms:
        print(f"  [{name}] REFUSE: left={left:.1f}ms, need {work_ms:.0f}+{reserve_ms:.0f}ms")
        raise DeadlineExceeded(name)
    print(f"  [{name}] start: left={left:.1f}ms, will work {work_ms:.0f}ms")
    try:
        await asyncio.wait_for(asyncio.sleep(work_ms / 1000), timeout=left / 1000)
    except asyncio.TimeoutError:
        raise DeadlineExceeded(name)
    print(f"  [{name}] done:  left={dl.remaining_ms():.1f}ms")

async def edge_request(total_budget_ms: float, db_work_ms: float):
    dl = Deadline.in_(total_budget_ms)              # set once at the front door
    print(f"\nedge: deadline={total_budget_ms:.0f}ms, db_work={db_work_ms:.0f}ms")
    try:
        await call("gateway",  dl, work_ms=12)
        await call("payments", dl, work_ms=18)
        await call("risk",     dl, work_ms=14)
        await call("scoring",  dl, work_ms=db_work_ms, reserve_ms=2)  # leaf, less reserve
    except DeadlineExceeded as e:
        print(f"  -> DEADLINE_EXCEEDED at {e}, total elapsed = "
              f"{(total_budget_ms - dl.remaining_ms()):.1f}ms")

async def main():
    await edge_request(total_budget_ms=200, db_work_ms=40)   # comfortably fits
    await edge_request(total_budget_ms=200, db_work_ms=160)  # leaf refuses

asyncio.run(main())

Sample run:

edge: deadline=200ms, db_work=40ms
  [gateway] start: left=200.0ms, will work 12ms
  [gateway] done:  left=188.0ms
  [payments] start: left=188.0ms, will work 18ms
  [payments] done:  left=170.0ms
  [risk] start: left=170.0ms, will work 14ms
  [risk] done:  left=156.0ms
  [scoring] start: left=156.0ms, will work 40ms
  [scoring] done:  left=116.0ms

edge: deadline=200ms, db_work=160ms
  [gateway] start: left=200.0ms, will work 12ms
  [gateway] done:  left=188.0ms
  [payments] start: left=188.0ms, will work 18ms
  [payments] done:  left=170.0ms
  [risk] start: left=170.0ms, will work 14ms
  [risk] done:  left=156.0ms
  [scoring] REFUSE: left=156.0ms, need 160+2ms
  -> DEADLINE_EXCEEDED at scoring, total elapsed = 44.0ms

Walkthrough. The line dl = Deadline.in_(total_budget_ms) is the only place a relative duration becomes an absolute monotonic instant — every subsequent hop reads dl.remaining_ms() and never has to know what the original budget was. The line if left <= work_ms + reserve_ms: raise DeadlineExceeded(name) is the fail-fast check: the leaf hop knows its own worst-case work (here, 160 ms for the slow database query) and refuses before doing anything that would acquire a connection or a row lock.

Why fail-fast at the leaf is more valuable than at the edge: the edge can return DEADLINE_EXCEEDED to the caller in either case — the user gets the same error. What changes is what the system did before the error. With fail-fast at the leaf, the database connection is returned to the pool unused; the row lock is never taken; the query plan cache is not polluted. This is the difference between burning 44 ms of wall time (case 2) and burning 156 ms plus a held DB lock plus a poisoned plan cache.

The reserve_ms=8.0 on intermediate hops and reserve_ms=2 on the leaf is the local-budget reservation pattern: at each hop, the caller subtracts what it expects to need post-downstream-return (response serialization, error formatting, the gRPC trailer write). Without that reservation, a hop could call downstream with the full remaining budget, the downstream could take exactly that long, and the caller's own deadline would have expired by the time it had a result to return.

Why monotonic clocks (time.monotonic()) and not wall-clock (time.time()): wall-clock is subject to NTP step changes, leap seconds, and (on bad VMs) jumps backwards by minutes. A deadline computed against wall-clock can travel forward in time and cause a hop to either start work it should have refused, or refuse work it should have done. Monotonic clocks are guaranteed to advance — never jump backwards — so remaining = abs_ts − monotonic_now() is always meaningful. gRPC, Envoy, Linkerd all use monotonic for deadlines internally; the wire-format duration in grpc-timeout is reconstructed to a monotonic absolute on receipt.

Common confusions

"A deadline is just a longer timeout." A timeout is a per-hop duration; a deadline is one absolute instant every hop reads. Setting "200 ms" at each hop independently recreates exactly the additive worst case the deadline exists to prevent.

"Propagate the remaining duration." Propagate the absolute instant, or reconstruct it the moment a duration arrives, as gRPC does. A duration that travels through queues, retries, and serialization drifts; an absolute instant does not.

"Forward the full inbound deadline downstream." Reserve local budget first. If the callee uses everything you sent, you have zero time left to serialize your own response.

"Compute deadlines against wall-clock time." Use monotonic clocks. Wall-clock can step under NTP and make remaining = deadline − now() lie in either direction.

"Fail-fast doesn't matter because the caller errors out either way." The user sees the same DEADLINE_EXCEEDED; what changes is whether the system spent CPU, connection slots, and row locks computing an answer nobody will read.

Going deeper

gRPC's grpc-timeout header and the monotonic-clock conversion

gRPC encodes the deadline on the wire as the grpc-timeout HTTP/2 header — a duration plus a unit, like 200m for milliseconds, 5S for seconds, or 1H for hours. The format is intentionally compact (1–8 ASCII bytes) because every gRPC call carries it. On the server side, the moment the header arrives in the HEADERS frame, the gRPC runtime computes abs_deadline = now_monotonic() + parsed_duration and stores the absolute instant on the request context. Every subsequent ctx.Deadline() call returns that absolute. When the server makes a downstream gRPC call, it serialises remaining = abs_deadline − now_monotonic() back into a grpc-timeout header — a new duration, computed on each outbound request. This round-trip preserves the absolute boundary even though the wire format is duration-based. The behaviour is specified in the gRPC core spec.
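A sketch of the two conversions; the unit letters (H, M, S, m, u, n) are from the gRPC over HTTP/2 spec, while error handling and the eight-digit limit check are omitted.

# grpc_timeout.py — sketch: grpc-timeout header <-> absolute monotonic deadline
import time

_UNIT_SECONDS = {"H": 3600.0, "M": 60.0, "S": 1.0, "m": 1e-3, "u": 1e-6, "n": 1e-9}

def inbound_to_absolute(header: str) -> float:
    """'178m' arriving in the HEADERS frame -> absolute monotonic deadline."""
    value, unit = int(header[:-1]), header[-1]
    return time.monotonic() + value * _UNIT_SECONDS[unit]

def outbound_from_absolute(abs_deadline: float) -> str:
    """Absolute monotonic deadline -> a fresh duration for the next hop's grpc-timeout."""
    remaining_ms = max(0, int((abs_deadline - time.monotonic()) * 1000))
    return f"{remaining_ms}m"

print(outbound_from_absolute(inbound_to_absolute("178m")))   # '178m', or a hair less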

Why per-call deadlines need a hedged-request budget on top

Tail-latency mitigation (hedged requests, see Dean & Barroso, "The Tail at Scale") sends a duplicate request to a second replica after the first has been outstanding for more than the p95 latency. The hedged second request inherits the same deadline as the original — which means as the deadline approaches, the hedge has progressively less time to complete. A naïve hedger sends the duplicate even when only 10 ms remain, and the duplicate has zero chance of beating the original. Production hedged-request systems (Google's own internal RPC stack, Cloudflare's hedge implementation) skip the hedge when remaining < hedge_minimum_useful_budget — typically the median latency of the downstream — because below that threshold the hedge is pure overhead.
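A sketch of a deadline-aware hedger along those lines; attempt, hedge_after_ms, and min_useful_ms are illustrative names, with the latter two taken from the downstream's latency distribution (the p95 and the median, respectively).

# hedge.py — sketch: hedge only while the duplicate still has a useful share of the budget
import asyncio, time

async def hedged_call(attempt, deadline_abs, hedge_after_ms, min_useful_ms):
    """attempt() is an async callable hitting one replica; both copies share the deadline."""
    first = asyncio.ensure_future(attempt())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_ms / 1000)
    if done:
        return first.result()                              # fast path: the hedge never fires
    left_ms = (deadline_abs - time.monotonic()) * 1000
    if left_ms < min_useful_ms:                            # too late: a hedge is pure overhead
        return await asyncio.wait_for(first, timeout=left_ms / 1000)
    second = asyncio.ensure_future(attempt())              # the duplicate inherits the same deadline
    done, pending = await asyncio.wait({first, second}, timeout=left_ms / 1000,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                                      # the loser is cancelled, not awaited
    if not done:
        raise asyncio.TimeoutError("deadline exceeded")
    return done.pop().result()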

Deadline-aware databases: PostgreSQL statement_timeout, MySQL MAX_EXECUTION_TIME

Most relational databases let you set a per-statement deadline at execution time, not just per-connection. Postgres's SET LOCAL statement_timeout = '178ms' inside a transaction caps how long the database will let that query run; if the query exceeds the timeout, the database cancels it server-side and returns an error, and the row locks the query held are released as part of the abort. Plumbing the propagated remaining-deadline into this per-statement timeout is the difference between Aditi's database burning 6 seconds on an abandoned scoring query and the same query being cancelled at 156 ms with the row locks released. The same pattern applies to MySQL (the MAX_EXECUTION_TIME(N) optimizer hint, in milliseconds), MongoDB (maxTimeMS), and Cassandra (per-request timeouts set in the driver).
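The same plumbing in MongoDB terms, as a sketch: fetch_features and remaining_ms are illustrative names, while Cursor.max_time_ms() is the real pymongo API behind maxTimeMS.

# mongo_budget.py — sketch: pass the propagated budget as MongoDB's per-operation maxTimeMS
def fetch_features(collection, merchant_id, remaining_ms):
    cursor = (collection.find({"merchant_id": merchant_id})
                        .max_time_ms(max(1, int(remaining_ms))))   # server aborts the operation past this
    return list(cursor)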

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
python3 deadlines.py

# To watch a real grpc-timeout header on the wire:
pip install grpcio grpcio-tools
# In one terminal: python3 -m grpc_tools.protoc --python_out=. --grpc_python_out=. echo.proto
# In another:      tcpdump -i lo -A -s 0 'tcp port 50051' | grep -i grpc-timeout

Where this leads next

Deadlines are the budget; what you do with the budget is the discipline of the remaining Part 4 chapters.

Beyond Part 4, deadlines surface in Part 7 (reliability patterns — retries, hedged requests, and circuit breakers all consult remaining), in Part 14 (distributed transactions — 2PC's prepare-phase has a deadline; coordinators that miss it must abort), and in Part 17 (geo-distribution — cross-region RPCs eat 80–250 ms of WAN RTT before any work starts, which forces front-door deadlines to be sized as WAN_RTT + per-region budget × hop_depth).

References

  1. gRPC over HTTP/2 — protocol specification — the canonical definition of the grpc-timeout header and how clients/servers convert between durations and absolute deadlines.
  2. "The Tail at Scale" — Dean & Barroso, CACM 2013 — establishes why per-request deadlines and hedged requests are mandatory in deeply-nested service graphs, with measurements from Google production.
  3. "Cancellation, Context, and Plumbing" — Sameer Ajmani, Go Blog, 2014 — the original write-up of Go's context.Context design, including why deadlines propagate explicitly through every function signature.
  4. "Timeouts, retries and backoff with jitter" — Marc Brooker, AWS Architecture Blog — operational guidance on picking timeout values from latency distributions, with Amazon production data.
  5. Envoy proxy — timeouts and deadlines documentation — how Envoy enforces per-route deadlines and surfaces gRPC-Status: 4 (DEADLINE_EXCEEDED) to clients.
  6. "The Hyrum's Law of Retry Budgets" — Marc Brooker — analysis of how retries and deadlines interact, including the budget arithmetic for retry-amplification cascades.
  7. RPC semantics: at-most-once, at-least-once, exactly-once — internal companion. Why a retry that misses its deadline must be skipped, not attempted.
  8. The fallacies of distributed computing revisited — internal companion. Deadlines are the operational refutation of fallacy #2 ("latency is zero") and fallacy #7 ("transport cost is zero").