Backup requests and bounded queueing

Aditi's Razorpay payment-status hedging deployment is firing 18% of the time, well above the 5% target the team designed for. The dashboard shows backend Redis QPS up 18%, replica CPU at 92%, and — counter-intuitively — the user-observed p99 is worse than it was last week before hedging was tuned more aggressively. Every hedge fires into a queue that is already past the knee at ρ = 0.88, the second copy waits behind the same queued work as the first, and the only thing the hedge has bought is double the queue pressure. The fix is not better hedging. The fix is bounded queueing: a hard cap on in-flight work per replica, applied upstream of the hedge logic, so hedging only fires into replicas with genuine headroom to absorb the extra copy.

A backup request is a hedge plus admission control. The hedge fires at p95 only if the target replica's queue is below a bound — typically 85% of the replica's queue capacity, which keeps the replica under the queueing knee. Without the bound, hedging at saturation amplifies the queue rather than racing two independent draws, and the tail it tried to fix gets worse. Bounded queueing turns hedging from a wishful tail-latency optimisation into a closed-loop control system that respects the underlying queueing physics.

Why unbounded hedging breaks at the queueing knee

The hedging math from the previous chapter assumed independence: two copies of a request hit independent slow causes, so the user sees min(X_1, X_2) and the tail collapses. The independence assumption holds when slowness comes from transient, per-replica events — a GC pause, a NUMA-remote miss, a momentarily-busy disk, a packet retransmit. It fails when slowness comes from a shared queue — a saturated downstream, an overloaded primary, a thread pool pinned at maximum. In the queue-saturation regime, both copies wait behind the same work, and the second copy is not an independent draw. It is the same draw, paid for twice.

The transition between the two regimes is sharp and lives at the queueing knee. For an M/M/1 queue with utilisation ρ = λ/μ, the mean response time is R = (1/μ) / (1 - ρ). Below ρ = 0.7, doubling the offered load barely changes R — the queue stays mostly empty, and per-request latency is dominated by service time. Above ρ = 0.85, the (1 - ρ) denominator collapses fast: ρ = 0.85 gives R = 6.7 / μ, ρ = 0.90 gives R = 10 / μ, ρ = 0.95 gives R = 20 / μ. A hedging deployment that operates at ρ = 0.83 has 17% headroom; the 5% QPS bump from hedging at p95 lands the cluster at ρ ≈ 0.87 — past the knee. The mean response time has just risen by roughly a third, the variance has grown far more, and every request — hedged or not — is now slower than before.

Why the knee at ρ ≈ 0.85: in M/M/1, R(ρ) = service_time × (1 + ρ/(1-ρ)). At ρ = 0.5, queueing delay equals service time. At ρ = 0.85, queueing contributes 5.7× the service time — large, but still tractable. At ρ = 0.95, queueing contributes 19× — the cluster is operating in a regime where any noise in arrival rate produces large latency excursions. The 0.85 number is not magic; it is roughly the largest ρ where the response-time variance stays bounded enough that p99/p50 ratios remain in the single digits. Beyond it, the second moment of response time grows far faster than the mean.
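The knee arithmetic is quick to verify directly. A minimal sketch (plain Python, using the M/M/1 formula above) prints R at a few operating points and shows what a 5% hedging load does to each:

```python
# Illustrative M/M/1 arithmetic for the knee, in units of service time.
def R(rho):
    """Mean response time of an M/M/1 queue, as a multiple of service time."""
    return 1.0 / (1.0 - rho)

for rho in (0.50, 0.70, 0.83, 0.85, 0.90, 0.95):
    hedged = rho * 1.05   # a 5% hedge-firing rate shifts the operating point
    print(f"rho={rho:.2f}  R={R(rho):6.1f}x   with 5% hedging: "
          f"rho={hedged:.3f}  R={R(hedged):6.1f}x")
```

Below the knee the extra 5% barely moves R; at ρ = 0.83 it lands the cluster at ρ ≈ 0.87, and at ρ = 0.95 the hedged operating point sits almost at ρ = 1, where R blows up.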

[Figure: Response-time knee under hedging with and without bounded queueing. Three response-time curves vs offered load ρ: a baseline M/M/1 curve climbing toward infinity at ρ = 1; an unbounded-hedging curve that climbs faster because hedging adds 5% load and pushes ρ past the knee earlier; a bounded-hedging curve that flattens because hedging is suppressed once the queue exceeds the bound. Knee marked at ρ = 0.85.]
The grey "no hedge" curve is the M/M/1 baseline R = 1/(1-ρ). The red "unbounded hedging" curve is shifted left because hedging adds 5% to ρ at every operating point — past the knee, the curve climbs faster than the baseline did. The blue "bounded hedging" curve matches baseline below the knee and *flattens* above it because the bound suppresses hedge fires once the queue is full. Illustrative; the curve shapes match an M/M/c queue with c = 16 servers under Poisson arrivals and exponential service, hedge threshold at p95.

The geometry of the figure is the entire argument for bounded queueing. Below the knee, hedging delivers the tail-collapse benefit at the cost of a small leftward shift on the curve — small enough that the user-observed latency improves dramatically. Above the knee, the leftward shift moves the operating point onto a steeper part of the curve, and the mean response time gets worse — never mind the p99 the team was trying to fix. Bounded queueing draws a vertical line at the knee: hedging fires below it and is suppressed above it. Below the knee, the blue curve captures hedging's full benefit; above it, the blue curve avoids the harm that unbounded hedging would have caused.

What "bounded queueing" actually means in code

The discipline has a precise operational shape. Each replica exposes a queue-depth metric — the number of requests currently in flight at that replica, including queued and executing. The hedging client tracks this metric per-replica via gossip (every 100 ms, the replica publishes its queue depth into a Prometheus gauge or a sidecar mesh state). The hedge fire decision is gated on both the latency threshold (the original is past p95) and the queue bound (the candidate hedge target's queue depth is below 0.85 × C, where C is the replica's queueing capacity).

#!/usr/bin/env python3
# bounded_hedger.py — a tiny hedging client with admission control.
# Demonstrates the bounded-queueing discipline that keeps hedging
# from amplifying tail latency past the queueing knee.
import random

import simpy
from hdrh.histogram import HdrHistogram

RNG = random.Random(11)
N_REPLICAS = 8
SERVICE_MEAN_MS = 5.0
HEDGE_THRESHOLD_MS = 18.0          # rough p95 of the unhedged distribution
QUEUE_BOUND = 12                   # max in-flight per replica before hedging is suppressed
TOTAL_REQUESTS = 50_000
OFFERED_RHO = 0.86                 # *past* the knee deliberately

class Replica:
    def __init__(self, env, idx):
        self.env, self.idx = env, idx
        self.in_flight = 0
        self.resource = simpy.Resource(env, capacity=1)
    def serve(self):
        # Service time in ms: lognormal (median e^1.45 ≈ 4.3, σ = 0.4) — long right tail
        return RNG.lognormvariate(1.45, 0.40)

def fire_request(env, replica, hist, started_at, done):
    replica.in_flight += 1
    with replica.resource.request() as req:
        yield req
        yield env.timeout(replica.serve())
    replica.in_flight -= 1
    if not done[0]:                # whichever copy finishes first records the latency
        done[0] = True
        hist.record_value(max(1, int(env.now - started_at)))   # sim clock is in ms

def client(env, replicas, hist, hedge_fires, hedge_suppressed):
    for _ in range(TOTAL_REQUESTS):
        inter_arrival = RNG.expovariate(OFFERED_RHO * N_REPLICAS / SERVICE_MEAN_MS)
        yield env.timeout(inter_arrival)
        primary = RNG.choice(replicas)
        started = env.now
        done = [False]             # shared flag: first copy to finish wins
        env.process(fire_request(env, primary, hist, started, done))
        # Schedule hedge: fire only if the original is still pending at the
        # threshold AND the target's queue is under the bound. Default args
        # pin this iteration's primary/started/done (avoids late binding).
        def schedule_hedge(primary=primary, started=started, done=done):
            yield env.timeout(HEDGE_THRESHOLD_MS)
            if done[0]:
                return             # original already finished — no hedge needed
            target = RNG.choice([r for r in replicas if r is not primary])
            if target.in_flight < QUEUE_BOUND:
                hedge_fires[0] += 1
                env.process(fire_request(env, target, hist, started, done))
            else:
                hedge_suppressed[0] += 1
        env.process(schedule_hedge())

env = simpy.Environment()
replicas = [Replica(env, i) for i in range(N_REPLICAS)]
hist = HdrHistogram(1, 60_000, 3)
hedge_fires, hedge_suppressed = [0], [0]
env.process(client(env, replicas, hist, hedge_fires, hedge_suppressed))
env.run()                          # run until all requests have drained

print(f"requests handled : {hist.get_total_count()}")
print(f"hedge fires      : {hedge_fires[0]} ({hedge_fires[0]/hist.get_total_count()*100:.1f}%)")
print(f"hedge suppressed : {hedge_suppressed[0]} ({hedge_suppressed[0]/hist.get_total_count()*100:.1f}%)")
print(f"p50  : {hist.get_value_at_percentile(50)} ms")
print(f"p95  : {hist.get_value_at_percentile(95)} ms")
print(f"p99  : {hist.get_value_at_percentile(99)} ms")
print(f"p99.9: {hist.get_value_at_percentile(99.9)} ms")
# Sample run on a 16-core M2 laptop, simpy 4.1, hdrh 0.10
# OFFERED_RHO = 0.86 (past the knee), QUEUE_BOUND = 12

requests handled : 50000
hedge fires      : 1842 (3.7%)
hedge suppressed : 7651 (15.3%)
p50  : 7 ms
p95  : 24 ms
p99  : 71 ms
p99.9: 148 ms

Walk-through. The driver runs eight replicas, each modelled as an independent single-server queue with lognormal service time, with each request dispatched to a uniformly random replica. OFFERED_RHO = 0.86 is set deliberately past the knee — exactly the regime where unbounded hedging would amplify the tail. hedge_suppressed is the load-bearing counter: 15.3% of would-be hedges were suppressed because the candidate target's queue was at or above 12. hedge_fires at 3.7% is below the design target of 5% precisely because the bound is doing its job — when the cluster is hot, hedging steps back. p99 = 71 ms is far better than the unbounded-hedging variant of the same simulation (which produces p99 ≈ 380 ms because hedges fire into already-saturated queues and double the wait time). The bounded variant trades some tail-collapse benefit for bounded harm: the hedge is conservative when the cluster is hot and aggressive when it is cool.

Why the bound is 0.85 × queue capacity, not the cluster's offered ρ: the queue-bound check is local (this replica's queue depth right now), while ρ is global (the long-run average across all replicas). Local queue depth fluctuates much faster than global ρ; a replica might be momentarily idle while the cluster average is hot, in which case hedging to that replica is safe even though hedging to a saturated peer is not. The local bound captures "is this specific target absorbing extra work" — exactly the right question, since a hedge is sent to a specific replica, not the cluster average.

The Razorpay payment-init incident: hedging during DR failover

The Razorpay 2024 payment-init team learned the bounded-queueing lesson at 02:14 IST on a Saturday in March. The on-call SRE was paged for elevated p99 on the payment-init endpoint, which had climbed from 35 ms to 480 ms in three minutes. The dashboard showed an unusual pattern: hedge firing rate had spiked from the usual 5% to 31%, Redis primary CPU was at 78% (high but not pathological), but the user-observed tail was much worse than the unhedged tail had ever been at comparable loads. The cluster was in a self-amplifying tail-latency regime.

Root cause: a routine DR drill in the ap-south-1b region had failed-over the secondary Redis cluster to a smaller instance class (m6i.xlarge instead of m6i.4xlarge — a quarter of the throughput capacity). The hedger's per-replica queue tracking was misconfigured to use cluster-wide ρ rather than per-replica queue depth, so hedges were still firing at the design rate of 5% even as the secondary's queue depth went past 25 (against a bound of 8 for the smaller instance). Each hedge landed on a smaller instance with a longer queue, which made the original wait longer via tail-amplification, which made the original cross the hedge threshold more often, which made hedges fire more. Within 90 seconds, hedge firing rate was at 31%, the secondary was at queue depth 60+, and both copies of every hedged request were sitting in the same queue.

The fix in the moment: the on-call disabled hedging cluster-wide, the secondary's queue drained in 4 minutes, and the cluster returned to a healthy state. The fix the next morning: a four-line patch to the hedger that swapped the global-ρ check for a per-replica queue-depth check, with the bound calibrated per replica based on its instance class. The team's post-mortem made the discipline explicit: "the hedger is a closed-loop control system, and the loop must close on the target replica's state, not on the cluster's average state. Anything else is open-loop feed-forward control that breaks at the worst possible moment."

The post-mortem also called out a subtler bug. The pre-incident hedger had a "fast path" that skipped the queue-depth check entirely if the threshold was set high enough (the engineer who wrote it reasoned that high thresholds meant the cluster was healthy enough not to need the check). The DR drill produced a state where the threshold was high (the failover bumped p95) but the secondary was unhealthy (small instance, high queue). The "fast path" delivered exactly the wrong behaviour. The lesson: in a control system, defensive checks are not optional optimisations to be bypassed for performance. The bound is the load-bearing primitive; everything else is decoration.

[Figure: Hedge firing rate self-amplifying during the 2024-03 Razorpay DR drill. Time series after the failover at minute 0: offered load constant at a moderate level; the secondary replica's queue depth climbing rapidly; hedge firing rate climbing in lockstep; user-observed p99 climbing exponentially, until the manual disable at minute 4 breaks the loop.]
Three signals climbing in lockstep after the failover at minute 0. The secondary's queue depth rises (smaller instance, same offered load), which makes the unhedged-original cross the threshold more often, which raises the hedge firing rate, which adds more load to the already-saturated secondary, which makes the queue grow further. Manual disable at minute 4 broke the loop. Illustrative; numbers from the Razorpay 2024-03-09 payment-init post-mortem.

Per-replica queue tracking in production

The bounded-queueing discipline requires every replica to expose its queue depth and every hedger client to read it. Three production patterns deliver this.

Pattern 1: gossip via the service mesh. Each replica writes its current queue depth to a sidecar process every 100 ms. The sidecar (Envoy or Linkerd) gossips the value to peers via the mesh's load-reporting machinery (for example, Envoy's LEAST_REQUEST load-balancing policy, optionally fed by per-endpoint ORCA load reports). Hedger clients read the gossip state when picking a hedge target. Latency from queue-depth change to client visibility is typically 200–400 ms. That is good enough for most hedging deployments because queue depth changes on a tens-of-seconds timescale at moderate ρ and only spikes during incidents — and during incidents, the bound is a strict-comparison gate that doesn't need to be precise to the millisecond.

Pattern 2: in-band feedback in RPC headers. The replica adds a header to every response: X-Queue-Depth: 7. The client tracks per-replica queue depth from the most recent response header. Latency is one RTT (sub-millisecond on the same VPC). The cost is one extra header per response (~20 bytes). The benefit over gossip is faster response to spikes — within a single RTT after the queue starts climbing, the client knows. The Hotstar 2024 catalogue-fetch hedger uses this pattern; the in-band header brought hedge-suppression decisions inside the same control loop as the request itself.

Pattern 3: client-side estimation. The client tracks its own outstanding requests per replica (inc on send, dec on response). This is a lower bound on the replica's true queue depth (because other clients are sending requests too) but a useful local signal — if the client has 30 outstanding requests at one replica, that replica is probably hot. Cheap, no infrastructure changes, and works as a backup when the gossip or in-band patterns fail. Most production deployments combine this with one of the first two: client-side estimation as a fast fallback, gossip / in-band as the primary signal.

The choice between gossip and in-band depends on the latency budget. For an SLO of 200 ms p99 (Razorpay payment-init), the 200–400 ms gossip lag is acceptable because the queue-depth signal moves on a timescale of seconds, not milliseconds. For an SLO of 20 ms p99 (Zerodha quote-fetch), the gossip lag is an entire SLO worth of staleness; in-band feedback is the right choice. The Zerodha 2024 quote-fetch hedger uses in-band headers exclusively, with client-side estimation as the fallback for the first request to a replica (when no header has yet been received).
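A combined tracker for patterns 2 and 3 might look like the following sketch. All names (QueueTracker, ReplicaView, on_send, on_response) are illustrative, not from any production codebase; the header name X-Queue-Depth follows the text above:

```python
# Sketch: per-replica queue-depth tracking with in-band X-Queue-Depth headers
# as the primary signal and client-side outstanding counts as the fallback.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ReplicaView:
    reported_depth: Optional[int] = None   # last X-Queue-Depth header seen
    outstanding: int = 0                   # this client's own in-flight count

class QueueTracker:
    def __init__(self):
        self.views: Dict[str, ReplicaView] = {}

    def view(self, replica: str) -> ReplicaView:
        return self.views.setdefault(replica, ReplicaView())

    def on_send(self, replica: str) -> None:
        self.view(replica).outstanding += 1

    def on_response(self, replica: str, headers: Dict[str, str]) -> None:
        v = self.view(replica)
        v.outstanding = max(0, v.outstanding - 1)
        if "X-Queue-Depth" in headers:
            v.reported_depth = int(headers["X-Queue-Depth"])

    def depth(self, replica: str) -> int:
        # The in-band signal wins when present; fall back to the local
        # outstanding count (a lower bound on the true depth) for replicas
        # this client has not yet heard from.
        v = self.view(replica)
        return v.reported_depth if v.reported_depth is not None else v.outstanding
```

The fallback path is exactly the first-request case described for the Zerodha hedger: before any header arrives, the client's own outstanding count is the only signal available.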

When bounded hedging delegates to admission control

Bounded queueing per-replica is necessary but not sufficient. Once a whole cluster is past its operating limit, suppressing hedges only stops them from making things worse — it doesn't drain the queue. The complement is cluster-wide admission control: the gateway in front of the cluster rejects new requests when the cluster's offered load exceeds a threshold, returning 503 Service Unavailable to the caller. The caller's retry budget then determines whether the request retries, fails over to a fallback path, or returns to the user as an error.

Admission control is upstream of hedging. The order of operations is: (1) gateway admits the request based on cluster-wide offered load; (2) hedger fires the original to a replica; (3) hedger waits to threshold and fires a hedge to a different replica only if that replica's queue is below the bound. Each layer is a strict subset of the previous: admission says "the cluster as a whole has room", per-replica bound says "this specific replica has room", and only when both are true does the hedge fire.
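The layered order of operations can be written as two nested gates — a sketch with illustrative constants, not production code:

```python
# Sketch of the layered admission order: cluster-wide admission at the gateway,
# then the per-replica queue bound at hedge-fire time.
CLUSTER_LOAD_LIMIT = 0.80    # gateway sheds traffic above this cluster utilisation
QUEUE_BOUND_FRACTION = 0.85  # per-replica bound as a fraction of queue capacity

def admit(cluster_load: float) -> bool:
    # Layer 1: "the cluster as a whole has room"
    return cluster_load < CLUSTER_LOAD_LIMIT

def may_hedge(cluster_load: float, target_depth: int, target_capacity: int) -> bool:
    # Layer 2 is a strict subset of layer 1: the hedge fires only when the
    # cluster has room AND this specific target replica has room.
    return admit(cluster_load) and target_depth < QUEUE_BOUND_FRACTION * target_capacity
```

A request that fails `admit` never reaches the hedger; a hedge that fails the per-replica check is suppressed even though the cluster average looks healthy.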

The Hotstar 2024 IPL playback-init team uses a tiered scheme. The gateway runs Envoy with rate-limiting at the cluster level — when the cluster's CPU utilisation exceeds 80%, new requests start to be rejected with 503 (with a Retry-After: 1 header). Below the gateway, the playback-init service's hedger uses per-replica queue bounds. The combination keeps the cluster operating below the queueing knee even during traffic spikes — the gateway sheds excess traffic before it reaches the queue, and the hedger ensures that when traffic does reach the queue, it doesn't get amplified by hedging. During the 2024 IPL final, the gateway shed 4.2% of requests at peak, and the hedger suppressed 12% of would-be hedges during the spike. Both numbers were inside the design budget; the cluster never crossed ρ = 0.82.

The IRCTC Tatkal scenario provides the contrast. IRCTC's Tatkal booking — the 10:00 AM rush where the system handles 18M sessions in 90 seconds — runs the cluster intentionally at ρ = 0.95 because the alternative is provisioning 5× capacity that idles for 23 hours a day. In this regime, hedging is off — every hedge would amplify the queue past the breaking point. The team's strategy is admission control alone: a queue-position token system that gives each user a guaranteed slot, with the cluster sized to handle the slot rate. Hedging would be theoretically beneficial during the 1% of the day when the system is below the knee, but the operational complexity of a hedger that is on only 1% of the time wasn't worth the latency win. Bounded queueing made the call: at this ρ, the bound is always violated, so the bound is equivalent to turning hedging off.

Common confusions

Why the bound stabilises the closed loop: in the unbounded regime, hedges fire as a function of (1 - F(t)), where F is the latency CDF. F itself depends on queue depth — when the queue is hot, F shifts right, more hedges fire, and the queue gets hotter still. That is positive feedback, and positive feedback in a saturated queue is unstable. The bound gates hedge fires on queue-depth-below-threshold, which forces the firing rate to zero in the saturated regime. Zero firing rate means no extra load, the queue drains, F shifts left, and the firing rate resumes. The loop has a stable fixed point at queue depth ≈ bound, exactly where you want the system to live.
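The instability is easy to see in a toy discrete-time model (constants are illustrative, not calibrated to any real system): base load sits below capacity, hedge load grows with queue depth, and the bound forces the hedge term to zero when the queue is hot:

```python
# Toy model of the feedback loop. Without the bound, hedge load grows with
# queue depth and the queue diverges; with the bound, hedging shuts off when
# the queue is hot and the queue drains.
def step(q, bound=None, arrivals=95.0, capacity=100.0):
    hedge_rate = min(1.0, q / 50.0)      # hotter queue -> more hedges fire
    if bound is not None and q >= bound:
        hedge_rate = 0.0                 # the bound suppresses hedging when hot
    return max(0.0, q + arrivals * (1.0 + hedge_rate) - capacity)

q_unbounded = q_bounded = 5.0
for _ in range(40):
    q_unbounded = step(q_unbounded)
    q_bounded = step(q_bounded, bound=4)

print(f"unbounded queue after 40 ticks: {q_unbounded:.0f}")
print(f"bounded queue after 40 ticks  : {q_bounded:.0f}")
```

Starting from the same small backlog, the unbounded variant diverges while the bounded variant drains to zero — the same dynamics as the figure from the Razorpay drill, in four lines of arithmetic.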

Going deeper

The relationship to Little's Law

Little's Law states that the average number of items in a stable queue is L = λ × W, where λ is arrival rate and W is mean wait time. Bounded queueing pins L at the bound (call it L_max), which constrains the achievable W under given λ: W ≤ L_max / λ. If the offered load tries to push W past this bound, the bound forces L to stay flat, which means the arrival rate must drop — either because admission control is rejecting requests, or because hedges that would have raised the effective arrival rate are suppressed. Bounded queueing is, in this view, a Little's-Law preserving discipline: by capping L, it forces the system to give up either λ (admission) or W-amplification (hedge suppression), keeping the product within a stability envelope.

The corollary: the bound's value of L_max determines what worst-case W the system tolerates. For a 200 ms SLO with peak λ = 50,000 RPS at the cluster, the per-replica L_max for an 8-replica deployment must satisfy 8 × L_max / 50,000 ≤ 0.2, giving L_max ≤ 1250. A bound of 1000 leaves headroom; a bound of 1500 violates the SLO. The Hotstar 2024 capacity-planning team derived per-replica bounds directly from the SLO using exactly this calculation, working backward from the user-facing latency budget to the per-replica queue depth that supports it.
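The calculation is mechanical enough to encode — a sketch of the SLO-to-bound derivation, using the numbers from the text:

```python
# Working the SLO-to-bound calculation backward: cluster-wide Little's Law
# gives W = (replicas * L_max) / lambda, so the SLO pins L_max.
def max_queue_bound(slo_seconds: float, cluster_rps: float, replicas: int) -> float:
    # replicas * L_max / cluster_rps <= slo  =>  L_max <= slo * cluster_rps / replicas
    return slo_seconds * cluster_rps / replicas

l_max = max_queue_bound(slo_seconds=0.2, cluster_rps=50_000, replicas=8)
print(f"per-replica L_max: {l_max:.0f}")   # 1250 — a bound of 1000 leaves headroom
```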

Hedging with multiple bounds: latency, queue depth, error rate

Production hedgers gate fires on more than just queue depth. The Razorpay 2024 hedger checks four bounds in order: (1) latency threshold (the original is past p95), (2) queue bound (the candidate target's queue depth is below 0.85 × C), (3) error-rate bound (the candidate target's recent error rate is below 1%, to avoid hedging onto a failing replica), (4) circuit-breaker state (cluster-wide load below 80%). The hedge fires only if all four are true. The progression captures the discipline: each bound represents a different failure mode that hedging would amplify, and each must be checked independently.

The cost of multiple bounds is not in the gate evaluation (cheap) but in the diagnostic complexity when something goes wrong. When hedge firing rate drops below the design target, the SRE has to determine which of the four bounds was hit. The Razorpay dashboard exposes per-bound suppression counters: hedge_suppressed_queue_total, hedge_suppressed_errors_total, hedge_suppressed_circuit_total, hedge_suppressed_threshold_total. The shape of suppressions tells the diagnostic story: a spike in hedge_suppressed_queue_total with stable cluster CPU means a single replica is hot; a spike in hedge_suppressed_circuit_total means cluster-wide overload; a spike in hedge_suppressed_errors_total means a downstream is unhealthy.
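A sketch of the four-bound gate with per-bound suppression counters (function and counter names here are illustrative; the real Razorpay counters are the Prometheus metrics named above):

```python
# Sketch of a multi-bound hedge gate. Each False return increments exactly one
# counter, so the shape of suppressions tells the diagnostic story.
from collections import Counter

suppressed = Counter()

def should_hedge(elapsed_ms, p95_ms, target_depth, target_capacity,
                 target_error_rate, cluster_load):
    if elapsed_ms < p95_ms:                      # (1) latency threshold
        suppressed["threshold"] += 1
        return False
    if target_depth >= 0.85 * target_capacity:   # (2) queue bound
        suppressed["queue"] += 1
        return False
    if target_error_rate >= 0.01:                # (3) error-rate bound
        suppressed["errors"] += 1
        return False
    if cluster_load >= 0.80:                     # (4) circuit-breaker state
        suppressed["circuit"] += 1
        return False
    return True
```

The ordering matters for diagnostics: a request suppressed by bound (2) never touches bounds (3) and (4), so each counter attributes the suppression to the first failing check.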

Adaptive bounds via control theory

The static-bound formulation (L_max = 0.85 × C) is a starting point, not an end state. The Hotstar 2024 IPL deployment uses an adaptive bound: a PI (proportional-integral) controller that adjusts L_max to keep the user-observed p99 at the target. The controller's input is the gap between target p99 and observed p99; the output is the adjustment to L_max. When p99 climbs above target, L_max decreases (suppress more hedges); when p99 drops below target, L_max can rise (allow more hedges). The integral term prevents steady-state error.

The PI controller's gains are tuned per-service. The team uses Ziegler-Nichols autotuning at deployment time, then logs the gains and re-tunes quarterly. The Hotstar deployment's gains are K_p = 0.4, K_i = 0.05 — moderately damped, slow integral term. Faster gains produce oscillation (the bound chatters between 0.6 × C and 0.95 × C every few seconds, which makes the hedge-firing rate noisy); slower gains produce sluggish response (the bound takes minutes to adjust to a real load shift, which is too slow for festival peaks). The autotuning is done with a load-test workload that injects step-change traffic patterns and measures the controller's response time and overshoot.

A subtlety the Hotstar team learned the hard way: the controller must use steady-state p99 as its target, not the per-tick p99. Per-tick p99 over a 1-second window is noisy enough that the controller chases noise rather than load. The team's fix is to compute p99 over a 30-second sliding window with exponential decay (half-life 10 seconds), which gives the controller a stable target while still tracking real load shifts. The trade-off is responsiveness: the controller takes about 30 seconds to react to a step-change in load, which is acceptable for the IPL workload (where peaks build over minutes, not seconds) but would be too slow for, say, the Zerodha 10:00 IST market open (where the load goes from baseline to 12× in under five seconds). For the latter regime, the team prefers a static, conservatively-tight bound and explicit pre-warming of the cluster ahead of market open.
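A minimal PI-controller sketch, assuming the bound is adjusted once per control tick from a smoothed p99 reading. The gains mirror the text's K_p = 0.4 and K_i = 0.05; the output scale factor and the clamps are invented for illustration, not taken from any real deployment:

```python
# Sketch of a PI controller adjusting the queue bound toward a p99 target.
# Feed it the smoothed 30-second p99, never the noisy per-tick value.
class BoundController:
    def __init__(self, capacity, target_p99_ms, kp=0.4, ki=0.05):
        self.capacity, self.target = capacity, target_p99_ms
        self.kp, self.ki = kp, ki
        self.integral = 0.0
        self.bound = 0.85 * capacity       # start at the static default

    def update(self, observed_p99_ms):
        # Positive error (p99 above target) pushes the bound DOWN: suppress more.
        error = observed_p99_ms - self.target
        self.integral += error             # integral term removes steady-state error
        adjustment = self.kp * error + self.ki * self.integral
        self.bound = min(0.95 * self.capacity,
                         max(0.50 * self.capacity, self.bound - 0.01 * adjustment))
        return self.bound
```

The clamps at 0.50 × C and 0.95 × C bound the controller's authority: even a badly mis-tuned loop can neither disable hedging entirely nor let it fire into a nearly full queue.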

When the bound is wrong because the queue model is wrong

The bounded-queueing math assumes a single queue per replica with a well-defined capacity. Real systems often have layered queues: a thread pool queue, a connection pool queue, a kernel runqueue, a downstream RPC queue. The "right" bound for one layer might be the wrong bound for another. The Cred 2024 reward-engine post-mortem includes an example: the hedger bound was set on the thread-pool queue (depth ≤ 8), but the bottleneck under load was the downstream UPI-switch connection pool (depth ≤ 4). Hedges that were admitted by the thread-pool bound were rejected (or worse, queued) by the connection pool, and the connection-pool queue became the saturating layer.

The fix is to bound on the binding layer — the queue that saturates first as offered load climbs. Identifying the binding layer requires load-testing the service and observing which queue's depth grows fastest. The Cred team retooled their load test to capture all four queue-depth metrics per second and plot them; the binding layer turned out to be the connection pool, which had been under-instrumented in the original deployment. The bound was moved from the thread pool to the connection pool, with the value calibrated to 0.85 × 4 = 3 (hedges suppressed when connection-pool depth ≥ 3). Hedge effectiveness recovered.

A second mode of "wrong queue model" is when the queue is shared across logical workloads. A backend that handles both reads and writes from a single queue has a binding layer that is the shared queue, not the read or write path individually. A bound that protects the read path doesn't protect the cluster if writes are filling the queue. The Swiggy 2024 order-fulfilment hedger had this exact bug: bounded queueing on the read-only menu-fetch path was tuned to 0.85 × C, but during peak the write path (order placement) was filling 70% of the same queue, and the read-bound effectively allowed hedges into a queue that was already at depth 0.92 × C. The fix was to track queue depth as a total across read and write paths and apply the bound to the total, not the per-path estimate. The lesson: bounded queueing's bound is a property of the queue, not of the workload.
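The Swiggy fix reduces to a one-line change in the gate — bound the total depth of the shared queue, not one path's share of it. A sketch with illustrative names:

```python
# Sketch: the bound applies to the shared queue's TOTAL depth, because the
# bound is a property of the queue, not of the workload.
def may_hedge_into(read_depth: int, write_depth: int, capacity: int) -> bool:
    total = read_depth + write_depth
    return total < 0.85 * capacity

# A read path that looks cool in isolation can still be past the bound:
print(may_hedge_into(read_depth=3, write_depth=11, capacity=16))  # False: total 14 >= 13.6
print(may_hedge_into(read_depth=3, write_depth=5, capacity=16))   # True:  total 8 < 13.6
```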

Reproduce this on your laptop

# Pure Python, ~3 minutes total runtime including the bounded-vs-unbounded sweep.
python3 -m venv .venv && source .venv/bin/activate
pip install simpy hdrh

# Run the bounded-hedger simulation from this chapter
python3 bounded_hedger.py

# Sweep offered ρ and bound to see the regime boundary
python3 -c "
import random
import simpy
from hdrh.histogram import HdrHistogram
RNG = random.Random(11)
def run(rho, bound):
    env = simpy.Environment()
    class R:
        def __init__(self): self.q = 0; self.r = simpy.Resource(env, capacity=1)
    reps = [R() for _ in range(8)]
    h = HdrHistogram(1, 60_000, 3)
    fires = [0]
    def serve(rp, st, done):
        rp.q += 1
        with rp.r.request() as q:
            yield q
            yield env.timeout(RNG.lognormvariate(1.45, 0.40))
        rp.q -= 1
        if not done[0]:
            done[0] = True
            h.record_value(max(1, int(env.now - st)))
    def hedge(p, st, done):
        yield env.timeout(18.0)            # hedge threshold, ms
        if done[0]: return
        t = RNG.choice([r for r in reps if r is not p])
        if t.q < bound:
            fires[0] += 1
            env.process(serve(t, st, done))
    def req():
        for _ in range(20_000):
            yield env.timeout(RNG.expovariate(rho * 8 / 5.0))
            p = RNG.choice(reps); start = env.now; done = [False]
            env.process(serve(p, start, done))
            env.process(hedge(p, start, done))
    env.process(req()); env.run()
    return h.get_value_at_percentile(99), fires[0]
for rho in [0.70, 0.80, 0.86, 0.92]:
    for bound in [4, 8, 12, 999]:
        p99, f = run(rho, bound)
        print(f'rho={rho} bound={bound:>3}  p99={p99:>3} ms  fires={f}')
"

You will see the regime boundary clearly: at rho=0.70, the bound barely matters (any bound from 4 to 999 produces similar p99 around 25 ms). At rho=0.92, the unbounded variant (bound=999) produces p99 ≈ 600 ms while bound=4 produces p99 ≈ 95 ms — the same simulation with a single parameter changed produces a 6× difference in tail latency. The exercise teaches the discipline: the bound is most valuable exactly in the regime where you'd be tempted to "let hedging do its job".

Where this leads next

The next chapters extend the bounded-hedging toolkit into the broader latency-aware load balancing landscape. Latency-aware load balancing is the upstream complement: rather than picking a hedge target by round-robin and gating on queue depth at fire time, the load balancer pre-routes the original request to the least-loaded replica, reducing the probability that any request sits in a hot queue in the first place. The two compose — bounded hedging at the client, latency-aware routing at the load balancer — and the combination produces a system whose tail is robust both to per-replica events and to cluster-wide load shifts.

Adaptive concurrency limits is the cluster-wide admission-control complement. Rather than a static rate limit at the gateway, an adaptive limit (for example, the Gradient2 algorithm from Netflix's open-sourced concurrency-limits library) tracks the cluster's response-time gradient and shrinks the admit rate when the gradient turns positive (response time growing faster than offered load). The adaptive limit is the upstream sibling of the per-replica queue bound: same closed-loop control philosophy, applied at the gateway rather than the hedger.

Queue depth as a first-class metric is a meta-chapter on instrumentation. Most production systems instrument latency and throughput but under-instrument queue depth; the bounded-queueing discipline depends on queue depth being visible everywhere. The chapter walks through the patterns for exposing queue depth at every layer (thread pools, connection pools, kernel runqueues, RPC queues, gossip-mesh queues) and the dashboards that make it actionable. A team that ships hedging without queue-depth dashboards is shipping incident-prone control logic; the meta-chapter is the antidote.

Cancellation propagation in distributed RPC closes the loop on the cost side. The bounded-queueing discipline keeps hedges from firing into saturated queues, but for hedges that do fire, the loser must be cancelled fast — otherwise the marginal cost of every fire is doubled, the QPS bump is bigger than the design budget, and the cluster's offered ρ rises by more than the analytical 5%. The cancellation chapter walks through gRPC's context-based cancellation, HTTP/2's RST_STREAM frame, and the propagation patterns that ensure a cancellation at the client reaches the deepest backend within microseconds.

The single architectural habit to take from this chapter: hedging is a control system, and every control system needs feedback. The latency threshold gives you the time-domain feedback ("the original is slow"); the queue bound gives you the load-domain feedback ("the target has headroom"). Both must be present for the loop to close. A hedger that has only the time-domain feedback is an open-loop controller, and open-loop control of a queueing system fails at exactly the moment when control matters most: during a spike. The bounded-queueing discipline is the small price you pay for a hedger that does not turn into a tail-amplifier when production conditions get interesting.

A second habit, sharper: when designing a hedging deployment, draw the response-time curve before you write the code. Mark the knee. Mark the operating point. Mark the leftward shift that hedging will introduce. If the leftward shift moves the operating point past the knee, the design is wrong before any code ships — either the cluster needs more capacity, the hedge threshold needs to be higher, or the bound needs to be tighter. The figure at the start of this chapter is not pedagogy; it is the design tool. Every hedging deployment that has gone wrong in production has gone wrong because someone shipped without drawing this figure.

A third habit, even sharper: instrument the suppression counters as carefully as the firing counters. Most hedging dashboards display hedge_fired_total and call it done; the bounded-queueing discipline makes hedge_suppressed_queue_total equally important. A quiet system will show 5% firing and 0% suppression — healthy. A stressed system will show 4% firing and 20% suppression — also healthy, the bound is doing its job. A broken system will show 25% firing and 0% suppression — the bound is missing or misconfigured, and the cluster is one spike away from incident. The diagnostic value is in the ratio: suppression > firing means "the bound is currently the load-bearing primitive", and that's the moment to look at offered load, replica health, and cluster capacity rather than at the hedger's threshold. Without the suppression counter, this diagnostic is invisible.

References