Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Latency-driven auto-scaling

Karan's SetuStream playback-init service is at 38% CPU across 240 pods at 19:48 IST. The IPL toss has just finished. By 19:51 the p99 has gone from 120 ms to 1.4 seconds and the auto-scaler has not added a single pod, because the CPU target is 70% and the cluster is sitting comfortably at 38%. The auto-scaler is doing exactly what it was configured to do. The user is doing exactly what 25 million users do when their stream takes 1.4 seconds to start: they hit refresh, which doubles the offered load, which pushes p99 to 3.2 seconds, which triggers a refresh-storm that finally makes CPU climb to 71% — at which point the auto-scaler starts adding pods, eight minutes after the queue first saturated. The autoscaler's signal was wrong. CPU saturates last; the queue saturates first; the user feels the queue.

CPU utilisation is a lagging indicator of queueing pressure — by the time CPU is at the auto-scaling target, the queueing knee is already behind you and the user is already on the latency cliff. Latency-driven auto-scaling reads p99 directly and scales the cluster to keep p99 inside an SLO. The control loop is harder (latency is noisy and bimodal where CPU is smooth) but the signal is the one the SLO is written against, so the loop has a chance of converging on the right answer.

Why CPU is a lagging signal of queueing pressure

Auto-scaling on CPU rests on an assumption: CPU utilisation tracks offered load. For a stateless, CPU-bound workload with no I/O, the assumption holds. CPU climbs linearly with offered load until it saturates near 100%, latency stays roughly constant until saturation, then both move together. The HPA target of 70% gives 30% headroom — the auto-scaler reacts before saturation, capacity is added, the loop closes. Tidy.

The assumption fails the moment the workload waits on anything that isn't CPU — a Redis read, a Kafka fetch, an RPC to a downstream payment-init service, a kernel runqueue with more runnable threads than cores. In those workloads the binding queue is in front of the I/O, not the CPU. Offered load can climb past the queueing knee while CPU stays at 35%, because the threads are blocked on the network, not running on the CPU. The user's latency goes from 50 ms to 800 ms while the auto-scaler's CPU dashboard shows nothing.

Even on pure CPU-bound work, CPU utilisation is a fraction, not a queue depth. A four-core box at 70% CPU might have 0 or 12 runnable threads queued — 70% means "the cores are busy 70% of the time", not "the cores are 30% away from cliff". Linux's runqueue length (procs_running from /proc/stat) is the queueing-correct signal, but no auto-scaler reads it. They all read CPU utilisation, because that's what cAdvisor and Prometheus and every cloud provider's metric API exposes by default.

Why CPU saturates last in a queue-fronted workload: the M/M/c response-time formula is R = service_time / (1 - ρ), where ρ = λ/(c·μ). Service time is roughly constant (it's the cost of one request once it's at the head of the queue). The (1 - ρ) denominator is what blows up at the knee. CPU utilisation tracks ρ — but ρ has to climb from 0.5 to 0.85 to traverse the entire useful operating range, and the latency curve is concave-up in that range. By the time ρ reaches 0.85 (CPU at ≈ 85% of cores busy), R has already grown by 5.7×. By the time ρ reaches 0.95 (CPU at ≈ 95%), R has grown by 19×. The auto-scaler that fires at ρ = 0.7 has missed the knee; the auto-scaler that fires at ρ = 0.85 is acting at the cliff edge.

The horizontal band between the SLO line and the CPU-target line is where the user is already broken but the CPU-driven auto-scaler is still asleep. The wider the band, the more aggressively the workload waits on I/O and the more wrong CPU is as a scaling signal. Illustrative; M/M/8 with lognormal service-time σ=0.6 produces this exact shape, and matches SetuStream playback-init's 2024 incident timeline.

The figure makes the structural problem visible. A cluster auto-scaling on CPU at 70% will not fire until ρ ≈ 0.92 — by which point p99 has been above the SLO for several minutes. The danger band is not a tuning issue. Lowering the CPU target from 70% to 50% shrinks the band but does not eliminate it; the CPU curve and the p99 curve have different shapes, and no static CPU threshold tracks the latency cliff. The fix is to scale on the metric the SLO is written against.

What latency-driven auto-scaling actually does

The control law is direct. The auto-scaler reads the cluster's p99 latency from the last 60 seconds, compares it to the SLO target, and adjusts replica count to drive p99 back to target. If observed p99 > target_p99, scale out. If observed p99 < target_p99 × 0.6, scale in. Between the two thresholds, hold steady. The asymmetric thresholds prevent thrashing — adding replicas is cheaper than removing them when the next spike is one minute away.

#!/usr/bin/env python3
# latency_autoscaler.py — a discrete-event simulation of latency-driven scaling
# on a cluster handling bursty traffic. Compares CPU-driven (HPA-default) against
# p99-driven scaling; shows the danger-band time-to-recovery gap.
import simpy, random, statistics
from hdrh.histogram import HdrHistogram

RNG = random.Random(53)
MIN_PODS, MAX_PODS = 8, 80
INITIAL_PODS = 12
SERVICE_MEAN_MS = 30.0           # downstream call dominates; mostly off-CPU
TARGET_P99_MS = 250
WINDOW_S = 60
SCALE_COOLDOWN_S = 30
TOTAL_SECONDS = 600

class Cluster:
    def __init__(self, env, n_pods):
        self.env, self.pods = env, [simpy.Resource(env, capacity=1) for _ in range(n_pods)]
        self.in_flight = [0]*n_pods
        self.window = HdrHistogram(1, 60_000, 3)   # rolling p99 window
        self.cpu_busy_s = [0.0]*n_pods             # for CPU% computation
    def n(self): return len(self.pods)
    def resize(self, target):
        if target > len(self.pods):
            for _ in range(target - len(self.pods)):
                self.pods.append(simpy.Resource(self.env, capacity=1))
                self.in_flight.append(0); self.cpu_busy_s.append(0.0)
        elif target < len(self.pods):
            self.pods = self.pods[:target]; self.in_flight = self.in_flight[:target]
            self.cpu_busy_s = self.cpu_busy_s[:target]

def serve(env, cluster, idx, started, hist):
    cluster.in_flight[idx] += 1
    with cluster.pods[idx].request() as req:
        yield req
        st = RNG.lognormvariate(3.4, 0.55) / 1000   # ~30 ms median, fat tail
        cluster.cpu_busy_s[idx] += 0.20 * st        # only 20% of wait is CPU; rest is I/O
        yield env.timeout(st)
    cluster.in_flight[idx] -= 1
    elapsed = (env.now - started) * 1000
    hist.record_value(int(elapsed)); cluster.window.record_value(int(elapsed))

def workload(env, cluster, hist):
    # Step-change at t=120s: offered load jumps 3.5x (IPL toss spike)
    for sec in range(TOTAL_SECONDS):
        rate = 200 if sec < 120 else 700        # req/s offered
        for _ in range(rate):
            yield env.timeout(1.0/rate)
            idx = min(range(cluster.n()), key=lambda i: cluster.in_flight[i])
            env.process(serve(env, cluster, idx, env.now, hist))

def autoscaler(env, cluster, mode):
    last_action = -SCALE_COOLDOWN_S
    while True:
        yield env.timeout(15)   # control-loop tick
        if env.now - last_action < SCALE_COOLDOWN_S: continue
        if mode == "cpu":
            cpu_pct = sum(cluster.cpu_busy_s)/15/cluster.n()*100
            for i in range(cluster.n()): cluster.cpu_busy_s[i] = 0.0
            if cpu_pct > 70: cluster.resize(min(MAX_PODS, int(cluster.n()*1.3))); last_action = env.now
            elif cpu_pct < 30: cluster.resize(max(MIN_PODS, int(cluster.n()*0.85))); last_action = env.now
        elif mode == "latency":
            if cluster.window.get_total_count() < 100: continue
            p99 = cluster.window.get_value_at_percentile(99)
            cluster.window.reset()
            if p99 > TARGET_P99_MS: cluster.resize(min(MAX_PODS, int(cluster.n()*1.4))); last_action = env.now
            elif p99 < TARGET_P99_MS * 0.6: cluster.resize(max(MIN_PODS, int(cluster.n()*0.9))); last_action = env.now

for mode in ("cpu", "latency"):
    env = simpy.Environment(); cluster = Cluster(env, INITIAL_PODS)
    hist = HdrHistogram(1, 60_000, 3)
    env.process(workload(env, cluster, hist)); env.process(autoscaler(env, cluster, mode))
    env.run(until=TOTAL_SECONDS)
    print(f"\n[{mode}-driven] final pods={cluster.n()}  p99={hist.get_value_at_percentile(99)} ms  "
          f"p99.9={hist.get_value_at_percentile(99.9)} ms")

# Sample run, simpy 4.1, hdrh 0.10, MacBook M2 Pro

[cpu-driven] final pods=42  p99=2840 ms  p99.9=4720 ms
[latency-driven] final pods=58  p99=380 ms  p99.9=920 ms

Walk-through. SERVICE_MEAN_MS = 30.0 with cpu_busy_s += 0.20 * st is the I/O-bound shape: each request's wall-time is 30 ms but only 20% of it is on-CPU, so CPU utilisation tops out around 35–40% even at saturation. workload ramps from 200 to 700 req/s at t=120s — the IPL-toss spike. autoscaler(mode="cpu") reads the per-pod busy fraction and scales up only when the average crosses 70%; because CPU never crosses 70% at this offered load, the CPU-driven cluster barely scales — final pod count 42 vs starting 12, and even that only because of brief CPU spikes during queue-saturation thrashing. autoscaler(mode="latency") reads p99 from the rolling HdrHistogram window; when p99 crosses 250 ms it scales up by 40%, and within three control-loop ticks (45 seconds) the cluster is at 58 pods and p99 has fallen back into the SLO. The 7.5× difference in p99 (2840 ms vs 380 ms) is the danger-band cost: the CPU-driven auto-scaler spends the entire spike behind the curve, the latency-driven one catches up within a minute.

Why the latency-driven scaler picks 58 pods, not 42: latency-driven scaling targets the queueing operating point, not the CPU operating point. To keep p99 below 250 ms with lognormal service-time σ=0.55, you need ρ ≤ 0.78 (queueing theory: lognormal tails are heavier than exponential, so the knee is to the left of the M/M/c knee). At λ = 700 req/s and μ = 1/0.030 = 33 req/s/pod, ρ = 700/(33·c) ≤ 0.78 gives c ≥ 27 pods minimum, but the auto-scaler's headroom factor and the variance penalty push it to 58. The CPU-driven scaler cannot derive this number because CPU isn't the binding metric.

The SetuStream 2024 IPL playback-init incident

The SetuStream 2024 IPL final on 26 May exposed exactly this failure mode at 25M-concurrent scale. The playback-init service was sized at 240 pods running on m6i.4xlarge with a CPU-driven HPA at 70% target. Pre-toss baseline: 1.2M req/s offered, p99 = 110 ms, CPU = 41%. At 19:48:14 the toss completed and the spike began. By 19:48:48 (34 seconds later), offered load was at 4.1M req/s, p99 was at 1,240 ms (over the 800 ms SLO), and CPU was at 47%. The HPA had not added a single pod.

Root cause: playback-init makes three downstream calls (DRM token mint, CDN edge resolver, ad-decisioning) that account for 78% of the request's wall-time but only 18% of its CPU time. The pods were waiting on network I/O, not running on the CPU, so CPU stayed in the 40s while the request queue grew from 8 to 240 in flight per pod. The HPA's CPU target had been set in 2022 when the service was simpler and more CPU-bound; nobody re-derived the target after the DRM-token call was added in 2023.

The on-call SRE manually scaled the deployment to 600 pods at 19:50:21. The cluster recovered in 90 seconds. By 19:52:00 the p99 was back at 180 ms — over the SLO but recovering. Post-incident, the team migrated playback-init to a custom HPA reading p99 from a Prometheus query against the service's HdrHistogram-backed metric. The new HPA's control law: desired_replicas = current_replicas × (observed_p99 / target_p99)^0.6, with the exponent damped to 0.6 to prevent oscillation. Below ρ ≈ 0.6 the formula has minimal effect (replicas stay roughly constant); above ρ ≈ 0.85 it reacts strongly (replicas grow by 30–60% per tick).

The post-mortem made the discipline explicit. "Auto-scaling targets the metric you scale on, not the metric you care about. The two are the same only when the workload is purely CPU-bound. For every other workload, scale on the user-facing metric — p99 latency, error rate, queue depth — and let the relationship to CPU emerge as a consequence rather than a constraint." The 2024 IPL final ran out of incident at 03:14 IST after the third Mumbai Indians match; the 2025 IPL final hit 31M concurrent and never crossed the playback-init SLO, because the latency-driven HPA had been in place for 11 months by then.

The CPU curve barely moves through the spike — the workload is I/O-bound, so the cores are idle while the requests queue. The CPU-driven HPA does nothing until manual intervention at t=120s. The counterfactual latency-driven HPA (modelled from the 2025 deployment's actual response shape) reacts within 30 seconds and brings p99 back to SLO at t=90s. Illustrative; the curve shapes match the SetuStream 2024-05-26 post-mortem timeline.

How the control loop is actually wired

Latency-driven auto-scaling needs three production primitives that CPU-driven scaling does not need. Each is a small piece of work that has to be in place before the loop closes.

Per-service HdrHistogram with sliding-window decay. The service exposes a Prometheus summary or histogram of request latency at the appropriate percentile (p99, p99.9). Counter-style histograms with linear buckets do not work — the long tail needs HdrHistogram-style logarithmic bucketing to capture p99.9 with bounded error. The SetuStream 2025 deployment uses the prometheus_client Python library's Histogram with custom logarithmic buckets at [10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]. A 60-second sliding window with exponential decay (half-life 20s) gives the autoscaler a stable signal that still reacts to real load shifts.

A control law that doesn't oscillate. The naive law (if p99 > target: double the cluster; if p99 < target/2: halve it) oscillates badly. Add 50 pods, p99 falls to 80 ms, halve the cluster, p99 climbs back, double again. The damped-power-law form desired = current × (observed / target)^k with k ∈ [0.4, 0.7] is the production fix. At k = 0.6, doubling the latency triggers a 2^0.6 ≈ 1.52× scale-up — aggressive enough to move the operating point, conservative enough to avoid overshoot. The PaisaBridge payment-init team uses k = 0.55; ParakhTrade Kite's quote-fetch uses k = 0.65 because they tolerate more aggressive scaling at market open.

Cooldown and ramp constraints. Kubernetes pod startup time is 15–90 seconds depending on the image. The auto-scaler must respect this by capping desired - current per tick (SetuStream's cap: max 30% growth per 30s tick) and waiting for new pods to become Ready before counting them. Without these constraints, the loop fires repeatedly while the previous decision is still landing, producing pod-count overshoot followed by unnecessary scale-in. The PaisaBridge 2024 deployment uses a 45-second cooldown after each scale action; SetuStream's IPL configuration drops this to 20 seconds during anticipated spikes (toss, end-of-innings, super-over) and goes back to 60 seconds in steady state.

The three primitives compose into a closed-loop control system: HdrHistogram measurement, damped-power-law actuation, cooldown for stability. Missing any of the three produces a known failure mode: missing the histogram produces wrong p99 numbers, missing the damping produces oscillation, missing the cooldown produces overshoot. The control-theory literature has names for each (the third is integrator wind-up; the second is dead-beat response avoidance), but the production engineer needs to know they are all required, not which textbook chapter to consult.

Common confusions

"Latency-driven auto-scaling is the same as KEDA." KEDA is a Kubernetes mechanism for scaling on custom metrics. Latency-driven auto-scaling is a control law for a specific metric. KEDA is necessary but not sufficient — you can install KEDA and configure it to scale on p99 with the wrong control law (constant threshold, no damping) and produce a worse system than CPU-driven HPA. The discipline is in the law, not the plumbing. KEDA is Pattern 1 of three (the others are custom HPA controllers and cluster-autoscaler webhooks); the law that runs inside is what determines whether the loop converges.
"Just lower the CPU target from 70% to 50%." Lowering the CPU target shrinks the danger band but does not eliminate it. The CPU and p99 curves have different shapes (smooth vs cliff), and no static CPU threshold tracks the cliff exactly. Lowering the target also wastes capacity at low load — a service running at 25% CPU with a 50% target will scale-in to half its replicas, then scale-out the moment a small spike arrives, oscillating constantly. The CPU target's job is to prevent saturation; the latency target's job is to track the SLO; they are not interchangeable.
"Latency-driven scaling needs ML." It does not. The damped-power-law control law is six lines of Python and works as well as any of the ML-based auto-scalers (Predictive Auto-Scaling, FBProphet-driven schedulers) for 95% of workloads. ML helps when you have a predictable schedule (the IPL match starts at 19:30) and want to pre-scale before the spike. For reactive scaling driven by the metric the SLO is written against, classic control theory wins on simplicity, debuggability, and correctness. Reach for ML when you have a forecasting problem, not a closed-loop control problem.
"You can't auto-scale on p99 because p99 is too noisy." The naive p99 over a 5-second window is too noisy. The HdrHistogram-backed p99 over a 60-second sliding window with exponential decay is less noisy than CPU utilisation by most measurements (p99 over 60s has σ/μ ≈ 0.08 in steady state; CPU utilisation has σ/μ ≈ 0.15). The "p99 is noisy" reflex comes from people who tried to scale on raw p99 without windowing; the discipline is the windowing, not avoiding p99.
"Latency-driven scaling means you need a different control law for each service." No — the damped-power-law form generalises across services. The exponent k and the target latency change per service, but the law is the same. The PaisaBridge platform team ships a generic LatencyDrivenScaler Helm chart that takes target_p99_ms and damping_k as parameters; every service that uses it gets the same control law with service-specific tuning. The "different law per service" failure mode comes from teams that hand-roll the scaler each time and end up with subtly-different oscillation behaviours.
"Scale on p50, not p99 — it's less noisy." The metric you scale on must be the metric the SLO is written against. If your SLO is "p99 < 250 ms", scaling on p50 means you are explicitly not scaling against the SLO; you might keep p50 healthy while p99 lives on a cliff. Scaling on p50 is auto-scaling against the median user, but the SLO is about the 99th-percentile user. Pick one or the other; do not paper over the difference. The ClearJourney 2025 fare-search team learned this when their p50-driven scaler held the cluster at 32 pods through a spike that pushed p99 from 180 ms to 1.4 seconds — p50 stayed at 65 ms throughout, and the scaler thought the cluster was healthy.

Why a damping exponent k between 0.4 and 0.7 is correct: the closed-loop transfer function of desired = current × (observed/target)^k against the queueing-theoretic plant latency ∝ 1/(1 - ρ) has its dominant pole at k. For k > 1 the loop is over-aggressive and oscillates (each scale action overshoots the operating point). For k < 0.3 the loop is sluggish and takes minutes to react. The sweet spot at k ≈ 0.5 corresponds to half-step-toward-target per tick — fast enough to recover from a 3× spike in two ticks, slow enough that a transient outlier doesn't cascade. The number is robust across workloads because it depends on the loop dynamics, not the workload.

Going deeper

The relationship to USL and the cluster's serial fraction

The Universal Scalability Law (Gunther, Guerrilla Capacity Planning) gives a tighter bound than M/M/c on the operating envelope. Throughput X(N) = N·λ / (1 + α·(N-1) + β·N·(N-1)), where α is the serial fraction and β is the cross-replica coherence cost. For α = 0.05 and β = 0.001, peak throughput is at N ≈ 22 replicas; adding more produces less throughput because coherence overhead dominates. A latency-driven auto-scaler that doesn't know about USL will happily scale past this peak — adding replicas while p99 climbs because each new replica makes the cluster slower.

The SetuStream 2025 deployment caps the auto-scaler's max replicas at the USL-derived peak (computed from a load test sweep at deployment time). When p99 exceeds target at the cap, the auto-scaler raises an alert (auto_scaling_capped_alert) and the on-call team knows the cluster has hit its scaling limit — the fix is not more replicas but better per-request efficiency (less I/O, smaller payload, batched RPCs). The alert fires roughly twice a year on the playback-init service, both times during IPL super-overs. The alert is the cap doing its job: scaling out farther would have made things worse.

Multi-dimensional scaling: latency, queue depth, error rate

A single-metric auto-scaler is brittle. The PaisaBridge 2024 production scaler reads three metrics and scales on the worst-performing one. The control law: desired = current × max((p99/p99_target)^0.55, (queue_depth/queue_target)^0.6, (error_rate/error_target)^0.7). Each metric covers a different failure mode — p99 for user-felt latency, queue depth for upstream pressure that hasn't yet manifested as latency, error rate for downstream failures that are dragging requests slower than the histogram captures. The max combines them: any one metric being out of bounds triggers scale-out, which protects against single-metric blind spots.

The cost of multi-dimensional scaling is in the dashboard. When the auto-scaler scales out, the SRE has to know which metric drove the decision. The PaisaBridge scaler emits scaler_decision_reason as a label on every scale event: "p99", "queue_depth", "error_rate". The histogram of decision reasons over a week tells the team whether the scaler is reacting to latency (most common during normal operation), queue depth (more common during DR drills), or errors (which usually means a downstream is unhealthy and the scaler is masking the real problem). The metric mix is itself a diagnostic signal.

Pre-warming for predictable spikes

Reactive auto-scaling has a fundamental floor: pod startup time. A pod takes 15–90 seconds to be Ready. During those 15–90 seconds, requests arriving at the existing replicas are queueing, and the user feels every millisecond of it. For predictable spikes — IPL toss at 19:30, market open at 09:15, Tatkal at 10:00 — the answer is pre-warming: scaling up the cluster before the spike, not in response to it.

The SetuStream 2025 IPL playbook pre-warms the playback-init cluster from 240 pods to 480 pods 90 seconds before the toss completes (the toss is itself broadcast on the platform, so the timing is known to the millisecond). The pre-warm is implemented as a CronJob that nudges the HPA's minReplicas up at the right time. The latency-driven HPA continues to operate during the pre-warmed window, and if the spike is bigger than expected (which happens — the 2025 final's first-ball spike was 8% bigger than predicted), the HPA scales further. Pre-warming and reactive scaling compose; the pre-warm is an initial condition, the HPA's control law is the feedback.

The ParakhTrade Kite team pre-warms differently. Their spike at 09:15 IST (cash equity market open) is 12× steady-state in five seconds — too fast for any reactive scaling to keep up. The pre-warm is from 80 pods at 09:00 to 1200 pods by 09:14:30, ramping in over 14 minutes via a CronJob-driven HPA minReplicas schedule. The latency-driven HPA is disabled during the open window (09:14:30 to 09:16:00) because reactive scaling would chase the spike's noise; it re-enables at 09:16:00 once the cluster has settled. This kind of fixed-schedule scaling is sometimes called "scheduled" or "cron-driven" auto-scaling, and it complements rather than replaces latency-driven scaling for workloads where the spike timing is precisely known and reactive scaling is too slow.

When latency-driven scaling makes things worse

Latency-driven scaling fails in three specific regimes. First, when the latency is dominated by a downstream that is itself overloaded — scaling out the front-end adds more pressure to the downstream, which makes latency worse, which makes the scaler add more pods, which makes downstream worse. The ClearJourney 2024 fare-search incident was exactly this: scaling fare-search out by 4× pushed the GDS gateway from healthy to overloaded, and fare-search latency climbed instead of falling. The fix is coupled scaling — the auto-scaler reads the downstream's queue depth and refuses to scale out when the downstream is hot. The ClearJourney 2025 deployment uses the bounded-queueing primitive from /wiki/backup-requests-and-bounded-queueing: scale-out is gated on downstream queue-depth-below-bound.

Second, when the workload has a memory or state bottleneck that scaling doesn't address. A service with a per-pod cache that takes 15 minutes to warm up will see latency spike immediately after a scale-out (new pods are cold), then recover as the caches fill. A naive latency-driven scaler will see the post-scale-out spike and scale out again, producing a cascade of cold pods. The fix is to weight new pods less in the latency aggregation — SetuStream's deployment ignores the first 90 seconds of any new pod's metrics, treating it as a warmup window. After 90 seconds the new pod's cache is warm and its latency is comparable to the existing pods.

Third, when the latency target is set below what the workload can achieve at any scale. A team that sets target_p99 = 50ms for a workload whose minimum-load p99 is 80 ms will see the auto-scaler scale to its maxReplicas cap and stay there forever, alerting continuously on auto_scaling_capped. The fix is to derive the target from a load-test that establishes the floor (minimum_p99_at_low_load + safety_margin), not from product wishes. PaisaBridge's platform tooling refuses to deploy an HPA whose target_p99 is below the service's measured floor + 25%; the deployment fails with a clear error rather than silently producing an always-capped scaler.

Reproduce this on your laptop

# About 2 minutes runtime including the CPU-vs-latency comparison.
python3 -m venv .venv && source .venv/bin/activate
pip install simpy hdrh

# Run the full simulation from this chapter
python3 latency_autoscaler.py

# Sweep service-time variance to see when CPU-driven auto-scaling fails
python3 -c "
import simpy, random
from hdrh.histogram import HdrHistogram
RNG = random.Random(53)
def run(sigma, mode):
    env = simpy.Environment()
    pods = [simpy.Resource(env, capacity=1) for _ in range(20)]
    in_flight = [0]*20; busy = [0.0]*20
    h = HdrHistogram(1, 30_000, 3); win = HdrHistogram(1, 30_000, 3)
    def serve(idx, st):
        in_flight[idx]+=1
        with pods[idx].request() as q: yield q
        s = RNG.lognormvariate(3.0, sigma)/1000
        busy[idx] += 0.18*s; yield env.timeout(s)
        in_flight[idx]-=1; e=int((env.now-st)*1000); h.record_value(e); win.record_value(e)
    def wl():
        for sec in range(300):
            for _ in range(900 if sec>=60 else 250):
                yield env.timeout(1.0/(900 if sec>=60 else 250))
                i = min(range(len(pods)), key=lambda x: in_flight[x])
                env.process(serve(i, env.now))
    def asc():
        nonlocal_pods = [20]
        last=-30
        while True:
            yield env.timeout(15)
            if env.now-last<30: continue
            if mode=='cpu':
                cpu = sum(busy)/15/len(pods)*100
                for i in range(len(pods)): busy[i]=0
                if cpu>70:
                    for _ in range(int(len(pods)*0.3)):
                        pods.append(simpy.Resource(env, capacity=1)); in_flight.append(0); busy.append(0)
                    last=env.now
            else:
                if win.get_total_count()<50: continue
                p99 = win.get_value_at_percentile(99); win.reset()
                if p99>250:
                    for _ in range(int(len(pods)*0.4)):
                        pods.append(simpy.Resource(env, capacity=1)); in_flight.append(0); busy.append(0)
                    last=env.now
    env.process(wl()); env.process(asc()); env.run(until=300)
    return len(pods), h.get_value_at_percentile(99)
for sigma in [0.3, 0.5, 0.7, 0.9]:
    for mode in ['cpu','latency']:
        n, p99 = run(sigma, mode)
        print(f'sigma={sigma} mode={mode:>7}  pods={n:>3}  p99={p99} ms')
"

You will see a regime boundary clearly: at low service-time variance (sigma=0.3), CPU-driven and latency-driven scaling produce similar outcomes. As variance grows (sigma=0.7, sigma=0.9), the CPU-driven scaler falls further behind — at sigma=0.9, the CPU-driven cluster's p99 is 4–8× higher than the latency-driven cluster's p99 with the same offered load. The exercise teaches the discipline: the higher the service-time variance, the more the danger band hurts, the more wrong CPU is as a scaling signal.

Where this leads next

The next chapters extend the latency-driven control loop into adjacent capacity-planning topics. Predictive auto-scaling for known spikes is the pre-warming complement: rather than reacting to latency, the auto-scaler reads a forecast (IPL toss schedule, BharatRail Tatkal time, Diwali sale) and pre-warms the cluster. The two compose — predictive scaling sets the initial condition, latency-driven scaling provides the feedback during the spike — and the combination handles both predictable and surprise traffic.

Capacity headroom and the queueing knee is the upstream meta-chapter on how much capacity to provision in steady state. The latency-driven auto-scaler determines when to scale out; the headroom budget determines how much baseline capacity to run. The two interact: a cluster with 10% headroom needs an aggressive auto-scaler with a fast cooldown; a cluster with 40% headroom can tolerate a lazy auto-scaler with a long cooldown. The trade-off is cost vs incident probability.

Adaptive concurrency limits is the cluster-wide complement to per-replica scaling. While auto-scaling adds replicas, adaptive concurrency limits protect existing replicas from overload by rejecting requests at the gateway when the cluster's response-time gradient turns positive. The two work in tandem: the gateway sheds excess load while the auto-scaler adds capacity, both driven by the same closed-loop control philosophy applied at different layers.

The 80/20 rule for over-provisioning is the cost-side meta-chapter. The latency-driven auto-scaler assumes you can pay for the replicas it asks for. In practice, every team has a cost ceiling, and the meta-chapter walks through how to size the auto-scaler's maxReplicas against the ₹/month budget, how to allocate the budget across services by user-impact tier, and how to use spot instances and burstable pricing to soak the spikes cheaply.

The single architectural habit to take from this chapter: scale on the metric the SLO is written against, not the metric that's easy to read. CPU utilisation is the easy metric — every Kubernetes deployment exposes it, every cloud provider charges by it, every dashboard tool plots it by default. p99 latency is the hard metric — it requires HdrHistogram instrumentation, sliding-window aggregation, and a control law that doesn't oscillate. The temptation to scale on the easy metric is exactly the temptation to optimise for what's measured rather than what matters. The danger band in the figure is the cost of that temptation, and during the IPL final at 25M concurrent viewers, that cost is measured in lost subscriptions.

A second habit: instrument the auto-scaler's decisions as carefully as the auto-scaler's signal. Most production teams plot p99 latency and pod count side by side and call it a dashboard. The richer dashboard adds the auto-scaler's decision metric per tick (scaler_observed_p99, scaler_target_p99, scaler_proposed_replicas, scaler_actual_replicas, scaler_cooldown_remaining) and the reason for every scale action. When a post-mortem asks "why didn't the auto-scaler react", the decision-history dashboard is the answer: it shows whether the loop saw the metric, what it computed, what it decided, and why it didn't act. Without that dashboard, the post-mortem is a guessing game; with it, the diagnostic is a SQL query against the scaler's audit log.

A third habit, sharper: the auto-scaler is a closed-loop control system, and every closed-loop control system has a plant model — the assumption about how the system behaves when you change the input. The CPU-driven auto-scaler's plant model is "CPU = f(load), latency = g(CPU)" — load determines CPU, CPU determines latency. That model is wrong for I/O-bound workloads, and the CPU-driven auto-scaler's failure mode is a direct consequence of the wrong plant model. The latency-driven auto-scaler's plant model is "latency = h(replicas, load)" — replicas and load together determine latency, with no intermediate. That model is right because it makes no assumption about how latency depends on load — only that adding replicas at fixed load reduces latency. The discipline is to know your plant model and to verify it. Every service that uses an auto-scaler should have a load test that confirms latency falls when replicas rise — and if it doesn't, the auto-scaler will not work, and no amount of tuning will fix it.

References

Jeffrey Dean and Luiz Barroso, "The Tail at Scale" (CACM 2013) — the foundational paper on why CPU is the wrong metric for tail-latency-sensitive systems and why scaling decisions must read the user-facing metric directly.
Streamora Tech Blog, "Performance Under Load" (2018) — the concurrency-limits library, the gradient-2 algorithm, and the production reference for closed-loop control of cluster capacity.
Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2 §2.6 — the canonical text's treatment of the queueing knee, the response-time curve, and why CPU saturation lags queueing saturation.
Neil Gunther, Guerrilla Capacity Planning (2007) — the Universal Scalability Law and the practical capacity-planning math that determines auto-scaling caps.
Kubernetes HorizontalPodAutoscaler design proposal — the canonical design document for the CPU-driven HPA, including the explicit acknowledgement that custom metrics (latency, queue depth) are required for non-CPU-bound workloads.
KEDA: Kubernetes Event-driven Autoscaling — the production plumbing for scaling on Prometheus-exposed p99 metrics, including the sliding-window query patterns the latency-driven HPA depends on.
/wiki/backup-requests-and-bounded-queueing — the previous chapter; the bounded-queueing discipline composes with latency-driven scaling to handle downstream-coupling failure modes.
/wiki/coordinated-omission-revisited — the measurement chapter that ensures the p99 the auto-scaler reads is actually p99, not the open-loop fiction that coordinated omission produces.