Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Autoscaling: metric-based and predictive
At 09:59:55 IST on a weekday, BharatRail Tatkal sits at 12% CPU across 80 pods of the booking API, the steady-state load of overnight scheduling. At 10:00:00 the gates open; in 90 seconds, offered RPS climbs from 4K to 220K. The Horizontal Pod Autoscaler (HPA) is configured for CPU > 70% → scale up. At 10:00:08 CPU crosses 70%; the HPA controller polls the metrics server every 15 seconds, decides to add 40 pods at 10:00:18, the Kubernetes scheduler places them by 10:00:25, the container images pull and start by 10:01:55, and the readiness probe passes by 10:02:10. The new pods are serving traffic at 10:02:10, two minutes and ten seconds after the spike began. The spike peaked at 10:01:30. The autoscaler caught up to a wave that had already broken. Every Tatkal-window outage of the last decade is a variation of this story: the lag between signal and served capacity is longer than the spike itself.
Autoscaling moves capacity to track demand, but the controller's reaction lag plus the pod cold-start time is almost always longer than the duration of the spikes you most care about. Metric-based scaling (HPA, Cluster Autoscaler) reacts to current load with a 60–180 second lag and works for slow drifts; predictive scaling uses time-of-day, calendar, or learned models to pre-warm capacity before a known spike. The competent production system uses both — predictive for known shapes (IPL match start, Tatkal hour, payday), metric-based for the unknown — and pairs them with load shedding for the gap that neither covers.
What the autoscaler actually does, and why every step takes seconds
A Kubernetes HPA (Horizontal Pod Autoscaler) is the canonical reactive autoscaler — and almost every cloud autoscaler (AWS Application Auto Scaling, GCP Autoscaler, Azure VMSS, even Lambda's concurrency-based scaling) shares its loop structure. Understanding the loop's stages — and how long each stage takes — is the difference between trusting the autoscaler and being surprised by it.
Stage 1: Metric collection (5–60 s window). The metrics-server (or Prometheus, or Datadog, or CloudWatch) scrapes pod-level CPU, memory, or custom metrics on a poll interval, typically 15 s for metrics-server and 30–60 s for Prometheus. Each pod is scraped individually and the values are then aggregated. A spike that starts at t=0 might not be visible to the controller until t=15 s or later, depending on where in the scrape cycle it lands.
Stage 2: Decision (15 s default poll). The HPA controller polls the metric every --horizontal-pod-autoscaler-sync-period (default 15 s in modern K8s, was 30 s in older versions). Each poll runs the algorithm: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Stabilisation windows damp out flapping: --horizontal-pod-autoscaler-downscale-stabilization (default 5 minutes) governs scale-down cluster-wide, and the behavior field of an autoscaling/v2 HPA sets a per-HPA window for either direction (scale-up has no window by default).
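The decision formula is easy to check by hand. The sketch below is a minimal illustration of it, not the controller's actual code; the function name and example numbers are hypothetical, and the 0.1 tolerance band mirrors the documented default of --horizontal-pod-autoscaler-tolerance.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count one HPA sync would ask for, per the formula above."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Hypothetical example: 80 pods at 100% CPU against a 70% target.
print(desired_replicas(80, 100, 70))
# Hypothetical example: 72% against a 70% target sits inside the band.
print(desired_replicas(80, 72, 70))
```

The first call asks for 115 replicas; the second leaves the count at 80. Note that the input is the metric as it stands at this sync, so a fast-rising spike needs several rounds.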
Stage 3: Scheduling (1–5 s). New pod specs are created in the API server; the scheduler picks nodes for them. If existing nodes have headroom, this is fast. If not, the Cluster Autoscaler (CA) kicks in — and now you wait for cloud-provider VM provisioning, which is 45–180 seconds for most clouds (AWS m5/c5/r5 family takes ~90 s from API call to Ready node).
Stage 4: Image pull (5–120 s). If the pod's container image is not on the node's local cache, it pulls from a registry. A 200 MB image over a 1 Gbps node network is ~2 seconds; the same image cold-pulled from ECR/GCR/Docker Hub through a NAT gateway can be 30–60 s. Image cache warmth is the difference between fast and slow scale.
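The pull-time figures above are plain bandwidth arithmetic. This sketch, with hypothetical function names, ignores layer decompression, parallel layer downloads, and registry latency, so treat it as a lower bound rather than a prediction.

```python
def pull_seconds(image_mb: float, throughput_mbps: float) -> float:
    """Seconds to move image_mb megabytes at throughput_mbps megabits per second."""
    return image_mb * 8 / throughput_mbps

def effective_mbps(image_mb: float, observed_seconds: float) -> float:
    """Throughput implied by an observed pull time, in megabits per second."""
    return image_mb * 8 / observed_seconds

# 200 MB at a full 1 Gbps, as in the text.
print(pull_seconds(200, 1000))
# Throughput implied by a 45 s cold pull of the same image (hypothetical observation).
print(effective_mbps(200, 45))
```

Running the second function against your own pull logs is a quick way to see how far a registry path falls short of the node's line rate.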
Stage 5: Application startup (1–120 s). The container runs its entrypoint. A Go binary is ready in ~50 ms. A Java service with Spring Boot, JIT warm-up, and a 200 MB heap is ready in 30–90 s. A Node.js service with a large dependency tree is 5–15 s. The readiness probe must pass before the pod gets traffic — and most readiness probes have a 10 s initialDelaySeconds plus a 5 s period.
Stage 6: Load-balancer registration (5–30 s). The Service controller updates the Endpoints object; kube-proxy on each node updates iptables/IPVS rules; if you are using a cloud LB (ALB, GCP LB), the LB's health-check has its own poll interval before the new pod gets traffic.
The end-to-end lag from spike start to "the new pod is serving real requests" is the sum of every stage — typically 90–240 seconds for a well-tuned setup, 5–10 minutes for a default configuration. This is the scale-up lag; it is the budget within which load shedding (/wiki/load-shedding-strategies) and degraded-mode fallbacks (/wiki/headroom-peak-and-degraded-modes) must hold the line.
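To see how the stage ranges add up, the sketch below sums the best and worst case of each stage. The ranges are copied from the stage descriptions; treating the decision stage as a 0–15 s wait for the next sync tick, and assuming nodes already have headroom (no Cluster Autoscaler wait), are assumptions of this example.

```python
# (best case, worst case) in seconds for each stage described above.
STAGES = {
    "metric collection": (5, 60),
    "decision": (0, 15),            # assumption: wait for the next sync tick
    "scheduling": (1, 5),           # assumption: nodes already have headroom
    "image pull": (5, 120),
    "application startup": (1, 120),
    "load-balancer registration": (5, 30),
}

def lag_bounds(stages):
    """Sum the best-case and worst-case seconds across all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

def cold_start_share(stages):
    """Fraction of the worst-case lag spent in image pull plus app startup."""
    cold = stages["image pull"][1] + stages["application startup"][1]
    return cold / lag_bounds(stages)[1]

print(lag_bounds(STAGES))
print(round(cold_start_share(STAGES), 2))
```

The bounds bracket the typical figures quoted above, and roughly two thirds of the worst case is cold start.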
Why pod cold-start dominates: of the six stages, image pull and app startup combined are 30–180 seconds; the rest are 30–60 seconds total. JVM-based services with Spring Boot routinely hit 90 s startup because of classpath scanning, bean construction, JIT warm-up, and connection-pool initialisation. Go and Rust services start in tens of milliseconds, which shifts the bottleneck to image pull. The biggest single autoscaling-lag improvement most teams can make is not tuning the HPA; it is shrinking the container image (multi-stage Docker builds, distroless base images) and pre-warming it on the node (DaemonSet image pre-pullers, or imagePullPolicy: IfNotPresent with a pre-baked AMI).
Building a runnable HPA simulator with metric-based and predictive policies
To see how the policies behave, we simulate a service with a controllable cold-start time and run two policies — pure HPA-style threshold scaling, and a predictive policy that pre-warms based on time-of-day. The simulator is simpy-based and reproducible on any laptop.
# autoscale_sim.py: compare metric-based and predictive autoscaling on a Tatkal-shaped spike
# Run: pip install simpy && python3 autoscale_sim.py
import random

import simpy

PER_POD_RPS_CAP = 100     # each pod can serve 100 RPS at p99 SLO
COLD_START_S = 90         # time from "create pod" to "pod serves traffic"
HPA_SYNC_PERIOD_S = 15
SCALE_UP_TARGET = 70      # add pods when CPU > 70%
SCALE_DOWN_TARGET = 30
SIM_DURATION_S = 600
WARMED_FLOOR = 80         # predictive: keep at least this many pods 09:55–10:05

def offered_rps(t):
    """BharatRail Tatkal shape: 4K baseline, ramp to 220K over t=100..130s, decay to ~50K by t=200s."""
    if t < 100: return 4_000
    if t < 130: return 4_000 + (t - 100) * 7_200    # 4K -> 220K
    if t < 200: return 220_000 - (t - 130) * 2_500  # peak then decay
    if t < 250: return max(50_000, 220_000 - (t - 130) * 2_500)
    return 8_000 + random.gauss(0, 1_000)           # post-rush calm

class Service:
    def __init__(self, env, initial_pods=10):
        self.env = env
        self.pods_serving = initial_pods  # ready pods
        self.pods_pending = 0             # cold-starting pods
        self.served = 0
        self.shed = 0
        self.lat_samples = []

    def cpu_pct(self):
        """Utilisation proxy: offered load over ready capacity, capped at 100."""
        capacity = self.pods_serving * PER_POD_RPS_CAP
        if capacity == 0: return 100
        return min(100, 100 * offered_rps(self.env.now) / capacity)

    def add_pods(self, n):
        self.pods_pending += n
        self.env.process(self._cold_start(n))

    def _cold_start(self, n):
        # Pods only count as serving once the full cold-start time has elapsed.
        yield self.env.timeout(COLD_START_S)
        self.pods_pending -= n
        self.pods_serving += n

def hpa_loop(env, svc, predictive=False):
    while True:
        yield env.timeout(HPA_SYNC_PERIOD_S)
        cpu = svc.cpu_pct()
        # metric-based decision
        if cpu > SCALE_UP_TARGET:
            target = max(svc.pods_serving + 1,
                         int(svc.pods_serving * cpu / SCALE_UP_TARGET))
            need = target - svc.pods_serving - svc.pods_pending
            if need > 0: svc.add_pods(need)
        elif cpu < SCALE_DOWN_TARGET and svc.pods_serving > 5:
            svc.pods_serving = max(5, svc.pods_serving - 1)
        # predictive: pre-warm at 09:55 (sim t=0..60s window before spike)
        if predictive and 0 < env.now < 60 and svc.pods_serving < WARMED_FLOOR:
            need = WARMED_FLOOR - svc.pods_serving - svc.pods_pending
            if need > 0: svc.add_pods(need)

def workload(env, svc):
    while True:
        yield env.timeout(1)
        # Whole requests only: keeps the counters integral for the ":d" report format.
        rps = max(0, int(offered_rps(env.now)))
        capacity = svc.pods_serving * PER_POD_RPS_CAP
        if rps <= capacity:
            svc.served += rps
            svc.lat_samples.append(40 + 200 * (rps / max(1, capacity))**4)
        else:
            svc.served += capacity
            svc.shed += rps - capacity
            svc.lat_samples.append(2000)  # queueing blow-up
        if len(svc.lat_samples) > 1000:
            svc.lat_samples = svc.lat_samples[-1000:]

def reporter(env, svc, label):
    while True:
        yield env.timeout(30)
        p99 = sorted(svc.lat_samples)[int(0.99 * len(svc.lat_samples))] if svc.lat_samples else 0
        # served/shed are totals since the previous report, not per-second rates.
        print(f"[{label}] t={env.now:4.0f}s pods={svc.pods_serving:3d} (+{svc.pods_pending:2d} pending) "
              f"cpu={svc.cpu_pct():5.1f}% served/s={svc.served:7d} shed/s={svc.shed:6d} p99={p99:6.0f}ms")
        svc.served = 0; svc.shed = 0  # reset interval counters

def run(predictive):
    env = simpy.Environment()
    svc = Service(env)
    env.process(workload(env, svc))
    env.process(hpa_loop(env, svc, predictive=predictive))
    env.process(reporter(env, svc, "PRED" if predictive else "HPA "))
    env.run(until=SIM_DURATION_S)

if __name__ == "__main__":
    print("=== Metric-based HPA only ===")
    run(predictive=False)
    print("\n=== Metric-based HPA + predictive pre-warm ===")
    run(predictive=True)
Sample run:
=== Metric-based HPA only ===
[HPA ] t= 30s pods= 10 (+ 0 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[HPA ] t= 60s pods= 10 (+ 0 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[HPA ] t= 90s pods= 10 (+ 0 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[HPA ] t= 120s pods= 10 (+ 8 pending) cpu=100.0% served/s= 29856 shed/s= 1856144 p99= 2000ms
[HPA ] t= 150s pods= 10 (+78 pending) cpu=100.0% served/s= 30000 shed/s= 5320118 p99= 2000ms
[HPA ] t= 180s pods= 10 (+78 pending) cpu=100.0% served/s= 30000 shed/s= 4360112 p99= 2000ms
[HPA ] t= 210s pods= 88 (+ 0 pending) cpu= 99.7% served/s= 263248 shed/s= 752 p99= 1980ms
[HPA ] t= 240s pods= 88 (+ 0 pending) cpu= 12.0% served/s= 105600 shed/s= 0 p99= 40ms
=== Metric-based HPA + predictive pre-warm ===
[PRED] t= 30s pods= 10 (+70 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[PRED] t= 60s pods= 10 (+70 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[PRED] t= 90s pods= 10 (+70 pending) cpu= 12.0% served/s= 120000 shed/s= 0 p99= 40ms
[PRED] t= 120s pods= 80 (+ 0 pending) cpu= 30.0% served/s= 240000 shed/s= 0 p99= 72ms
[PRED] t= 150s pods= 80 (+ 8 pending) cpu= 99.4% served/s= 720000 shed/s= 760118 p99= 1820ms
[PRED] t= 180s pods= 88 (+ 0 pending) cpu= 90.7% served/s= 800000 shed/s= 20112 p99= 1240ms
[PRED] t= 210s pods= 88 (+ 0 pending) cpu= 99.7% served/s= 800000 shed/s= 752 p99= 980ms
[PRED] t= 240s pods= 88 (+ 0 pending) cpu= 12.0% served/s= 105600 shed/s= 0 p99= 40ms
Walking the load-bearing lines:
- COLD_START_S = 90 is the simulated cold-start time, typical for a JVM-based service with a 200 MB image. Knock it down to 5 with a Go binary and the HPA-only path comes much closer to keeping up, but not all the way; even with instant pods the metric-decision lag is 15–30 s.
- if predictive and 0 < env.now < 60 is the predictive pre-warm: we know from history that the Tatkal spike fires at sim-t=100s, and the first sync tick inside the window (t=15 s) starts the pre-warm about 85 s before it (real-world: pre-warm 5 minutes before the known event). With a 90 s cold start the pods come ready at about t=105 s, within seconds of the ramp starting; a production lead time should exceed the cold-start time with margin.
- target = max(svc.pods_serving + 1, int(svc.pods_serving * cpu / SCALE_UP_TARGET)) is the canonical HPA scaling formula, desired = current × currentMetric / targetMetric (truncated here with int() where the real controller uses ceil). It computes the theoretical number of pods needed to bring the metric to the target. The catch: it operates on the current CPU, not the future CPU; on a fast-rising spike, by the time the new pods are serving, the metric is even higher and a second round of scaling is needed.
- svc.lat_samples.append(2000) when capacity is exceeded models the queueing blow-up: past 100% utilisation, latency does not degrade gracefully; it cliffs (/wiki/m-m-1-and-why-utilization-80-hurts).
- elif cpu < SCALE_DOWN_TARGET and svc.pods_serving > 5 scales down only one pod per sync cycle. Most production HPAs are conservative on scale-down because thrashing is more costly than idle capacity. The --horizontal-pod-autoscaler-downscale-stabilization window (default 5 minutes) damps it further in real K8s.
The contrast is the entire pedagogical content: in the sample run the HPA-only path sheds about 11.5M requests (the sum of the shed column from t=120 s to t=210 s) and hits p99 = 2 s; the predictive pre-warm path sheds about 780K and brings p99 back to ~1 s within 60 s. Neither is perfect. Even with pre-warming, the spike's 220K RPS peak is roughly 27× what 80 pods at 100 RPS each can serve, so some shedding is mandatory. The lesson is that predictive scaling moves the served-capacity-over-time graph closer to the offered-load graph, but does not eliminate the need for shedding.
Why the predictive path still sheds: the pre-warmed floor was 80 pods (8K RPS of capacity at 100 RPS per pod), but the peak offered load was 220K RPS, so even pre-warmed, the service is at 100% utilisation during the peak. The right number for WARMED_FLOOR should match the predicted peak with a safety margin (typically 1.2–1.5×), but pushing the floor up means more idle pods 364 days a year. The economic trade-off is between provisioned-capacity cost (₹X/month per idle pod) and shed-request cost (lost transactions × revenue per transaction). At SetuStream, IPL match starts have a known shape; the pre-warm floor is set to predicted peak × 1.3, and the cost-per-month of carrying that capacity is justified by a single avoided outage during the IPL final.
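The sizing rule and the cost comparison can be sketched in a few lines. The function names and the rupee figures below are hypothetical; the peak and per-pod capacity come from the simulator's constants, and the margin is passed as an integer percentage so the ceiling division stays exact.

```python
def warmed_floor(peak_rps: int, per_pod_rps: int, margin_pct: int = 130) -> int:
    """Pods needed to serve peak_rps with a safety margin (integer ceiling division)."""
    return -(-peak_rps * margin_pct // (100 * per_pod_rps))

def prewarm_pays_off(extra_pods: int, cost_per_pod_hour: float, hours_held: float,
                     shed_requests_avoided: int, revenue_per_request: float) -> bool:
    """True when the revenue saved by not shedding exceeds the cost of idle pods."""
    carry_cost = extra_pods * cost_per_pod_hour * hours_held
    avoided_loss = shed_requests_avoided * revenue_per_request
    return avoided_loss > carry_cost

floor = warmed_floor(peak_rps=220_000, per_pod_rps=100)  # margin of 1.3x
print(floor)
# Hypothetical costs: Rs 5 per pod-hour held for 1 hour, Rs 2 lost per shed request.
print(prewarm_pays_off(floor - 80, 5.0, 1, 780_000, 2.0))
```

With these constants the floor that actually covers the peak is 2,860 pods, which is why the 80-pod floor in the simulation still sheds.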
Predictive autoscaling — when the future is knowable
Metric-based scaling reacts to the present. Predictive scaling forecasts the near future and acts now. Three signal sources feed predictive policies:
Calendar / time-of-day patterns. The simplest and most reliable. BharatRail Tatkal opens at 10:00 IST every weekday, with the spike beginning ~30 s before the official open and decaying over 5 minutes. Indian payment systems see a spike between 23:55 and 00:05 on the 1st of every month (rent autopay, EMI deduction, salary credit settlement). BhojanBox/ZaikaApp see lunch (12:30–14:00) and dinner (19:30–22:30) peaks every day. SetuStream's IPL match starts are scheduled weeks in advance. For these, the predictive system reads a calendar (a YAML file, a database table, or a service like Querion Calendar API for human-managed events) and pre-warms capacity N minutes before each scheduled event. The pre-warm time = max(cold-start time, metric-loop reaction time) + safety margin.
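The lead-time rule at the end of that paragraph translates directly into a scheduling calculation. This is a minimal sketch; the function name, the date, and the 30 s reaction time and 60 s margin are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta

def prewarm_start(event_at: datetime, cold_start_s: float,
                  reaction_s: float, safety_margin_s: float) -> datetime:
    """When to begin pre-warming so capacity is ready before event_at."""
    lead = max(cold_start_s, reaction_s) + safety_margin_s
    return event_at - timedelta(seconds=lead)

# Hypothetical calendar entry: a 10:00 window with a 90 s cold start.
event = datetime(2024, 1, 15, 10, 0, 0)
print(prewarm_start(event, cold_start_s=90, reaction_s=30, safety_margin_s=60))
```

For this entry the pre-warm begins at 09:57:30. A scheduler reading the calendar would compute this once per event and set the replica floor at that time.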
Recurring-pattern forecasting. AWS Predictive Scaling, GCP Compute Engine Autoscaler's "Predictive" mode, and KEDA's PredictKube scaler use time-series forecasting over recent historical metrics, typically the last 14–30 days; techniques in this family include Holt-Winters seasonal smoothing and Sociogram Prophet. The model learns hourly, daily, and weekly seasonality; predicts the next hour's load; and pre-scales 5–15 minutes before the predicted peak. The advantage is near-zero manual configuration: you turn it on, the model learns. The disadvantage is that it predicts the expected future based on history; novel events (a marketing campaign, a viral social-media post, a competitor's outage routing traffic to you) are missed entirely. AWS publishes the predicted-vs-actual chart; deviation greater than ~30% means the model is wrong and metric-based scaling is doing all the work.
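The idea can be illustrated with a much simpler stand-in than the models named above. The sketch below is a seasonal-naive forecast (the average of the same slot in previous seasons), not Holt-Winters and not any vendor's algorithm; the function names and data are hypothetical, and the 30% check mirrors the deviation rule of thumb in the text.

```python
from statistics import mean

def seasonal_naive_forecast(history, season_len, seasons=2):
    """Forecast the next point as the mean of the same slot in past seasons."""
    if len(history) < season_len * seasons:
        raise ValueError("need at least `seasons` full seasons of history")
    next_index = len(history)
    same_slot = [history[next_index - k * season_len] for k in range(1, seasons + 1)]
    return mean(same_slot)

def forecast_is_off(predicted, actual, threshold=0.30):
    """True when the relative error against the actual value exceeds the threshold."""
    if actual == 0:
        return predicted != 0
    return abs(predicted - actual) / actual > threshold

# Two hypothetical "days" of four slots each; forecast the first slot of day three.
load = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_naive_forecast(load, season_len=4))
print(forecast_is_off(predicted=100, actual=150))
```

A forecaster like this captures recurring shape and nothing else, which is exactly the limitation described above: a novel event has no matching slot in history.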
Upstream-signal forecasting. A signal from upstream tells you what the next minute looks like. CDN cache hit rate dropping is a leading indicator of origin load increase. API gateway request rate from the mobile-app SDK is a 30-second leading indicator of backend load (the SDK shows the loading spinner before the backend sees the request burst). Push notification dispatch is a 60-second leading indicator of app-open spikes (when BharatBazaar sends a sale-launch notification to 100M users, a fraction will tap it within seconds). For these, the predictive system subscribes to the upstream signal and scales preemptively — not on a schedule, but on evidence that the spike is about to arrive. This is the most powerful form, but it requires building the cross-system observability that most companies lack.
The practical reality: production systems use all three layered. A weekly recurring schedule sets the base capacity for the day. Calendar overrides handle known events (IPL, Tatkal, sale launches, NSE/BSE market open at 09:15). Upstream signals catch the unknowns within their reach. Metric-based reactive scaling catches the rest. Each layer covers what the previous one missed; the system overall has fewer surprises than any single layer alone.
What metric to scale on — and why CPU is usually wrong
The HPA defaults to CPU-percentage as the scaling metric, which is wrong for most modern services. CPU is the right metric only when CPU is the actual bottleneck — true for a CPU-bound microservice doing JSON parsing or cryptographic work, false for almost everything else. The metric you scale on must be causally linked to the SLO you care about.
Wrong metric, real failure mode. A service that mostly waits on a downstream database has low CPU even when its in-flight request count is enormous and its p99 is failing SLO. Scaling on CPU adds zero pods because CPU is fine; meanwhile the queue grows, requests time out, and the downstream gets retry-stormed. PaisaBridge learned this on a payment-gateway service in 2022 — the service's CPU was 25% during a checkout spike, but it was waiting on a slow card-network response; scaling on CPU did nothing while p99 climbed to 4 s. The fix: scale on in-flight request count (a custom metric) or request-queue depth, both of which directly track the SLO.
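Running the HPA formula on both metrics makes the failure concrete. The numbers below are hypothetical (50 pods, 25% CPU against a 70% target, 120 in-flight requests per pod against a target of 40), and the helper is a bare version of the documented formula without tolerance handling.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """desired = ceil(current x currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

pods = 50
on_cpu = desired_replicas(pods, current_metric=25, target_metric=70)
on_inflight = desired_replicas(pods, current_metric=120, target_metric=40)
print(f"scaling on CPU wants {on_cpu} pods")
print(f"scaling on in-flight requests wants {on_inflight} pods")
```

On CPU the controller would ask for 18 pods, a scale-down in the middle of an incident; on in-flight requests it asks for 150.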
Right metrics for common service shapes:
- CPU-bound services (JSON parsing, encryption, image processing, ML inference): cpu_utilization is correct.
- I/O-bound services (most API services that call databases): inflight_requests per pod, or request_queue_depth, or response_time_p99 (Kubernetes-style "external metrics" via Prometheus Adapter).
- Memory-bound services (caches, in-memory analytics): memory_utilization, but be careful: JVM heap usage is misleading because the JVM holds memory it does not actively use.
- Network-bound services (proxies, file-transfer): network_bytes_per_second, often expressed as fraction of NIC capacity.
- Queue-consumer services (worker pools reading from Kafka/SQS/Redis Streams): queue_lag or queue_depth_per_pod (KEDA's bread and butter).
- Connection-bound services (WebSocket, gRPC streaming, persistent SSE): active_connections_per_pod.
The rule: scale on the metric that, when it grows, your SLO breaks. If CPU climbing to 80% is when p99 starts to suffer, scale on CPU at 70%. If queue depth climbing past 50 is when p99 starts to suffer, scale on queue depth at 40. The threshold should be the lower of (the queueing knee at ρ ≈ 0.85, the metric value where SLO breaks). The 80% CPU default is convenient; it is rarely correct.
Common confusions
- "Autoscaling makes load shedding unnecessary." No — the autoscaler's reaction time (60–240 s end to end) is longer than most production spikes (5–60 s). Load shedding handles the 60–240 s gap; autoscaling handles the steady-state shift after. A service with autoscaling but no shedding is one cold-start away from cascading failure.
- "Predictive scaling is always better than reactive." No — predictive only helps when the future is predictable. For a service with truly unpredictable load (a viral social-media post hitting your blog, a competitor's DNS failure routing traffic to you), predictive scaling with a learned model will under-predict and the system falls back on reactive. Predictive augments reactive; it does not replace it.
- "Scaling on CPU is the safe default." No — CPU is the right metric only for CPU-bound services. For I/O-bound services (most modern microservices), CPU under-reads saturation by 5–10×. Scale on the metric that actually tracks the SLO (in-flight requests, queue depth, p99 latency, or queue lag), not the metric that happens to be enabled in the metrics-server out of the box.
- "More aggressive autoscaling (lower target, faster sync) is better." No — aggressive scaling causes thrashing: scale up on a 10-second spike, scale down 30 s later, scale up on the next spike. Each scale event has cost (cold-start latency, autoscaler-controller load, cloud-provider rate limits on instance creation). The right tuning balances responsiveness against thrashing; default sync periods (15 s for HPA, 5 min stabilisation for scale-down) are conservative for a reason.
- "Cluster Autoscaler will add nodes fast enough." No — typical cloud VM provisioning is 45–180 s. If the HPA wants pods that don't fit on existing nodes, the wait is the full VM provisioning time on top of pod cold-start. Pre-provision node capacity (over-provision a small buffer of empty nodes) for spike-tolerant services; the cost is small, the lag improvement is large.
- "Predictive scaling will replace SREs." No — predictive scaling automates the easy case (recurring patterns). The hard part is novel events, dependency outages, capacity-planning trade-offs, cost-vs-availability decisions, and the politics of choosing what to shed first. The autoscaler runs the policy; the SRE picks the policy.
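The thrashing point in the list above is what the scale-down stabilisation window guards against. The sketch below is a simplified model of that behaviour, assuming the controller scales down only as far as the highest recommendation seen within the window; the class and helper names are hypothetical.

```python
from collections import deque

class DownscaleStabilizer:
    """Holds replicas at the highest recommendation seen in the recent window."""

    def __init__(self, window_s: int = 300, sync_period_s: int = 15):
        # One recommendation is recorded per sync tick.
        self.history = deque(maxlen=max(1, window_s // sync_period_s))

    def apply(self, current_replicas: int, recommended: int) -> int:
        self.history.append(recommended)
        if recommended >= current_replicas:
            return recommended  # scale-ups pass straight through
        # Scale down only as far as the highest recommendation in the window.
        return min(current_replicas, max(self.history))

def replay(recommendations, start_replicas, stabilizer):
    """Feed a series of raw recommendations through the stabilizer."""
    replicas, trace = start_replicas, []
    for rec in recommendations:
        replicas = stabilizer.apply(replicas, rec)
        trace.append(replicas)
    return trace

# A short spike followed by calm, with a 60 s window (4 sync ticks).
print(replay([20, 8, 8, 8, 8], 10, DownscaleStabilizer(window_s=60)))
```

The replica count stays at 20 until the spike's recommendation ages out of the window, then drops to 8 in one step instead of flapping.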
Going deeper
KEDA — Kubernetes Event-Driven Autoscaling beyond CPU
Out of the box, with only metrics-server installed, the HPA scales on CPU and memory; anything else needs a custom- or external-metrics adapter. KEDA (Kubernetes Event-Driven Autoscaling, a CNCF Graduated project) packages that adapter with 60+ scalers: Kafka consumer lag, RabbitMQ queue depth, AWS SQS messages, Prometheus query results, MySQL row counts, Redis Streams pending messages. The architectural shift is treating the demand signal as the scaling metric. For a Kafka consumer pool, the right signal is consumer-group lag (for example kafka_consumergroup_lag), which tells you exactly how far behind the workers are. KEDA also supports scale-to-zero for event-driven workloads (no events arriving → drop to zero pods, cold-start the first pod when the next event arrives), which Lambda/Cloud Run already do. The trade-off is the cold-start latency on the first request; for batch/async workloads this is acceptable, for interactive APIs it is not.
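Lag-based scaling reduces to a small calculation. This is a sketch of the idea rather than KEDA's implementation; the function and parameter names are hypothetical, and capping at the partition count reflects the fact that a Kafka consumer group cannot use more active consumers than the topic has partitions.

```python
import math

def replicas_for_lag(total_lag: int, lag_threshold: int, partitions: int,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Consumers needed so each one carries at most lag_threshold messages of lag."""
    if total_lag <= 0:
        return min_replicas            # scale-to-zero when nothing is waiting
    wanted = math.ceil(total_lag / lag_threshold)
    wanted = min(wanted, partitions)   # extra consumers beyond partitions sit idle
    return max(min_replicas, min(wanted, max_replicas))

print(replicas_for_lag(total_lag=0, lag_threshold=100, partitions=12))
print(replicas_for_lag(total_lag=950, lag_threshold=100, partitions=12))
print(replicas_for_lag(total_lag=5000, lag_threshold=100, partitions=12))
```

The three calls show scale-to-zero, proportional scaling, and the partition cap in turn.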
Vertical Pod Autoscaling — when adding more pods is wrong
Horizontal scaling (more pods) is the default, but some workloads scale better vertically (bigger pods). A single-leader workload (a primary database, a Kafka broker, a Redis primary) cannot benefit from more replicas: adding a pod doesn't add primary capacity. The Vertical Pod Autoscaler (VPA) observes pod resource usage over a long window and recommends (or auto-applies) larger CPU/memory requests. The catch: VPA's auto-apply mode has traditionally had to evict and recreate pods to change their requests, so it is disruptive. In-place pod resize, introduced as an alpha feature in Kubernetes 1.27 and beta since 1.33, removes the restart requirement where it is enabled. The decision tree: scale horizontal when work is parallelisable across pods, scale vertical when one pod's capacity matters and parallelism doesn't help.
Cost-aware autoscaling — spot instances and the cost-availability frontier
Autoscaling on AWS / GCP / Azure can mix on-demand and spot/preemptible instances; the latter are 60–90% cheaper but can be reclaimed on short notice (two minutes on AWS, about 30 seconds on GCP and Azure). Cost-aware autoscalers (Karpenter on AWS, the spot-aware mode of Cluster Autoscaler) maintain a target mix, for example 30% on-demand for baseline and 70% spot for elastic capacity. The risk is that during a regional spot shortage, all spot capacity is reclaimed simultaneously, leaving only the 30% on-demand capacity. Karpenter handles this by diversifying across instance families (m5 / c5 / r5 / m6i / c6i) and AZs, reducing the joint-failure probability. The practical pattern at Indian companies running on spot: keep on-demand baseline = predicted minimum + 1 × cold-start-window-of-additional-spot-capacity, so a sudden spot reclamation is masked by the next on-demand scale-up.
Reproducing the simulator with cold-start sweep
# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install simpy
python3 autoscale_sim.py
# Sweep cold-start times to see the impact
for t in 5 30 90 180; do
COLD=$t python3 -c "
import os, autoscale_sim
autoscale_sim.COLD_START_S = int(os.environ['COLD'])
autoscale_sim.run(predictive=False)"
done
The 5 s cold-start (Go service) keeps the metric-based path within 200 ms p99 even at the spike peak; the 180 s cold-start (Java + Spring Boot + a slow @PostConstruct) cannot recover before the spike has passed. The single highest-leverage autoscaling improvement most teams can make is shrinking cold-start; predictive scaling is the second-best improvement only because it is harder to deploy.
When autoscaling makes things worse — feedback loops with downstream
Scaling up a service that calls a downstream creates more load on the downstream. A payment-API service that scales from 50 to 200 pods because of an upstream traffic spike now hits the database with 4× the connection count, possibly exhausting the database's connection pool and creating a downstream outage where there was only an upstream spike. The defence is scaling-aware connection-pool sizing — each pod gets a small connection budget (5–10), the database can handle 200 pods × 5 connections = 1000 connections — and explicit downstream coordination (the database has its own autoscaler, or a read replica fleet that absorbs the read traffic, or a connection-pool middleware like PgBouncer that multiplexes pod connections onto a fixed downstream pool). Autoscaling without downstream coordination just moves the bottleneck; the system is no faster, just differently broken.
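The connection-budget arithmetic is worth wiring into the autoscaler's maximum replica count. A minimal sketch, with hypothetical function names and a hypothetical reserved parameter for admin or migration connections:

```python
def max_pods_for_db(db_max_connections: int, conns_per_pod: int,
                    reserved: int = 0) -> int:
    """Largest pod count whose combined pools fit inside the database's limit."""
    return (db_max_connections - reserved) // conns_per_pod

def connections_at(pods: int, conns_per_pod: int) -> int:
    """Total connections the fleet opens when every pod fills its pool."""
    return pods * conns_per_pod

# The example from the text: 200 pods with 5 connections each.
print(connections_at(200, 5))
# The same database with a 20-connection pool per pod supports far fewer pods.
print(max_pods_for_db(1000, 20))
```

Setting the HPA's maximum replicas at or below this bound keeps a traffic spike from turning into a database outage.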
Where this leads next
Autoscaling is the capacity-side response to demand variance; load shedding (/wiki/load-shedding-strategies) is the demand-side response. They are complements, and the best-engineered systems use both with explicit awareness of which gap each one fills.
- /wiki/load-shedding-strategies: the sibling chapter; what to do during the autoscaler-lag window when shedding holds the line.
- /wiki/headroom-peak-and-degraded-modes: the capacity-planning frame that decides how much headroom to carry vs how much to autoscale into.
- /wiki/m-m-1-and-why-utilization-80-hurts: the queueing-theory reason scaling targets sit at ρ ≈ 0.7–0.85, not 0.95.
- /wiki/littles-law-the-one-formula-everyone-should-know: the formula that translates target latency and offered load into number-of-pods-needed.
- /wiki/coordinated-omission-and-hdr-histograms: the latency-measurement discipline that determines what "p99 SLO is breaking" actually means as a scaling signal.
- /wiki/load-testing-wrk-k6-gatling: the load-testing tools that calibrate autoscaling thresholds before production teaches you what they should have been.
The closing rule: autoscaling is a control system with a 60–240 second response time, and control theory tells you it cannot compensate for a 10-second disturbance. For known disturbances (Tatkal, IPL match starts, payday), use predictive pre-warm. For unknown disturbances within your reaction window, use load shedding. For drift over hours, use metric-based reactive scaling. The composition — predictive plus reactive plus shedding — is what makes a service that survives Mega Bargain Days look effortless from the outside; the absence of any layer is what makes the next-day blameless postmortem necessary.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2 — Methodologies / USE — the saturation-metric framework that motivates "scale on the saturation signal, not CPU".
- Kubernetes documentation, Horizontal Pod Autoscaler walkthrough — the canonical reactive-autoscaler implementation; algorithm and tuning knobs.
- AWS Application Auto Scaling — Predictive Scaling — AWS's machine-learning-based predictive scaling and its accuracy reporting.
- KEDA — Kubernetes Event-Driven Autoscaling — the 60+ external-metric scalers that take HPA past CPU and memory.
- Karpenter — Cluster autoscaler with workload-aware node provisioning — the AWS-native Cluster Autoscaler replacement with cost and diversification awareness.
- Bronson et al., "Metastable Failures in Distributed Systems" (HotOS 2021) — why autoscaling alone cannot prevent retry-storm-driven cascading failures.
- /wiki/load-shedding-strategies: the internal cross-link to the demand-side complement of autoscaling.