Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Autoscaling: metric-based and predictive

At 09:59:55 IST on a weekday, BharatRail Tatkal sits at 12% CPU across 80 pods of the booking API, the steady-state load of overnight scheduling. At 10:00:00 the gates open; in 60 seconds, offered RPS climbs from 4K to 220K. The Horizontal Pod Autoscaler (HPA) is configured for CPU > 70% → scale up. At 10:00:08 CPU crosses 70%; the HPA controller polls the metrics server every 15 seconds, decides to add 40 pods at 10:00:18, the Kubernetes scheduler places them by 10:00:25, the container images pull and start by 10:01:55, the readiness probe passes by 10:02:10. The new pods are serving traffic at 10:02:10, two minutes and ten seconds after the spike began. The spike peaked at 10:01:00. The autoscaler caught up to a wave that had already broken. Every Tatkal-window outage of the last decade is a variation of this story: the lag between signal and served capacity is longer than the spike itself.

Autoscaling moves capacity to track demand, but the controller's reaction lag plus the pod cold-start time is almost always longer than the duration of the spikes you most care about. Metric-based scaling (HPA, Cluster Autoscaler) reacts to current load with a 60–180 second lag and works for slow drifts; predictive scaling uses time-of-day, calendar, or learned models to pre-warm capacity before a known spike. The competent production system uses both — predictive for known shapes (IPL match start, Tatkal hour, payday), metric-based for the unknown — and pairs them with load shedding for the gap that neither covers.

What the autoscaler actually does, and why every step takes seconds

A Kubernetes HPA (Horizontal Pod Autoscaler) is the canonical reactive autoscaler — and almost every cloud autoscaler (AWS Application Auto Scaling, GCP Autoscaler, Azure VMSS, even Lambda's concurrency-based scaling) shares its loop structure. Understanding the loop's stages — and how long each stage takes — is the difference between trusting the autoscaler and being surprised by it.

Stage 1: Metric collection (5–60 s window). The metrics-server (or Prometheus, or Datadog, or CloudWatch) scrapes pod-level CPU, memory, or custom metrics on a poll interval, typically 15 s for metrics-server and 30–60 s for Prometheus. Each pod is scraped individually, then the values are aggregated. A spike that starts at t=0 might not be visible to the controller until t=15 s, depending on where in the scrape cycle it lands.

Stage 2: Decision (15 s default poll). The HPA controller polls the metric every --horizontal-pod-autoscaler-sync-period (default 15 s in modern K8s, was 30 s in older versions). Each poll runs the algorithm: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Stabilisation windows (--horizontal-pod-autoscaler-downscale-stabilization, default 5 minutes for scale-down, configurable for scale-up) damp out flapping.
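The decision formula is easy to check by hand. A minimal sketch follows, assuming nothing beyond the formula quoted above; the function name is illustrative, and the real controller also applies a tolerance band and min/max replica bounds that are left out here.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 80 pods averaging 95% CPU against a 70% target: 80 * 95 / 70 = 108.57, rounded up
print(desired_replicas(80, 95, 70))   # 109
# 10 pods averaging 35% CPU against a 70% target: half the pods would do
print(desired_replicas(10, 35, 70))   # 5
```

The formula only sees the metric at poll time, so a single round can never ask for more than the current overshoot implies.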

Stage 3: Scheduling (1–5 s). New pod specs are created in the API server; the scheduler picks nodes for them. If existing nodes have headroom, this is fast. If not, the Cluster Autoscaler (CA) kicks in — and now you wait for cloud-provider VM provisioning, which is 45–180 seconds for most clouds (AWS m5/c5/r5 family takes ~90 s from API call to Ready node).

Stage 4: Image pull (5–120 s). If the pod's container image is not on the node's local cache, it pulls from a registry. A 200 MB image over a 1 Gbps node network is ~2 seconds; the same image cold-pulled from ECR/GCR/Docker Hub through a NAT gateway can be 30–60 s. Image cache warmth is the difference between fast and slow scale.
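The pull-time arithmetic can be reproduced in a few lines. This is a rough sketch that ignores decompression and registry latency; the 0.05 Gbps figure for a congested NAT path is an assumed effective throughput, not a measured one.

```python
def pull_seconds(image_mb: float, effective_gbps: float) -> float:
    """Transfer time for an image of image_mb megabytes at the given effective throughput."""
    bits = image_mb * 8 * 1_000_000
    return bits / (effective_gbps * 1_000_000_000)

print(round(pull_seconds(200, 1.0), 1))    # 1.6  (the "~2 seconds" case)
print(round(pull_seconds(200, 0.05), 1))   # 32.0 (assumed 50 Mbps effective through a NAT gateway)
```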

Stage 5: Application startup (1–120 s). The container runs its entrypoint. A Go binary is ready in ~50 ms. A Java service with Spring Boot, JIT warm-up, and a 200 MB heap is ready in 30–90 s. A Node.js service with a large dependency tree is 5–15 s. The readiness probe must pass before the pod gets traffic — and most readiness probes have a 10 s initialDelaySeconds plus a 5 s period.

Stage 6: Load-balancer registration (5–30 s). The Service controller updates the Endpoints object; kube-proxy on each node updates iptables/IPVS rules; if you are using a cloud LB (ALB, GCP LB), the LB's health-check has its own poll interval before the new pod gets traffic.

The end-to-end lag from spike start to "the new pod is serving real requests" is the sum of every stage — typically 90–240 seconds for a well-tuned setup, 5–10 minutes for a default configuration. This is the scale-up lag; it is the budget within which load shedding (/wiki/load-shedding-strategies) and degraded-mode fallbacks (/wiki/headroom-peak-and-degraded-modes) must hold the line.

[Figure: The six stages of HPA + Cluster Autoscaler lag from spike to served capacity. Top panel: six horizontal bars with typical durations (metric collection 15 s, decision 15 s, scheduling 5 s, image pull 30 s, app startup 60 s, LB registration 15 s; total 140 s). Bottom panel: offered load versus served capacity over time, with the spike peaking well before the autoscaler catches up at t=140 s; the gap is labelled the load-shedding window.]
The autoscaler lag (top) is the sum of six independent stages, of which application startup is usually the longest. Bottom: a typical spike peaks long before the autoscaler reaches the new steady state — the gap is the window within which load shedding, queue-based admission control, or pre-warmed capacity must hold the line. Illustrative, durations vary by stack.
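The lag budget is plain addition, which makes it worth writing down for your own stack. The sketch below uses the typical stage durations from the figure; the 20 s spike peak is an illustrative value, and the helper names are made up for this example.

```python
# Typical per-stage durations in seconds (illustrative, from the figure).
STAGES_S = {
    "metric collection": 15,
    "decision": 15,
    "scheduling": 5,
    "image pull": 30,
    "app startup": 60,
    "LB registration": 15,
}

def total_lag(stages):
    """End-to-end lag is the sum of every stage."""
    return sum(stages.values())

def cold_start_share(stages):
    """Fraction of the lag spent on image pull plus application startup."""
    return (stages["image pull"] + stages["app startup"]) / total_lag(stages)

def shedding_window(spike_peak_s, lag_s):
    """Seconds between the spike peak and the moment new capacity serves traffic."""
    return max(0, lag_s - spike_peak_s)

print(total_lag(STAGES_S))                        # 140
print(round(cold_start_share(STAGES_S), 2))       # 0.64
print(shedding_window(20, total_lag(STAGES_S)))   # 120
```

Swapping in measured numbers for each stage shows quickly which stage is worth attacking first.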

Why pod cold-start dominates: of the six stages, image pull and app startup combined are 30–180 seconds; the rest are 30–60 seconds total. JVM-based services with Spring Boot routinely hit 90 s startup because of classpath scanning, bean construction, JIT warm-up, and connection-pool initialisation. Go and Rust services start in tens of milliseconds and shift the bottleneck to image pull. The biggest single autoscaling-lag improvement most teams can make is not tuning the HPA; it is shrinking the container image (multi-stage Docker builds, distroless base images) and pre-warming on the node (DaemonSet image-pre-pullers, or imagePullPolicy: IfNotPresent with a pre-baked AMI).

Building a runnable HPA simulator with metric-based and predictive policies

To see how the policies behave, we simulate a service with a controllable cold-start time and run two policies — pure HPA-style threshold scaling, and a predictive policy that pre-warms based on time-of-day. The simulator is simpy-based and reproducible on any laptop.

# autoscale_sim.py — compare metric-based and predictive autoscaling on a Tatkal-shaped spike
# Run: pip install simpy && python3 autoscale_sim.py
import simpy, random

PER_POD_RPS_CAP   = 100      # each pod can serve 100 RPS at p99 SLO
COLD_START_S      = 90       # time from "create pod" to "pod serves traffic"
HPA_SYNC_PERIOD_S = 15
SCALE_UP_TARGET   = 70       # add pods when CPU > 70%
SCALE_DOWN_TARGET = 30
SIM_DURATION_S    = 600
WARMED_FLOOR      = 80       # predictive: keep at least this many pods 09:55–10:05

def offered_rps(t):
    """BharatRail Tatkal shape: 4K baseline, ramps to 220K over t=100..130s, decays to ~50K by t=200s, calm after t=250s."""
    if t < 100:    return 4_000
    if t < 130:    return 4_000 + (t - 100) * 7_200          # 4K -> 220K
    if t < 200:    return 220_000 - (t - 130) * 2_500        # peak then decay
    if t < 250:    return max(50_000, 220_000 - (t - 130) * 2_500)
    return 8_000 + random.gauss(0, 1_000)                    # post-rush calm

class Service:
    def __init__(self, env, initial_pods=10):
        self.env = env
        self.pods_serving = initial_pods   # ready pods
        self.pods_pending = 0              # cold-starting pods
        self.served = 0
        self.shed   = 0
        self.lat_samples = []

    def cpu_pct(self):
        capacity = self.pods_serving * PER_POD_RPS_CAP
        if capacity == 0: return 100
        return min(100, 100 * offered_rps(self.env.now) / capacity)

    def add_pods(self, n):
        self.pods_pending += n
        self.env.process(self._cold_start(n))

    def _cold_start(self, n):
        yield self.env.timeout(COLD_START_S)
        self.pods_pending -= n
        self.pods_serving += n

def hpa_loop(env, svc, predictive=False):
    while True:
        yield env.timeout(HPA_SYNC_PERIOD_S)
        cpu = svc.cpu_pct()
        # metric-based decision
        if cpu > SCALE_UP_TARGET:
            target = max(svc.pods_serving + 1,
                         int(svc.pods_serving * cpu / SCALE_UP_TARGET))
            need = target - svc.pods_serving - svc.pods_pending
            if need > 0: svc.add_pods(need)
        elif cpu < SCALE_DOWN_TARGET and svc.pods_serving > 5:
            svc.pods_serving = max(5, svc.pods_serving - 1)
        # predictive: pre-warm at 09:55 (sim t=0..60s window before spike)
        if predictive and 0 < env.now < 60 and svc.pods_serving < WARMED_FLOOR:
            need = WARMED_FLOOR - svc.pods_serving - svc.pods_pending
            if need > 0: svc.add_pods(need)

def workload(env, svc):
    while True:
        yield env.timeout(1)
        rps = max(0, offered_rps(env.now))
        capacity = svc.pods_serving * PER_POD_RPS_CAP
        if rps <= capacity:
            svc.served += rps
            svc.lat_samples.append(40 + 200 * (rps / max(1, capacity))**4)
        else:
            svc.served += capacity
            svc.shed   += rps - capacity
            svc.lat_samples.append(2000)   # queueing blow-up
        if len(svc.lat_samples) > 1000:
            svc.lat_samples = svc.lat_samples[-1000:]

def reporter(env, svc, label):
    while True:
        yield env.timeout(30)
        p99 = sorted(svc.lat_samples)[int(0.99 * len(svc.lat_samples))] if svc.lat_samples else 0
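        # served/shed are totals over the 30 s reporting window; cast to int because
        # offered_rps() returns floats once the Gaussian noise term kicks in (t >= 250)
        print(f"[{label}] t={env.now:4.0f}s  pods={svc.pods_serving:3d} (+{svc.pods_pending:2d} pending)  "
              f"cpu={svc.cpu_pct():5.1f}%  served/s={int(svc.served):7d}  shed/s={int(svc.shed):6d}  p99={p99:6.0f}ms")
        svc.served = 0; svc.shed = 0   # reset window counters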

def run(predictive):
    random.seed(42)                    # fixed seed so the post-rush noise is reproducible
    env = simpy.Environment()
    svc = Service(env)
    env.process(workload(env, svc))
    env.process(hpa_loop(env, svc, predictive=predictive))
    env.process(reporter(env, svc, "PRED" if predictive else "HPA "))
    env.run(until=SIM_DURATION_S)

if __name__ == "__main__":
    print("=== Metric-based HPA only ===")
    run(predictive=False)
    print("\n=== Metric-based HPA + predictive pre-warm ===")
    run(predictive=True)

Sample run:

=== Metric-based HPA only ===
[HPA ] t=  30s  pods= 10 (+ 0 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[HPA ] t=  60s  pods= 10 (+ 0 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[HPA ] t=  90s  pods= 10 (+ 0 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[HPA ] t= 120s  pods= 10 (+ 8 pending)  cpu=100.0%  served/s=  29856  shed/s= 1856144  p99=  2000ms
[HPA ] t= 150s  pods= 10 (+78 pending)  cpu=100.0%  served/s=  30000  shed/s= 5320118  p99=  2000ms
[HPA ] t= 180s  pods= 10 (+78 pending)  cpu=100.0%  served/s=  30000  shed/s= 4360112  p99=  2000ms
[HPA ] t= 210s  pods= 88 (+ 0 pending)  cpu= 99.7%  served/s= 263248  shed/s=    752  p99=  1980ms
[HPA ] t= 240s  pods= 88 (+ 0 pending)  cpu= 12.0%  served/s= 105600  shed/s=     0  p99=    40ms

=== Metric-based HPA + predictive pre-warm ===
[PRED] t=  30s  pods= 10 (+70 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[PRED] t=  60s  pods= 10 (+70 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[PRED] t=  90s  pods= 10 (+70 pending)  cpu= 12.0%  served/s= 120000  shed/s=     0  p99=    40ms
[PRED] t= 120s  pods= 80 (+ 0 pending)  cpu= 30.0%  served/s= 240000  shed/s=     0  p99=    72ms
[PRED] t= 150s  pods= 80 (+ 8 pending)  cpu= 99.4%  served/s= 720000  shed/s= 760118  p99=  1820ms
[PRED] t= 180s  pods= 88 (+ 0 pending)  cpu= 90.7%  served/s= 800000  shed/s=  20112  p99=  1240ms
[PRED] t= 210s  pods= 88 (+ 0 pending)  cpu= 99.7%  served/s= 800000  shed/s=    752  p99=   980ms
[PRED] t= 240s  pods= 88 (+ 0 pending)  cpu= 12.0%  served/s= 105600  shed/s=     0  p99=    40ms

Walking the load-bearing lines.

COLD_START_S = 90 is the simulated cold-start time, typical for a JVM-based service with a 200 MB image. Knock it down to 5 with a Go binary and the HPA-only path comes much closer to keeping up, but not all the way; even with instant pods the metric-decision lag is 15–30 s.

if predictive and 0 < env.now < 60 is the predictive pre-warm: we know from history that the Tatkal spike fires at sim-t=100s, so the first sync tick inside the window (t=15 s) requests the pre-warm floor 85 s ahead of it (real-world: pre-warm 5 minutes before the known event). With a 90 s cold start those pods become ready at t=105 s, a few seconds into the ramp and well before the peak at t=130 s.

target = max(svc.pods_serving + 1, int(svc.pods_serving * cpu / SCALE_UP_TARGET)) follows the HPA scaling formula, desired = current × currentMetric / targetMetric, and computes the theoretical number of pods needed to bring the metric to the target. (The real controller rounds up with ceil; the simulator truncates with int and relies on the +1 floor.) The catch: it operates on the current CPU, not the future CPU; on a fast-rising spike, by the time the new pods are serving, the metric is even higher and a second round of scaling is needed.

svc.lat_samples.append(2000) when capacity is exceeded models the queueing blow-up: past 100% utilisation, latency does not degrade gracefully; it cliffs (/wiki/m-m-1-and-why-utilization-80-hurts).

elif cpu < SCALE_DOWN_TARGET and svc.pods_serving > 5 scales down only one pod at a time per sync cycle; most production HPAs are conservative on scale-down because thrashing is more costly than idle capacity. The --horizontal-pod-autoscaler-downscale-stabilization window (default 5 minutes) damps it further in real K8s.

The contrast is the entire pedagogical content: in the sample run the HPA-only path sheds about 11.5M requests between t=90 s and t=210 s and hits p99=2s; the predictive-pre-warm path sheds 780K and brings p99 back to ~1s within 60 s. Neither is perfect. Even with pre-warming, the spike's peak (220K RPS) exceeds the pre-warmed capacity (80 pods × 100 RPS = 8K RPS) by roughly 27×, so some shedding is mandatory. The lesson is that predictive scaling moves the steady-state-served-capacity-by-time graph closer to the offered-load graph, but does not eliminate the need for shedding entirely.

Why the predictive path still sheds: the pre-warmed floor was 80 pods (8K RPS capacity at 100 RPS per pod), but the peak offered load was 220K RPS, so even pre-warmed, the service is at 100% utilisation during the peak. The right number for WARMED_FLOOR should match the predicted peak with a safety margin (typically 1.2–1.5×), but pushing the floor up means more idle pods 364 days a year. The economic trade-off is between provisioned-capacity cost (₹X/month per idle pod) and shed-request cost (lost transactions × revenue per transaction). At SetuStream, IPL match starts have a known shape; the pre-warm floor is set to predicted peak × 1.3, and the cost-per-month of carrying that capacity is justified by a single avoided outage during the IPL final.
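Sizing the floor is a one-line calculation once the predicted peak and per-pod capacity are known. The sketch below uses the simulator's numbers (220K RPS peak, 100 RPS per pod) and a 1.3× margin; the function names are illustrative, and the cost comparison is only the skeleton of the trade-off with whatever prices you supply.

```python
def warmed_floor(predicted_peak_rps: int, per_pod_rps: int, margin_pct: int = 130) -> int:
    """Pods needed to carry the predicted peak with a safety margin (integer ceiling division)."""
    needed = predicted_peak_rps * margin_pct
    per_pod = per_pod_rps * 100
    return -(-needed // per_pod)

def prewarm_pays_off(idle_pod_cost: float, extra_pods: int,
                     shed_requests: int, revenue_per_request: float) -> bool:
    """True when carrying the extra pods costs less than the requests they would save."""
    return idle_pod_cost * extra_pods < shed_requests * revenue_per_request

print(warmed_floor(220_000, 100))        # 2860
print(warmed_floor(220_000, 100, 100))   # 2200 (no margin)
```

Against those figures, the simulator's floor of 80 pods is deliberately far too small, which is why the predictive run still sheds.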

Predictive autoscaling — when the future is knowable

Metric-based scaling reacts to the present. Predictive scaling forecasts the near future and acts now. Three signal sources feed predictive policies:

Calendar / time-of-day patterns. The simplest and most reliable. BharatRail Tatkal opens at 10:00 IST every weekday, with the spike beginning ~30 s before the official open and decaying over 5 minutes. Indian payment systems see a spike between 23:55 and 00:05 on the 1st of every month (rent autopay, EMI deduction, salary credit settlement). BhojanBox/ZaikaApp see lunch (12:30–14:00) and dinner (19:30–22:30) peaks every day. SetuStream's IPL match starts are scheduled weeks in advance. For these, the predictive system reads a calendar (a YAML file, a database table, or a service like Querion Calendar API for human-managed events) and pre-warms capacity N minutes before each scheduled event. The pre-warm time = max(cold-start time, metric-loop reaction time) + safety margin.
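The pre-warm rule translates directly into a schedule calculation. A minimal sketch, assuming a 90 s cold start, a 30 s metric-loop reaction time, and a 60 s safety margin; the date and the function name are placeholders.

```python
from datetime import datetime, timedelta

def prewarm_start(event_time: datetime, cold_start_s: int,
                  reaction_s: int, margin_s: int) -> datetime:
    """Pre-warm lead = max(cold-start time, metric-loop reaction time) + safety margin."""
    lead_s = max(cold_start_s, reaction_s) + margin_s
    return event_time - timedelta(seconds=lead_s)

tatkal_open = datetime(2024, 1, 15, 10, 0, 0)   # placeholder date, 10:00 open
start = prewarm_start(tatkal_open, cold_start_s=90, reaction_s=30, margin_s=60)
print(start.time())   # 09:57:30
```

In practice the margin would be generous (the article's 5 minutes) because the cost of starting early is a few idle pod-minutes.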

Recurring-pattern forecasting. AWS Predictive Scaling, GCP Compute Engine Autoscaler's "Predictive" mode, and Kubernetes KEDA's predictive scaler use time-series forecasting — typically Holt-Winters seasonal decomposition or Sociogram Prophet — over the last 14–30 days of historical metrics. The model learns hourly, daily, and weekly seasonality; predicts the next hour's load; and pre-scales 5–15 minutes before the predicted peak. The advantage is zero manual configuration: you turn it on, the model learns. The disadvantage is that it predicts the expected future based on history; novel events (a marketing campaign, a viral social-media post, a competitor's outage routing traffic to you) are missed entirely. AWS publishes the predicted-vs-actual chart; deviation greater than ~30% means the model is wrong and metric-based scaling is doing all the work.
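To make the idea concrete without pulling in a forecasting library, here is a seasonal-naive forecast: a deliberately simpler stand-in for Holt-Winters that averages the values seen at the same slot in earlier seasons. The data and names are hypothetical; the 30% deviation check mirrors the rule of thumb above.

```python
from statistics import mean

def seasonal_naive_forecast(history, season_len, slot):
    """Mean of the values observed at position `slot` in each past season."""
    same_slot = history[slot::season_len]
    return mean(same_slot)

def relative_deviation(predicted, actual):
    """Absolute forecast error as a fraction of the prediction."""
    return abs(actual - predicted) / predicted

# Two toy "days" of four samples each (hypothetical RPS values).
history = [10_000, 20_000, 30_000, 40_000,
           12_000, 22_000, 32_000, 42_000]

predicted = seasonal_naive_forecast(history, season_len=4, slot=1)
print(predicted)                                    # 21000
print(relative_deviation(predicted, 30_000) > 0.30) # True: the model missed, reactive scaling is carrying the load
```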

Upstream-signal forecasting. A signal from upstream tells you what the next minute looks like. CDN cache hit rate dropping is a leading indicator of origin load increase. API gateway request rate from the mobile-app SDK is a 30-second leading indicator of backend load (the SDK shows the loading spinner before the backend sees the request burst). Push notification dispatch is a 60-second leading indicator of app-open spikes (when BharatBazaar sends a sale-launch notification to 100M users, a fraction will tap it within seconds). For these, the predictive system subscribes to the upstream signal and scales preemptively — not on a schedule, but on evidence that the spike is about to arrive. This is the most powerful form, but it requires building the cross-system observability that most companies lack.
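An upstream signal becomes a scaling input once it is converted into expected request rate. The sketch below does that for a push-notification fan-out; the tap fraction, tap window, and requests per app-open are all assumed values that a real system would estimate from past campaigns.

```python
import math

def pods_for_push(notifications_sent: int, tap_fraction: float, tap_window_s: int,
                  requests_per_open: int, per_pod_rps: int) -> int:
    """Pods needed for the app-open burst expected from one notification fan-out."""
    opens_per_s = notifications_sent * tap_fraction / tap_window_s
    return math.ceil(opens_per_s * requests_per_open / per_pod_rps)

# Assumptions: 2% of 100M recipients tap within 60 s, each open makes 5 backend requests,
# and one pod serves 100 RPS.
print(pods_for_push(100_000_000, 0.02, 60, 5, 100))   # 1667
```

The dispatch event arrives about a minute before the load does, which is roughly one cold start of warning.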

The practical reality: production systems use all three layered. A weekly recurring schedule sets the base capacity for the day. Calendar overrides handle known events (IPL, Tatkal, sale launches, NSE/BSE market open at 09:15). Upstream signals catch the unknowns within their reach. Metric-based reactive scaling catches the rest. Each layer covers what the previous one missed; the system overall has fewer surprises than any single layer alone.

[Figure: Layered predictive autoscaling, with metric-based reactive scaling at the base; each layer covers what the layer below missed. Layer 3: upstream signals (30–60 s lead; CDN miss rate, push notification fan-out, mobile-SDK request burst) catch viral or novel spikes. Layer 2: calendar / scheduled (5 min lead; IPL match, Tatkal 10:00, NSE 09:15, salary day, sale launch) catches known events. Layer 1: metric-based reactive (60–180 s lag; HPA on CPU / memory / custom metric, Cluster Autoscaler on pending pods) catches the rest.]
Three layers of autoscaling; each operates on a different time horizon. The reactive HPA layer is always on; calendar and upstream-signal layers add pre-warmed capacity when the future is knowable. The unhandled gap shrinks with each layer added; load shedding handles whatever still gets through. Illustrative.

What metric to scale on — and why CPU is usually wrong

The HPA defaults to CPU-percentage as the scaling metric, which is wrong for most modern services. CPU is the right metric only when CPU is the actual bottleneck — true for a CPU-bound microservice doing JSON parsing or cryptographic work, false for almost everything else. The metric you scale on must be causally linked to the SLO you care about.

Wrong metric, real failure mode. A service that mostly waits on a downstream database has low CPU even when its in-flight request count is enormous and its p99 is failing SLO. Scaling on CPU adds zero pods because CPU is fine; meanwhile the queue grows, requests time out, and the downstream gets retry-stormed. PaisaBridge learned this on a payment-gateway service in 2022 — the service's CPU was 25% during a checkout spike, but it was waiting on a slow card-network response; scaling on CPU did nothing while p99 climbed to 4 s. The fix: scale on in-flight request count (a custom metric) or request-queue depth, both of which directly track the SLO.

Right metrics for common service shapes:

  • CPU-bound services (JSON parsing, encryption, image processing, ML inference): cpu_utilization is correct.
  • I/O-bound services (most API services that call databases): inflight_requests per pod, or request_queue_depth, or response_time_p99 (Kubernetes-style "external metrics" via Prometheus Adapter).
  • Memory-bound services (caches, in-memory analytics): memory_utilization, but be careful — JVM heap usage is misleading because the JVM holds memory it does not actively use.
  • Network-bound services (proxies, file-transfer): network_bytes_per_second, often expressed as fraction of NIC capacity.
  • Queue-consumer services (worker pools reading from Kafka/SQS/Redis Streams): queue_lag or queue_depth_per_pod (KEDA's bread and butter).
  • Connection-bound services (WebSocket, gRPC streaming, persistent SSE): active_connections_per_pod.

The rule: scale on the metric that, when it grows, your SLO breaks. If CPU climbing to 80% is when p99 starts to suffer, scale on CPU at 70%. If queue depth climbing past 50 is when p99 starts to suffer, scale on queue depth at 40. The threshold should be the lower of (the queueing knee at ρ ≈ 0.85, the metric value where SLO breaks). The 80% CPU default is convenient; it is rarely correct.
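The PaisaBridge-style failure is visible in the arithmetic. This sketch compares a utilisation-based decision with an average-value decision on in-flight requests; the pod count, in-flight numbers, and per-pod target are hypothetical.

```python
import math

def desired_by_utilization(current_replicas: int, current_pct: float, target_pct: float) -> int:
    """CPU-style scaling: ratio of observed to target utilisation."""
    return math.ceil(current_replicas * current_pct / target_pct)

def desired_by_average_value(total_metric: float, target_per_pod: float) -> int:
    """Custom-metric scaling: enough pods that each carries at most target_per_pod."""
    return math.ceil(total_metric / target_per_pod)

pods = 20
# CPU sits at 25% because the service is waiting on a slow downstream.
print(desired_by_utilization(pods, 25, 70))        # 8  -> no scale-up; CPU even argues for fewer pods
# Meanwhile each pod holds 120 in-flight requests against a target of 40.
print(desired_by_average_value(pods * 120, 40))    # 60 -> triple the fleet
```

Same moment, same service, opposite decisions: the metric choice matters more than any tuning of the loop.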

Common confusions

  • "Autoscaling makes load shedding unnecessary." No — the autoscaler's reaction time (60–240 s end to end) is longer than most production spikes (5–60 s). Load shedding handles the 60–240 s gap; autoscaling handles the steady-state shift after. A service with autoscaling but no shedding is one cold-start away from cascading failure.
  • "Predictive scaling is always better than reactive." No — predictive only helps when the future is predictable. For a service with truly unpredictable load (a viral social-media post hitting your blog, a competitor's DNS failure routing traffic to you), predictive scaling with a learned model will under-predict and the system falls back on reactive. Predictive augments reactive; it does not replace it.
  • "Scaling on CPU is the safe default." No — CPU is the right metric only for CPU-bound services. For I/O-bound services (most modern microservices), CPU under-reads saturation by 5–10×. Scale on the metric that actually tracks the SLO (in-flight requests, queue depth, p99 latency, or queue lag), not the metric that happens to be enabled in the metrics-server out of the box.
  • "More aggressive autoscaling (lower target, faster sync) is better." No — aggressive scaling causes thrashing: scale up on a 10-second spike, scale down 30 s later, scale up on the next spike. Each scale event has cost (cold-start latency, autoscaler-controller load, cloud-provider rate limits on instance creation). The right tuning balances responsiveness against thrashing; default sync periods (15 s for HPA, 5 min stabilisation for scale-down) are conservative for a reason.
  • "Cluster Autoscaler will add nodes fast enough." No — typical cloud VM provisioning is 45–180 s. If the HPA wants pods that don't fit on existing nodes, the wait is the full VM provisioning time on top of pod cold-start. Pre-provision node capacity (over-provision a small buffer of empty nodes) for spike-tolerant services; the cost is small, the lag improvement is large.
  • "Predictive scaling will replace SREs." No — predictive scaling automates the easy case (recurring patterns). The hard part is novel events, dependency outages, capacity-planning trade-offs, cost-vs-availability decisions, and the politics of choosing what to shed first. The autoscaler runs the policy; the SRE picks the policy.
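The scale-down stabilisation window mentioned above is what prevents thrashing: the controller acts on the highest recommendation seen within the window, so one quiet poll cannot shrink the fleet. A minimal sketch of that idea, with an illustrative class name and no claim to match the controller's internals:

```python
from collections import deque

class ScaleDownStabilizer:
    """Return the highest replica recommendation seen in the last window_s seconds."""

    def __init__(self, window_s: int = 300):
        self.window_s = window_s
        self.history = deque()   # (timestamp_s, recommended_replicas)

    def recommend(self, now_s: int, desired: int) -> int:
        self.history.append((now_s, desired))
        # Drop recommendations that have aged out of the window.
        while self.history[0][0] <= now_s - self.window_s:
            self.history.popleft()
        return max(r for _, r in self.history)

s = ScaleDownStabilizer(window_s=300)
print(s.recommend(0, 50))     # 50
print(s.recommend(15, 10))    # 50  (the spike 15 s ago still holds the floor up)
print(s.recommend(300, 10))   # 10  (the 50 has aged out)
print(s.recommend(315, 80))   # 80  (scale-up passes straight through)
```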

Going deeper

KEDA — Kubernetes Event-Driven Autoscaling beyond CPU

Out of the box the HPA has built-in resource metrics only for CPU and memory; anything else requires a custom or external metrics adapter. KEDA (Kubernetes Event-Driven Autoscaling, a CNCF Graduated project) packages that adapter with 60+ scalers: Kafka consumer lag, RabbitMQ queue depth, AWS SQS messages, Prometheus query results, MySQL row counts, Redis Streams pending messages. The architectural shift is treating demand signal as the scaling metric. For a Kafka consumer pool, the right scaler is kafka_consumergroup_lag, which tells you exactly how far behind the workers are. KEDA also supports scale-to-zero for event-driven workloads (no events arriving → drop to zero pods, cold-start the first pod when the next event arrives), which Lambda/Cloud Run already do. The trade-off is the cold-start latency on the first request; for batch/async workloads this is acceptable, for interactive APIs it is not.
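Lag-based scaling reduces to a small function. The sketch below is loosely modelled on how lag-driven scalers behave and is not KEDA's code; the threshold, cap, and names are assumptions. The partition cap reflects the fact that consumers beyond the partition count in a Kafka consumer group sit idle.

```python
import math

def consumer_replicas(total_lag: int, lag_threshold: int, partitions: int,
                      min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Replicas for a consumer pool: one per lag_threshold messages of backlog, capped."""
    if total_lag == 0:
        return min_replicas                      # scale to zero when there is nothing to do
    wanted = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(wanted, partitions, max_replicas))

print(consumer_replicas(0, 100, partitions=32))      # 0
print(consumer_replicas(250, 100, partitions=32))    # 3
print(consumer_replicas(5000, 100, partitions=32))   # 32 (capped by partition count)
```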

Vertical Pod Autoscaling — when adding more pods is wrong

Horizontal scaling (more pods) is the default, but some workloads scale better vertically (bigger pods). A single-leader workload (a primary database, a Kafka broker, a Redis primary) cannot benefit from more replicas: adding a pod doesn't add primary capacity. The Vertical Pod Autoscaler (VPA) observes pod resource usage over a long window and recommends (or auto-applies) larger CPU/memory requests. The catch: VPA's auto-apply mode has traditionally evicted and recreated pods to change their requests (a running pod's resource requests were immutable), so it is disruptive. Kubernetes 1.27 introduced in-place pod resize as an alpha feature (the InPlacePodVerticalScaling feature gate), which removes the restart requirement on clusters where it is enabled. The decision tree: scale horizontal when work is parallelisable across pods, scale vertical when one pod's capacity matters and parallelism doesn't help.
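The "observe usage, recommend a request" step can be illustrated with a percentile plus headroom. This is a simplified sketch and not the VPA recommender's actual algorithm; the 90th percentile and 15% headroom are assumed parameters, and usage is in integer millicores.

```python
def recommend_request(usage_millicores, percentile_pct: int = 90, headroom_pct: int = 115) -> int:
    """Recommend a CPU request: a high percentile of observed usage plus headroom."""
    ordered = sorted(usage_millicores)
    # Nearest-rank percentile using integer ceiling division.
    rank = -(-percentile_pct * len(ordered) // 100)
    value = ordered[max(0, rank - 1)]
    return -(-value * headroom_pct // 100)

samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]   # hypothetical usage samples
print(recommend_request(samples))   # 1035  (p90 = 900 millicores, plus 15%)
```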

Cost-aware autoscaling — spot instances and the cost-availability frontier

Autoscaling on AWS / GCP / Azure can mix on-demand and spot/preemptible instances; the latter are 60–90% cheaper but can be reclaimed at short notice (2 minutes on AWS, about 30 seconds on GCP and Azure). Cost-aware autoscalers (Karpenter on AWS, the spot-aware mode of Cluster Autoscaler) maintain a target mix, for example 30% on-demand for baseline and 70% spot for elastic capacity. The risk is that during a regional spot-shortage, all spot capacity is reclaimed simultaneously, leaving 30% capacity. Karpenter handles this by diversifying across instance families (m5 / c5 / r5 / m6i / c6i) and AZs, reducing the joint-failure probability. The practical pattern at Indian companies running on spot: keep on-demand baseline = predicted minimum + 1 × cold-start-window-of-additional-spot-capacity, so a sudden spot reclamation is masked by the next on-demand scale-up.
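Why diversification helps is clearest with a worst-case count. The sketch below compares a 30/70 mix where all spot capacity sits in one pool against the same mix spread over five pools; the pool names and sizes are hypothetical.

```python
def surviving_pods(on_demand: int, spot_pools: dict, reclaimed: set) -> int:
    """Pods left after the named spot pools are reclaimed."""
    return on_demand + sum(n for pool, n in spot_pools.items() if pool not in reclaimed)

single_pool = {"m5/az-a": 70}
diversified = {"m5/az-a": 14, "c5/az-a": 14, "r5/az-b": 14, "m6i/az-b": 14, "c6i/az-c": 14}

# The same reclamation event hits the m5 pool in az-a in both setups.
print(surviving_pods(30, single_pool, {"m5/az-a"}))   # 30
print(surviving_pods(30, diversified, {"m5/az-a"}))   # 86
```

Diversification does not help if every pool is reclaimed at once, which is the case the on-demand baseline exists to cover.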

Reproducing the simulator with cold-start sweep

# Reproduce on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install simpy
python3 autoscale_sim.py

# Sweep cold-start times to see the impact
for t in 5 30 90 180; do
  COLD=$t python3 -c "
import os, autoscale_sim
autoscale_sim.COLD_START_S = int(os.environ['COLD'])
autoscale_sim.run(predictive=False)"
done

With a 5 s cold-start (Go service) the metric-based path tracks the ramp far more closely, although the 15 s sync period still leaves a decision lag at the start of the spike; with a 180 s cold-start (Java + Spring Boot + a slow @PostConstruct) it cannot recover before the spike has passed. The single highest-leverage autoscaling improvement most teams can make is shrinking cold-start; predictive scaling is the second-best improvement only because it is harder to deploy.

When autoscaling makes things worse — feedback loops with downstream

Scaling up a service that calls a downstream creates more load on the downstream. A payment-API service that scales from 50 to 200 pods because of an upstream traffic spike now hits the database with 4× the connection count, possibly exhausting the database's connection pool and creating a downstream outage where there was only an upstream spike. The defence is scaling-aware connection-pool sizing — each pod gets a small connection budget (5–10), the database can handle 200 pods × 5 connections = 1000 connections — and explicit downstream coordination (the database has its own autoscaler, or a read replica fleet that absorbs the read traffic, or a connection-pool middleware like PgBouncer that multiplexes pod connections onto a fixed downstream pool). Autoscaling without downstream coordination just moves the bottleneck; the system is no faster, just differently broken.
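The connection budget is worth encoding as a guard on the autoscaler's maximum. A minimal sketch using the numbers above (1000 database connections, 5 per pod); the reserved-connections parameter and function names are assumptions.

```python
def max_pods_for_db(db_max_connections: int, conns_per_pod: int, reserved: int = 0) -> int:
    """Largest pod count whose combined pools fit under the database connection limit."""
    return (db_max_connections - reserved) // conns_per_pod

def scale_is_safe(target_pods: int, conns_per_pod: int, db_max_connections: int) -> bool:
    """True when the scaled-out fleet cannot exceed the database connection limit."""
    return target_pods * conns_per_pod <= db_max_connections

print(max_pods_for_db(1000, 5))       # 200
print(scale_is_safe(200, 5, 1000))    # True
print(scale_is_safe(200, 20, 1000))   # False: 4000 connections against a limit of 1000
```

Setting the HPA's maximum replica count from this calculation keeps an upstream spike from turning into a downstream outage.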

Where this leads next

Autoscaling is the capacity-side response to demand variance; load shedding (/wiki/load-shedding-strategies) is the demand-side response. They are complements, and the best-engineered systems use both with explicit awareness of which gap each one fills.

The closing rule: autoscaling is a control system with a 60–240 second response time, and control theory tells you it cannot compensate for a 10-second disturbance. For known disturbances (Tatkal, IPL match starts, payday), use predictive pre-warm. For unknown disturbances within your reaction window, use load shedding. For drift over hours, use metric-based reactive scaling. The composition — predictive plus reactive plus shedding — is what makes a service that survives Mega Bargain Days look effortless from the outside; the absence of any layer is what makes the next-day blameless postmortem necessary.

References