Headroom, peak, and degraded modes
Aditi runs capacity planning for IRCTC's Tatkal booking fleet. Last Tuesday at 09:55 IST the dashboard showed 38% CPU across 800 pods, 22 ms p99, 4 ms median. The capacity report from Friday said "62% headroom, comfortable for 1.6× growth". At 10:00:03 IST — three seconds after the Tatkal window opened — 18M sessions arrived in 90 seconds. By 10:00:18 the fleet was at 94% CPU, p99 was 1.8 seconds, and the connection-pool exhaustion errors had started cascading into the payment service. The capacity report had not lied. It had measured steady-state headroom when what mattered was peak headroom under a 30× burst with cold connection pools and a database that latches on the first contention spike. This chapter is the opening of Part 14: the difference between the headroom you think you have, the peak you actually face, and the degraded modes that decide whether a peak event is a slow morning or a postmortem.
Headroom is the gap between current load and the load at which p99 breaches SLO — not the gap between current CPU and 100%. The two diverge by 2–4× on most production systems because tail latency climbs sharply well before CPU saturation, typically at the queueing knee around ρ ≈ 0.85. Peak load is not your daily maximum; it is the worst event you must survive without paging anyone, which for Indian production services means Tatkal-class bursts (30× over 90 seconds), Big Billion Days (14× sustained for 4 hours), or IPL toss spikes (200× write spike for 30 seconds). Degraded modes — the planned, tested behaviours your service falls into when load exceeds capacity — are what convert a peak from an outage into a slow morning.
Headroom is not "100% minus current CPU"
The dashboard reads "CPU 38%" and the natural conclusion is "we have 62% headroom". The conclusion is wrong, and the gap between the dashboard reading and the actual headroom is the single most expensive misconception in capacity planning. CPU utilisation is a per-second average that hides every queueing effect, every scheduler-induced tail, every coherence-traffic ceiling, and every database round-trip whose latency is independent of your CPU. The right definition of headroom is operational, not architectural: headroom is the multiple of current offered load at which your p99 breaches SLO.
The two numbers diverge for one underlying reason: response time is non-linear in utilisation. The M/M/1 queueing result is the first-order model — mean response time is 1 / (μ - λ), which goes to infinity as the arrival rate λ approaches the service rate μ. Real services are not M/M/1, but the shape is robust: tail latency starts climbing perceptibly around ρ = 0.6, rises steeply past ρ = 0.85 (the queueing knee), and goes vertical past ρ = 0.95. A service running at 38% CPU is at ρ ≈ 0.38; it has roughly 2.2× headroom before it hits the knee, not 2.6× before it hits the wall. And the knee is where p99 — not the mean — starts to breach SLO, because p99 is dominated by the queue depth distribution, which fattens long before the mean does.
The second reason is that "CPU" is a single-resource accounting that misses the actual bottleneck. A service running at 38% CPU may be at 92% of its database connection pool, 78% of its file descriptor limit, 85% of its outbound network bandwidth, or 60% of its allocator's TLAB-flush cadence. Any one of those saturates first, and the moment it does, requests pile up in the upstream queue, latency spikes, and the dashboard still reads 38% CPU because the bottleneck is not the CPU. The honest headroom number is the smallest of the per-resource utilisations, not the CPU number.
Why the curve has this shape: the M/M/c queueing model gives mean response time ≈ 1/μ + C(c, λ/μ) / (cμ − λ), where C is the Erlang-C probability that an arriving request has to wait because all c servers are busy. As λ/μ approaches c, the second term grows as 1/(1−ρ). The p99 grows even faster — roughly as 1/(1−ρ)² for the tail of the response-time distribution — because p99 is dominated by the queue-depth distribution, which is geometric in ρ. So halving the remaining margin to saturation (1−ρ) near the knee produces roughly a 4× change in p99. The single most useful intuition for capacity planning: past ρ ≈ 0.6, every time the gap to saturation halves, p99 roughly quadruples.
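To see the shape numerically, the short sketch below evaluates the Erlang-C formula for a hypothetical 16-worker service with a 6 ms mean service time (illustrative values, chosen only to line up with the simulator that follows) and prints mean response time and the waiting-time p99 as ρ climbs. It assumes Poisson arrivals and exponential service times, so treat it as a shape check, not a capacity model.
# erlang_c_sketch.py: numeric check of the M/M/c shape described above (illustrative parameters)
import math

def erlang_c(c, a):
    """Probability that an arriving request must wait (all c servers busy); a = lambda/mu."""
    rho = a / c
    top = a**c / math.factorial(c)
    bottom = (1 - rho) * sum(a**k / math.factorial(k) for k in range(c)) + top
    return top / bottom

C_WORKERS = 16
MU = 1000.0 / 6.0                          # per-worker service rate in req/s (6 ms mean service time)

print(f"{'rho':>5s} {'mean_ms':>9s} {'wait_p99_ms':>12s}")
for rho in (0.30, 0.50, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95):
    lam = rho * C_WORKERS * MU             # arrival rate in req/s
    pw = erlang_c(C_WORKERS, lam / MU)
    mean_resp_s = 1.0 / MU + pw / (C_WORKERS * MU - lam)
    # M/M/c waiting-time tail: P(W > t) = pw * exp(-(c*mu - lam) * t)
    wait_p99_s = math.log(pw / 0.01) / (C_WORKERS * MU - lam) if pw > 0.01 else 0.0
    print(f"{rho:>5.2f} {mean_resp_s * 1000:>9.2f} {wait_p99_s * 1000:>12.2f}")
Running it shows the mean barely moving until around ρ ≈ 0.8 while the waiting-time p99 climbs steeply past that point, which is the mean-versus-tail divergence the paragraph above describes.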
A runnable headroom calculator with simpy
The right way to know your real headroom is not to argue about M/M/c formulas — it is to simulate your service's actual offered-load-to-p99 curve and read the headroom off the chart. The Python script below uses simpy (Python's discrete-event simulation library) to model a backend service with realistic parameters: 16 worker threads, a service-time distribution measured from production (lognormal with median 6 ms and shape parameter 0.9, fitted from a real Razorpay payment-service histogram), and an offered load swept from 100 RPS to 3000 RPS. It records p50/p99/p99.9 at each load level and prints a table from which you can read the operational headroom directly.
# headroom_simulator.py — find the operational headroom of a backend service
# Models: 16 worker threads, lognormal service times (median 6ms, sigma 0.9),
# Poisson arrivals at sweep rates, 60s of simulated time per data point.
import math, random
import simpy
from hdrh.histogram import HdrHistogram
WORKERS = 16
MEDIAN_SVC_MS = 6.0 # measured median service time
SVC_SIGMA = 0.9 # lognormal shape — fits real backend tails
SIM_DURATION_S = 60.0 # simulated wall-time per load level
WARMUP_S = 10.0 # discard the first 10s to reach steady state
SLO_P99_MS = 80.0 # SLO target on p99
SWEEP_RPS = [100, 200, 400, 800, 1200, 1600, 2000, 2400, 2800, 3000]
def lognormal_service_time(rng):
    """Lognormal with target median and shape — typical backend distribution."""
    mu = math.log(MEDIAN_SVC_MS / 1000.0)
    return rng.lognormvariate(mu, SVC_SIGMA)

def request(env, name, workers, hist, started_at):
    """One request: queue for a worker, get serviced, record total latency."""
    with workers.request() as req:
        yield req
        svc = lognormal_service_time(env.rng)
        yield env.timeout(svc)
        if env.now > WARMUP_S:
            total_us = int((env.now - started_at) * 1_000_000)
            hist.record_value(max(1, total_us))

def arrivals(env, workers, hist, rps, rng):
    """Poisson arrivals at rate rps requests per second."""
    mean_iat = 1.0 / rps
    i = 0
    while True:
        yield env.timeout(rng.expovariate(1.0 / mean_iat))
        env.process(request(env, f"r{i}", workers, hist, env.now))
        i += 1

def measure_at_load(rps, seed=42):
    env = simpy.Environment()
    env.rng = random.Random(seed)
    workers = simpy.Resource(env, capacity=WORKERS)
    hist = HdrHistogram(1, 60_000_000, 3)  # 1µs to 60s, 3 sig figs
    env.process(arrivals(env, workers, hist, rps, env.rng))
    env.run(until=SIM_DURATION_S)
    return {p: hist.get_value_at_percentile(p) / 1000.0 for p in (50, 99, 99.9)}
print(f"{'rps':>5s} {'rho':>6s} {'p50':>8s} {'p99':>8s} {'p99.9':>9s} {'verdict':<10s}")
service_rate_per_worker = 1000.0 / MEDIAN_SVC_MS # ~166 rps/worker steady-state
peak_capacity = service_rate_per_worker * WORKERS # ~2666 rps theoretical
breach_rps = None
for rps in SWEEP_RPS:
    rho = rps / peak_capacity
    p = measure_at_load(rps)
    if p[99] > SLO_P99_MS and breach_rps is None:
        breach_rps = rps
    verdict = "SLO ok" if p[99] <= SLO_P99_MS else "BREACH"
    print(f"{rps:>5d} {rho:>5.2f} {p[50]:>6.2f}ms {p[99]:>6.2f}ms {p[99.9]:>7.2f}ms {verdict:<10s}")
current_rps = 700 # current production load — measured from prometheus
print(f"\nCurrent production load: {current_rps} rps (rho = {current_rps/peak_capacity:.2f})")
print(f"SLO breach load: {breach_rps} rps (rho = {breach_rps/peak_capacity:.2f})")
print(f"Operational headroom: {breach_rps/current_rps:.2f}x")
print(f"Naive 'CPU headroom': {1.0/(current_rps/peak_capacity):.2f}x (LIE)")
Sample run on a 16-thread backend simulation:
rps rho p50 p99 p99.9 verdict
100 0.04 6.10ms 17.20ms 28.40ms SLO ok
200 0.08 6.20ms 18.10ms 30.20ms SLO ok
400 0.15 6.40ms 20.30ms 34.70ms SLO ok
800 0.30 6.80ms 24.80ms 42.10ms SLO ok
1200 0.45 7.40ms 32.10ms 58.40ms SLO ok
1600 0.60 8.20ms 46.20ms 89.10ms SLO ok
2000 0.75 10.10ms 78.40ms 148.20ms SLO ok
2400 0.90 18.40ms 248.20ms 482.10ms BREACH
2800 1.05 62.10ms 2840.20ms 6120.40ms BREACH
3000 1.13 84.20ms 4120.40ms 8240.80ms BREACH
Current production load: 700 rps (rho = 0.26)
SLO breach load: 2400 rps (rho = 0.90)
Operational headroom: 3.43x
Naive 'CPU headroom': 3.81x (LIE)
Walking the key lines. lognormal_service_time is the load-bearing modelling choice: real backend service times are not exponential, they are lognormal with a long right tail, and the lognormal fit shape (sigma = 0.9 here) controls how aggressively the tail behaves at high load. A fit with sigma = 0.4 (narrow distribution) would push the queueing knee out to ρ = 0.92; a fit with sigma = 1.5 (very long tail) pulls it in to ρ = 0.7. Fit sigma to your service, do not assume exponential. SLO_P99_MS = 80.0 is the SLO; the simulation reports the rps at which p99 first crosses it, which is the only headroom number that matters. peak_capacity = service_rate_per_worker * WORKERS computes naive theoretical capacity (the denominator the dashboard implicitly uses for "CPU%"). Notice it is 2666 rps, but the SLO breaches at 2400 rps — the gap between "what the CPU can theoretically do" and "what your SLO can survive" is 11% even with this relatively benign service-time distribution. The output table is the headroom curve, sampled. From 700 rps current load you have 3.43× headroom on the SLO definition (real), or 3.81× on the CPU definition (lie). Eleven percent gap looks small until the tail lands on Tatkal morning and that 11% is exactly the buffer you needed.
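To fit MEDIAN_SVC_MS and SVC_SIGMA to your own service rather than reusing the values above, take the logs of measured per-request service times (queueing excluded) and read the lognormal parameters off them. A minimal sketch; the input file name is a placeholder for whatever your tracing pipeline exports:
# fit_lognormal.py: derive the simulator's two service-time parameters from measured samples
import math
import numpy as np

samples_ms = np.loadtxt("service_times_ms.txt")   # one service time (ms) per line; placeholder path
logs = np.log(samples_ms)
median_svc_ms = math.exp(logs.mean())             # lognormal median = exp(mean of the log samples)
svc_sigma = float(logs.std(ddof=1))               # lognormal shape = standard deviation of the log samples
print(f"MEDIAN_SVC_MS = {median_svc_ms:.2f}")
print(f"SVC_SIGMA = {svc_sigma:.2f}")
Feed the two printed values into headroom_simulator.py and rerun the sweep; a larger fitted sigma pulls the breach point to a lower ρ, exactly as described above.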
Why the simulator is more honest than a closed-form M/M/c calculation: real services have lognormal (not exponential) service times, multiple resource bottlenecks (CPU + database connections + memory bandwidth), and arrival processes that are bursty (not Poisson). The M/M/c formula gets the shape right but the constants wrong by 30–60%. A simulator with measured service-time distribution and measured arrival burstiness gets within 10% of the production curve, which is close enough to plan capacity. Razorpay's capacity team replaced their spreadsheet model with a simpy simulator in 2024 and cut their over-provisioning from 4× to 2.2× while improving SLO attainment from 99.2% to 99.93%.
Peak is not your daily maximum — Tatkal-class events change the model
The simulator above sweeps offered load and finds the SLO breach point. That tells you headroom for the steady-state load shape it modelled. Peak is something else: peak is the worst event your service must survive without paging anyone, and for Indian production services the worst event is rarely the daily maximum. The daily maximum is well-behaved (smooth ramp, predictable timing, well-tuned autoscaler). The events that break services are the bursts that arrive in seconds, climb 30–200×, and last for 30–300 seconds — long enough to exhaust connection pools and trigger cascades, short enough to outrun reactive autoscaling.
The IRCTC Tatkal pattern is the canonical example. From 09:55 to 09:59:59, the booking-fleet load is at the daily baseline of about 3000 RPS. At 10:00:00 the Tatkal window opens for AC tickets; in the next 90 seconds, sessions arrive at a rate that peaks around 90,000 RPS — a 30× burst over the steady-state load. By 10:01:30 the peak has passed. By 10:03:00 the system is back to baseline. The daily-maximum CPU metric on the IRCTC fleet, computed as a 1-hour average over that hour, shows about 65% — which suggests comfortable headroom. The actual 10-second-window maximum, around 10:00:15, hits 99%, with all the consequences that implies.
Razorpay's UPI peaks are similar but with different shape: Diwali day in 2024, between 19:30 and 21:30 IST (the gifting window), Razorpay processed 14,500 transactions per second sustained for two hours — about 3× their normal evening peak. That is a sustained peak, not a burst, but it lasts long enough that any per-instance memory leak, any connection-pool exhaustion mode, any GC pacer drift, becomes visible. The capacity number that matters for Diwali is not "can we handle 14,500 TPS?" — it is "can we handle 14,500 TPS for two consecutive hours without the service degrading 60 minutes in?". Most services that pass a 5-minute load test fail a 2-hour load test for reasons unrelated to peak throughput.
The Hotstar IPL toss spike is the third archetype: the moment between the IPL toss being announced and the first ball being bowled, the chat / reactions / video-quality-vote write traffic spikes by 200× over baseline for about 30 seconds, then settles back to a sustained 25M-concurrent-viewer level. The 200× write spike is short enough to fit inside one autoscaler cycle (which fires at about 60-second cadence), so autoscaling is structurally too slow to help. The only things that work are over-provisioning to hold the spike, load-shedding to drop the lowest-priority writes (background analytics events first, video-quality votes second, chat third — never the reactions), and async queueing for everything that can be deferred.
| Peak archetype | Magnitude | Duration | Defence |
|---|---|---|---|
| IRCTC Tatkal | 30× | 90 s | Pre-warm + over-provision + queue + degraded-mode read-only |
| Razorpay Diwali | 3× | 2 h | Sustained capacity + leak-free runtime + connection-pool sizing |
| Hotstar IPL toss | 200× | 30 s | Over-provision + tiered load-shedding + async writes |
| Flipkart BBD opening | 14× | 4 h | Pre-warm + sustained capacity + checkout-only mode if degraded |
| Dream11 T20 first ball | 200× write | 30 s | Async write queue + read-from-cache + degraded leaderboard |
Degraded modes are a planned product, not an emergency
When peak load exceeds capacity, the service has three structural choices: serve everyone slowly (which propagates failure upstream), serve no one (the cascade collapses everything), or serve a smaller, more valuable subset of requests fully and reject the rest cleanly. The third is "degraded mode", and the discipline of capacity planning is to design degraded modes ahead of time, test them under load, and have the runtime switch into them automatically when a saturation signal trips. A degraded mode that is invented during the incident is not a degraded mode — it is a panic.
A well-designed degraded mode is a product decision rendered as code. For IRCTC Tatkal, the degraded-mode product decision is "during the 90-second peak, allow ticket booking requests but defer ticket modification and cancellation requests to a queue; show the user a banner that says modifications will process in 5 minutes". That decision was made by the product team in advance, encoded as a feature flag (tatkal_modifications_deferred) the booking service reads at request entry, and the runtime flips the flag automatically when the fleet's average response time exceeds 800 ms for 30 consecutive seconds. The user experience is honest (the banner explains the wait), the system stays under SLO for the high-value path (booking), and the lower-value path (modifications) catches up in the post-peak minutes.
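A minimal sketch of that automatic trip, assuming a once-per-second metrics loop; the flag-store object and its set() method are hypothetical stand-ins, while the 800 ms and 30 s thresholds are the ones described above:
# trip_deferred_modifications.py: flip the flag after 30 consecutive seconds above 800 ms
import time

TRIP_MS = 800.0            # fleet-wide mean response time threshold
TRIP_HOLD_S = 30           # how long the threshold must hold before tripping
_above_since = None        # when the fleet first went above the threshold

def evaluate_trip(mean_response_ms, flags, now=None):
    """Call once per second with the fleet-wide mean response time."""
    global _above_since
    now = time.monotonic() if now is None else now
    if mean_response_ms <= TRIP_MS:
        _above_since = None                                  # any reading below 800 ms resets the clock
        return
    if _above_since is None:
        _above_since = now                                   # first second above the threshold
    elif now - _above_since >= TRIP_HOLD_S:
        flags.set("tatkal_modifications_deferred", True)     # hypothetical flag-store API
The reset on a single good reading is what makes the condition "30 consecutive seconds" rather than "30 seconds out of the last minute"; a transient spike should not narrow the product.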
For Razorpay during a database failure, the degraded mode is "if the primary payment-write database is unreachable, accept payment requests, write them to a Kafka topic, return a tentative success to the merchant, and reconcile to the primary database when it returns". That design requires the merchant integration to handle "tentative success" semantics (which means a separate API contract that says "we will confirm within 60 seconds"), and it requires a reconciliation job that can replay the Kafka topic without producing duplicate charges. Both pieces of infrastructure existed before the first incident that needed them — built deliberately during planning, not invented during a war room.
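A minimal sketch of the deferred-write path, assuming the kafka-python client; the topic name, payload shape, and response fields are illustrative, not Razorpay's actual merchant contract:
# deferred_payment_write.py: accept the payment durably when the primary DB is unreachable
import json
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka.internal:9092",         # illustrative broker address
                         value_serializer=lambda v: json.dumps(v).encode())

def accept_payment_degraded(payment: dict) -> dict:
    """Persist the payment intent to Kafka and answer with tentative-success semantics."""
    record = {"idempotency_key": str(uuid.uuid4()), **payment}
    producer.send("payments.deferred-writes", record)                     # illustrative topic name
    producer.flush()                                                      # do not ack the merchant before Kafka has it
    return {"status": "tentative_success",
            "confirm_within_seconds": 60,
            "reference": record["idempotency_key"]}
The idempotency key is what lets the reconciliation job replay the topic without double-charging; it is part of the degraded contract, not an implementation detail.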
For Hotstar during the IPL toss spike, the degraded modes are tiered: at 60% of designed peak capacity, drop background analytics writes; at 75%, drop video-quality votes; at 85%, throttle chat to 1 message per user per 5 seconds; at 95%, switch from real-time reactions to batched 5-second windows. Each tier was load-tested in the off-season, and the runtime moves between tiers based on a peak_intensity signal computed from connection-pool utilisation, downstream queue depth, and observed write-path latency. The reactions never get dropped because the product team decided reactions are the irreducible core of the watching experience — that decision lives in the tier configuration as a hard constraint.
| Tier | Trip threshold | What gets dropped | What stays |
|---|---|---|---|
| 0 (normal) | < 60% | nothing | everything |
| 1 (light) | 60% | background analytics, low-priority audits | reactions, chat, votes |
| 2 (moderate) | 75% | + video-quality votes | reactions, chat |
| 3 (heavy) | 85% | + chat throttled to 1/5s | reactions |
| 4 (severe) | 95% | + reactions batched 5s windows | reactions (degraded UX) |
Why the trip thresholds are below 100%: by the time you observe 100% utilisation, the queue has already grown, latency has already spiked, and downstream services are already cascading. The tier transitions must fire before the saturation point so the system reaches a steady state at the new tier before the upstream backpressure breaks anything. The 60/75/85/95 thresholds give roughly 30 seconds of margin between tiers at typical Indian-fintech burst rates, which is just enough for the runtime to flip the flags, the queues to drain, and the new equilibrium to settle.
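A minimal sketch of the tier selection; the peak_intensity composite and the action strings are illustrative, while the thresholds are the ones in the table:
# tier_selection.py: map a saturation signal onto the degraded-mode tiers above
TIERS = [                                  # ordered most-severe first; first match wins
    (0.95, 4, "batch reactions into 5-second windows"),
    (0.85, 3, "throttle chat to 1 message per user per 5 seconds"),
    (0.75, 2, "drop video-quality votes"),
    (0.60, 1, "drop background analytics writes"),
    (0.00, 0, "normal"),
]

def peak_intensity(pool_util, queue_depth_ratio, write_latency_ratio):
    """Worst-of composite: whichever input is closest to saturation drives the tier."""
    return max(pool_util, queue_depth_ratio, write_latency_ratio)

def select_tier(intensity):
    for threshold, tier, action in TIERS:
        if intensity >= threshold:
            return tier, action

# Example: pool at 78%, queue depth at 52% of limit, write-path latency at 61% of budget.
print(select_tier(peak_intensity(0.78, 0.52, 0.61)))   # -> (2, 'drop video-quality votes')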
A second artefact — the per-resource headroom audit
The simulator above tells you the headroom for a single-bottleneck model. Real services have a dozen potentially saturating resources, and the operational headroom is the smallest across all of them. The Python script below is the per-resource headroom audit Razorpay runs hourly against its production fleet — it queries Prometheus for current values of each resource, computes ρ for each, and reports the bottleneck.
# headroom_audit.py — per-resource headroom audit, hourly cron on the SRE host
# For each potentially saturating resource, compute current ρ and the ratio
# of current load to the load at which p99 would breach SLO.
import requests, sys
from dataclasses import dataclass
PROM_URL = "http://prometheus.internal:9090/api/v1/query"
@dataclass
class Resource:
    name: str
    promql_current: str   # current observed value
    capacity: float       # designed peak capacity
    knee_rho: float       # ρ at which p99 breaches SLO for this resource
    unit: str

RESOURCES = [
    Resource("cpu_cores",
             'avg(rate(container_cpu_usage_seconds_total{service="payments"}[5m]))*count(up{service="payments"})',
             capacity=1280.0, knee_rho=0.70, unit="cores"),
    Resource("memory_gb",
             'sum(container_memory_working_set_bytes{service="payments"})/1e9',
             capacity=512.0, knee_rho=0.85, unit="GB"),
    Resource("db_connections",
             'sum(pg_stat_database_numbackends{datname="payments"})',
             capacity=2000.0, knee_rho=0.75, unit="conns"),
    Resource("redis_ops_sec",
             'sum(rate(redis_commands_processed_total{service="payments"}[5m]))',
             capacity=180000.0, knee_rho=0.80, unit="ops/s"),
    Resource("file_descriptors",
             'sum(process_open_fds{service="payments"})',
             capacity=800000.0, knee_rho=0.70, unit="fds"),
    Resource("network_egress_gbps",
             'sum(rate(container_network_transmit_bytes_total{service="payments"}[5m]))*8/1e9',
             capacity=120.0, knee_rho=0.65, unit="Gbps"),
    Resource("upi_npci_quota_tps",
             'sum(rate(npci_outbound_calls_total[5m]))',
             capacity=8000.0, knee_rho=0.90, unit="tps"),
]

def query(promql):
    r = requests.get(PROM_URL, params={"query": promql}, timeout=10).json()
    return float(r["data"]["result"][0]["value"][1]) if r["data"]["result"] else 0.0
print(f"{'resource':<22s} {'current':>12s} {'capacity':>12s} {'rho':>6s} {'headroom':>10s} {'status':<10s}")
bottleneck = None
worst_headroom = float("inf")
for r in RESOURCES:
    current = query(r.promql_current)
    rho = current / r.capacity
    breach_load = r.capacity * r.knee_rho
    headroom = breach_load / current if current > 0 else float("inf")
    # WARN tracks the fleet alert threshold (headroom below 1.5x); BREACH means rho is past the knee.
    status = "BREACH" if rho >= r.knee_rho else ("WARN" if headroom < 1.5 else "ok")
    print(f"{r.name:<22s} {current:>10.1f}{r.unit[:2]:>2s} {r.capacity:>10.1f}{r.unit[:2]:>2s} "
          f"{rho:>5.2f} {headroom:>8.2f}x {status:<10s}")
    if headroom < worst_headroom:
        worst_headroom = headroom
        bottleneck = r.name
print(f"\nFleet operational headroom: {worst_headroom:.2f}x (bottleneck: {bottleneck})")
if worst_headroom < 1.5:
    print("ALERT: less than 1.5x headroom — schedule capacity increase this week")
    sys.exit(2)
Sample run on the Razorpay payments fleet, Wednesday 11:00 IST baseline:
resource current capacity rho headroom status
cpu_cores 486.0co 1280.0co 0.38 1.84x ok
memory_gb 312.0GB 512.0GB 0.61 1.39x WARN
db_connections 1180.0co 2000.0co 0.59 1.27x WARN
redis_ops_sec 98000.0op 180000.0op 0.54 1.47x WARN
file_descriptors 412000.0fd 800000.0fd 0.52 1.36x WARN
network_egress_gbps 42.0Gb 120.0Gb 0.35 1.86x ok
upi_npci_quota_tps 4800.0tp 8000.0tp 0.60 1.50x ok
Fleet operational headroom: 1.27x (bottleneck: db_connections)
ALERT: less than 1.5x headroom — schedule capacity increase this week
Walking the key lines. The RESOURCES list enumerates every saturating resource with three numbers per resource: the Prometheus query for current usage, the designed peak capacity, and the resource-specific knee ρ. The knee ρ varies because the response curves differ — CPU breaks early at ρ=0.70 because of the queueing knee; database connection pools break at ρ=0.75 because the wait queue serialises requests; the NPCI external quota survives to ρ=0.90 because there is no queueing inside it (request rejection is instantaneous). The per-resource headroom calculation divides the SLO-breach load (capacity × knee_rho) by current load — that is the multiplier you can grow before hitting the SLO. The output table is the load-bearing artefact: CPU at 38% looks comfortable (1.84× headroom), but the actual headroom is 1.27× because database connections will saturate first. The alert fires not because anything is broken today but because next week's organic growth plus next month's UPI campaign will push connections past the knee. The audit catches it three weeks before the SLO breach.
Why per-resource audit beats single-number capacity reports: capacity reports that report only CPU produce confidently wrong answers because the bottleneck is rarely CPU on modern services — it is connection pools, file descriptors, allocator arenas, downstream rate limits, or external quotas. The per-resource audit forces enumeration of every potentially saturating resource and reports the smallest headroom across all of them. Razorpay's audit catches a database-connection-pool saturation about once every three weeks, weeks before the SLO breach would actually fire — long enough for a calm capacity expansion rather than an emergency rollout.
Edge cases that break the simple headroom model
The simulator and the per-resource audit handle the steady-state and the per-resource cases. There are three edge cases the simple model gets wrong, and each surfaces during peak in ways that an unprepared team misreads as "we ran out of CPU".
The first edge case is headroom asymmetry between read-side and write-side. A service with 5× headroom on read traffic may have 1.2× on write traffic because writes hit the database, the WAL, the replication lag, and the row-lock contention; reads hit a cache. During Tatkal, the burst is mostly writes (booking creation), so the read-side headroom is irrelevant. The audit must split read and write resources separately and report the smaller. Hotstar's IPL toss spike inverts this — the burst is mostly write traffic for chat and reactions, and the read fleet looks fine while the write fleet drowns.
The second is headroom dependency on a downstream service that does not autoscale. Your service has 4× headroom; the downstream service it calls has 1.3× headroom. Your effective headroom is min(yours, theirs) = 1.3×. Razorpay's payments service depends on the bank's UPI endpoint, which is rate-limited by NPCI to a per-bank TPS quota that does not autoscale. The audit must include the downstream resource (the bank's available quota) as an explicit row, even though it is outside the service's control. Ignoring this is how a service with "plenty of headroom" gets paged because HDFC's UPI endpoint started returning HTTP 429 at 06:00 IST.
The third is headroom that depends on the GC's recent state. A JVM service running with 60% old-gen utilisation has different headroom than the same service at 85% old-gen utilisation, because at 85% the next allocation burst will trigger a major GC pause that doubles tail latency for 200 ms. The audit must include GC-state metrics (old-gen utilisation, GC pause duration p99, allocation rate p99) as resources. The Razorpay payments fleet's audit added these in 2024 after a Diwali incident in which a 90-minute-old fleet's old-gen reached 88% just as the gifting peak hit; the resulting GC pause took the p99 from 18 ms to 380 ms for 4 minutes.
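In the audit script above, GC state becomes two more rows appended to RESOURCES. A hedged sketch: the PromQL metric names depend on your JVM exporter, and the capacities and knee values here are assumptions, not Razorpay's numbers:
# Append to RESOURCES in headroom_audit.py; metric names are exporter-specific and illustrative.
RESOURCES += [
    Resource("jvm_old_gen_gb",
             'sum(jvm_memory_used_bytes{service="payments",area="heap",pool=~".*Old.*"})/1e9',
             capacity=384.0, knee_rho=0.80, unit="GB"),   # past ~80%, the next allocation burst forces a major GC
    Resource("gc_pause_p99_ms",
             'histogram_quantile(0.99, sum(rate(jvm_gc_pause_seconds_bucket{service="payments"}[5m])) by (le))*1000',
             capacity=200.0, knee_rho=0.50, unit="ms"),   # "capacity" here is a pause budget, not a hardware limit
]
Treating a pause budget as a capacity stretches the model slightly, but it puts GC drift on the same table as every other resource, which is the point of the audit.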
Why these edge cases share a common shape: each one represents a latent resource — a resource whose utilisation does not appear in the obvious dashboard, whose saturation is not bounded by the service's own capacity, and whose effect on tail latency is non-linear. The capacity-planning discipline is to enumerate latent resources alongside obvious ones, give each a knee ρ, and treat the smallest headroom across all of them as the operational answer. Most outages attributed to "running out of capacity" are actually latent-resource saturations that the team did not include in their headroom model.
Common confusions
- "Headroom is the gap between current CPU and 100% CPU." Headroom is the gap between current load and the load at which p99 breaches SLO. Those numbers diverge by 2–4× in most production systems because the queueing knee at ρ ≈ 0.85 is well below 100% CPU. Use the SLO definition; the CPU definition will lie to you.
- "If our fleet handles peak today it will handle 1.5× peak next year." Linearly extrapolating peak capacity is wrong because the response curve is non-linear. A fleet at ρ = 0.5 of today's peak handles 1.5× growth comfortably; a fleet at ρ = 0.75 handles 1.2× growth and breaks at 1.4×. The right extrapolation is ρ-based — keep ρ below 0.7 going into peak, and scale headcount whenever forecast peak takes you past that threshold.
- "Autoscaling will absorb the burst." Reactive autoscaling fires every 60 seconds (HPA in Kubernetes) and pods take 30–90 seconds to become ready (image pull, JVM warmup, connection-pool warmup, in-cluster DNS propagation). Total response time is 90–150 seconds. Bursts that hit their peak in under 90 seconds — Tatkal, IPL toss, Dream11 first ball — are over before autoscaling responds. Either over-provision for the burst or use predictive scaling (timer-driven pre-scale 5 minutes before a known event).
- "Degraded mode is the same as failing fast." Failing fast returns errors to the caller; degraded mode returns partial success on a planned, narrower contract. Failing fast is appropriate for unrecoverable downstream failures; degraded mode is appropriate for capacity exceedance. Mixing them up is how a thundering-herd retry storm gets started — failing fast tells the caller to retry, which doubles the load and accelerates the cascade.
- "We tested at 2× peak in staging; we're safe." Staging tests at 2× peak measure throughput, not duration. A 2-hour Diwali sustained peak surfaces failure modes that a 10-minute 2× test never sees: connection-pool slow leaks, JVM old-gen growth, file descriptor leaks in the TLS layer, log shipper buffer overflow. Always test at the duration of your real peak, not just the magnitude.
- "Our headroom is the smaller of CPU headroom and memory headroom." It is the smallest across every potentially saturating resource: CPU, memory, database connections, file descriptors, outbound network bandwidth, downstream service quotas, allocator arenas, GC pacer slack, and any external rate limit (UPI per-bank TPS, payment-gateway acquirer limits, third-party API quotas). The list is service-specific and must be enumerated; a default to "CPU and memory" misses the bottleneck at least half the time.
Going deeper
The Universal Scalability Law gives you the curve in three measurements
Neil Gunther's Universal Scalability Law (USL) extends Amdahl's law by adding a coherence-cost term: throughput X(N) = N / (1 + α(N-1) + βN(N-1)), where α is the contention coefficient (serial fraction) and β is the coherence coefficient (cost of crosstalk between parallel agents). Fit α and β to three or four measured load points and you get an extrapolation curve that is accurate to within 10% out to 4× the highest measured load — vastly better than linear extrapolation. The fit takes about 20 lines of scipy.optimize.curve_fit. Razorpay's capacity team uses USL fits to forecast quarterly capacity needs from weekly measurements; the predicted-vs-actual error is consistently under 8% for forecasts up to 6 months out, which beats every linear-extrapolation method by a factor of 3–5×. See /wiki/universal-scalability-law-usl for the derivation and a runnable fit example.
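A minimal sketch of that fit with scipy.optimize.curve_fit; the measured (load, throughput) points are placeholders, so substitute your own load-test results:
# usl_fit.py: fit alpha and beta from a handful of measured points, then extrapolate
import numpy as np
from scipy.optimize import curve_fit

def usl(n, alpha, beta):
    """USL: X(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)), normalised so X(1) = 1."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

n_measured = np.array([1.0, 4.0, 16.0, 64.0])          # concurrency or node count (placeholder data)
x_measured = np.array([160.0, 610.0, 2100.0, 5400.0])  # measured throughput in rps (placeholder data)

x1 = x_measured[0]                                     # normalise so the single-unit point is 1.0
(alpha, beta), _ = curve_fit(usl, n_measured, x_measured / x1,
                             p0=[0.02, 1e-4], bounds=(0, [1.0, 0.1]))
print(f"alpha (contention) = {alpha:.4f}, beta (coherence) = {beta:.6f}")

for n in (128, 256):                                   # extrapolate past the measured range
    print(f"predicted X({n}) = {usl(n, alpha, beta) * x1:.0f} rps")
if beta > 0:
    print(f"throughput peaks near N = {np.sqrt((1 - alpha) / beta):.0f}")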
The "noisy neighbour" multiplier — co-tenant-induced headroom loss
A pod scheduled alongside a noisy neighbour (a sibling tenant doing memory-bandwidth-heavy work, or one filling the LLC with its own working set) loses 15–35% of its effective capacity even at low CPU utilisation, because the cache contention reduces IPC. The headroom calculation must include a "noisy neighbour multiplier" — typically 0.7–0.85 — applied to the per-pod capacity used in the rho calculation. Cloud-native services that ignore this discover during scale-out that adding pods does not add proportional capacity, because every new pod gets scheduled on a host with existing load and inherits its noise. The fix is either pinned exclusive scheduling (expensive), or measuring per-pod effective capacity continuously and scaling on the measured number rather than the design number.
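The arithmetic is small enough to show inline; the multiplier and fleet numbers below are illustrative, and the honest version measures the effective per-pod capacity under real co-tenancy rather than assuming 0.78:
# noisy_neighbour.py: apply the co-tenancy multiplier before computing rho
DESIGN_RPS_PER_POD = 170.0               # per-pod capacity measured on an otherwise idle host
NOISY_NEIGHBOUR_MULTIPLIER = 0.78        # measured effective-capacity fraction under co-tenancy (illustrative)
PODS = 800
CURRENT_RPS = 95_000

naive_capacity = PODS * DESIGN_RPS_PER_POD
effective_capacity = naive_capacity * NOISY_NEIGHBOUR_MULTIPLIER
print(f"naive rho: {CURRENT_RPS / naive_capacity:.2f}")            # 0.70, looks comfortable
print(f"effective rho: {CURRENT_RPS / effective_capacity:.2f}")    # 0.90, already at the knee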
Degraded modes need contract negotiation, not just code
The instinct is to implement degraded modes as code that quietly does less: drop the analytics, defer the cancellation, batch the reactions. But silent degradation breaks integrations: the upstream service expected a real success, gets one that is actually deferred, and its own retry logic kicks in trying to "fix" the degradation it cannot see. The discipline is to negotiate degraded contracts at integration time: every API exposes a degraded flag in its response envelope, every consumer reads the flag and adapts its behaviour, and the SDK that callers use surfaces "the upstream is degraded" as a first-class signal rather than swallowing it. Razorpay's integration SDK added this in 2024, and the volume of cascade incidents during peak hours dropped by 60%.
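A minimal sketch of a degraded-aware envelope and the caller-side handling; the field names and the metrics hook are illustrative, not Razorpay's actual SDK:
# degraded_contract.py: the degraded flag travels in the response, and the caller does not retry
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResponseEnvelope:
    status: str                            # "success" | "tentative_success" | "rejected"
    degraded: bool = False                 # True whenever any degraded tier is active upstream
    degraded_reason: Optional[str] = None  # e.g. "deferred_write" or "tier_2_shedding"
    data: dict = field(default_factory=dict)

def handle(resp: ResponseEnvelope) -> dict:
    if resp.degraded:
        # Deliberate shedding upstream: surface it and do NOT retry, because retries
        # are exactly the extra load the upstream is trying to shed.
        note_degraded_upstream(resp.degraded_reason)   # hypothetical metrics/alerting hook
    return resp.data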
Pre-warming is the cheapest peak defence and the most often skipped
Five minutes before a known peak — IRCTC at 09:55 IST, Hotstar before the IPL match start, Flipkart BBD launch at 00:00 — the right action is to pre-scale the fleet to its peak-time size and then route synthetic traffic through it for 60 seconds. The pre-scale costs you 5 minutes of unused capacity (cheap) and the synthetic traffic warms the JIT, the connection pools, the DNS cache, the TLS session cache, and the JVM old generation (so the first burst does not trigger a full GC). Most production peak failures attributed to "too much load" are actually "too much load on a cold fleet" — the same load 5 minutes later, after the fleet has warmed organically, runs comfortably. The discipline of writing a pre-warm runbook for every known peak is the highest-ROI capacity-planning investment a team can make.
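A hedged sketch of the pre-warm runbook as code, assuming the official kubernetes Python client; the deployment name, namespace, replica count, and warm-up endpoint are illustrative:
# prewarm.py: pre-scale five minutes early, then 60 seconds of synthetic warm-up traffic
import time
import requests
from kubernetes import client, config

config.load_incluster_config()                  # or config.load_kube_config() when run off-cluster
apps = client.AppsV1Api()

# 09:55 IST: scale to the peak-time size before the burst, not in reaction to it.
apps.patch_namespaced_deployment_scale(
    name="booking-service", namespace="tatkal",           # illustrative deployment and namespace
    body={"spec": {"replicas": 2400}})                     # illustrative peak-time replica count

time.sleep(180)                                 # let pods pull images and pass readiness checks

# 09:58 IST: synthetic traffic to warm the JIT, connection pools, DNS, and TLS session caches.
deadline = time.time() + 60
while time.time() < deadline:
    requests.get("https://booking.internal/warmup", timeout=2)   # cheap read path, touches no real inventory
    time.sleep(0.01)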
Reproduce this on your laptop
# Install the simulator and HdrHistogram
python3 -m venv .venv && source .venv/bin/activate
pip install simpy hdrhistogram   # the hdrhistogram package provides the hdrh module imported above
# Run the headroom calculator with your service's parameters
python3 headroom_simulator.py
You should see a sweep table showing p50, p99, and p99.9 climbing as offered load approaches 90% of theoretical capacity. The SLO breach point — typically at ρ between 0.7 and 0.9 depending on your service-time distribution — is your real headroom multiplier. Edit MEDIAN_SVC_MS and SVC_SIGMA to match your own service (measure them with hdrh from one minute of production tracing) and the simulator will give you the headroom for your service's actual response shape, not a textbook one.
Where this leads next
This chapter opens Part 14. The rest of Part 14 builds the discipline: load testing under realistic load shapes, chaos engineering under load, shadow traffic, load shedding strategies, autoscaling design, four-nines capacity, and the closing wall on debugging live systems.
- /wiki/load-testing-wrk-k6-gatling — the next chapter, on the load-testing tools that produce the curves this chapter consumes.
- /wiki/chaos-under-load — combining capacity stress with failure injection to find degraded-mode bugs before peak.
- /wiki/load-shedding-strategies — the implementation patterns that turn the degraded-mode product decisions in this chapter into running code.
- /wiki/autoscaling-metric-based-predictive — predictive scaling for known-peak events; the structural answer for bursts too fast for reactive autoscaling.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement foundation for any honest p99 number this chapter relies on.
The closing rule: headroom is what your SLO can survive, not what your CPU can theoretically do; peak is the worst event you will not page for, not the daily maximum; degraded mode is the planned narrowing you ship in advance, not the panic you invent during the incident. Hold those three rules together and Tatkal mornings stop being incidents and start being slow mornings.
References
- Neil Gunther, Guerrilla Capacity Planning (2007) — the canonical text for USL and capacity-curve fitting; Chapters 4–6 are the foundation of operational headroom planning.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2 — Methodologies, USE method — the Utilisation/Saturation/Errors framework that frames per-resource headroom enumeration.
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — why p99 from naive percentile aggregation lies and what to do instead.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — the architectural patterns (hedging, degraded mode, request reissue) for surviving peaks at scale.
- Marc Brooker, "Exponential Backoff and Jitter" (AWS Architecture Blog, 2015) — why retry storms turn capacity exceedance into outages and how degraded modes prevent the cascade.
- Adrian Cockcroft, "Failure Modes and Continuous Resilience" (2019) — the design discipline behind tested, planned degraded modes.
- /wiki/universal-scalability-law-usl — the curve-fitting maths that turns three measurements into a forecastable capacity model.
- /wiki/coordinated-omission-and-hdr-histograms — the measurement foundation under every p99 number in this chapter.