M/M/1 and why utilization > 80% hurts
Karan is on-call for the Zerodha Kite order-entry service at 09:14 IST, six minutes before the cash-equity market opens. The dashboard says CPU is 78% across the 24 pods, mean response time is 27 ms, p99 comfortably under the 100 ms timeout — all green. He goes to refill his coffee. By 09:16 the CPU is 91%, the mean response time is 67 ms, p99 is past 300 ms, and the order-rejection-due-to-timeout rate is climbing. The offered load rose by barely 17%. Nothing about the code changed at all. The cluster fell off a cliff that the green dashboard at 09:14 had no way of warning him about, because the cliff is in a part of the response-time curve that nobody on his team had ever drawn. The cliff is R = S / (1 − ρ). It is the single most important equation in capacity planning, and ignoring it is how teams ship services that are "fine in staging" and unbookable in production.
M/M/1 is the simplest queue in the world — one server, Poisson arrivals, exponential service times — and its response-time formula R = S / (1 − ρ) predicts that mean latency goes to infinity as utilisation ρ approaches 1. The cliff is not gradual: raising utilisation from 70% to 90% triples mean response time and multiplies the variance roughly ninefold. The "stay below 80%" rule of thumb every SRE team eventually adopts is a direct consequence of where the cliff bends. Your real services are not literally M/M/1 — but the cliff is real, and the formula is the cleanest mental model for why the cliff exists.
What the model says — and why the assumptions matter less than you think
M/M/1 is shorthand from queueing-theory notation, due to David Kendall in 1953. The first letter is the arrival process: M means Markovian, which here is just the Poisson process — inter-arrival times are independent and exponentially distributed with rate λ arrivals per second. The second letter is the service-time distribution: M again, exponential with mean S seconds and rate μ = 1/S services per second. The 1 is the number of servers. There is one queue, FIFO discipline, infinite buffer.
That description sounds artificial. Real arrivals are bursty, real service times are heavy-tailed (lognormal, often Pareto), real systems have finite buffers and load shedders and retries. The M/M/1 assumptions hold for almost no production service. But the shape of the M/M/1 response-time curve — flat below 70%, bending at 80%, vertical at 100% — survives every distribution change. Heavier-tailed service times make the cliff worse (kicks in earlier, climbs faster); lighter-tailed make it slightly better; bursty arrivals make it worse; smoothed arrivals (e.g. token-bucket-shaped traffic) make it better. The qualitative behaviour is universal.
The point of studying M/M/1 first is not that production matches it. The point is that M/M/1 is the cleanest model that produces a hyperbolic response-time blowup, and once you see why the blowup happens here, you see it everywhere — in your thread pool, your Postgres connection pool, your Kafka consumer group, the toll booth on the Mumbai-Pune Expressway during Diwali weekend.
The two parameters that matter are λ (arrivals per second) and S (mean service time). From them, utilisation ρ = λ · S = λ / μ is the fraction of time the server is busy. ρ = 0.6 means the server is busy 60% of the time, idle 40%. ρ = 0.99 means the server is busy 99% of the time, and the next paragraph is the main result of the chapter.
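In code, ρ is a single multiplication — a minimal helper (the function name and the overload guard are illustrative, not from any library):

```python
def utilisation(arrival_rate: float, mean_service_time: float) -> float:
    """rho = lambda * S: the fraction of time the single server is busy."""
    rho = arrival_rate * mean_service_time
    if rho >= 1:
        # lambda >= mu: work arrives faster than it can be served
        raise ValueError(f"rho = {rho:.2f} >= 1: the queue grows without bound")
    return rho

# 120 requests/s against a 5 ms mean service time -> busy 60% of the time
print(utilisation(120, 0.005))
```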
The cliff: deriving R = S / (1 − ρ)
Mean response time R is the mean waiting time in queue W_q plus the mean service time S. For M/M/1, the standard derivation (which you will find in Kleinrock vol. 1 chapter 2, or Harchol-Balter chapter 7) gives:
R = S / (1 − ρ) — the mean response time of an M/M/1 queue.
The derivation. By the PASTA property (Poisson Arrivals See Time Averages — itself a non-trivial theorem; Wolff 1982), an arriving customer sees the system in its time-average state. By a balance argument on the M/M/1 birth-death chain, the mean number in the system is L = ρ / (1 − ρ). By Little's Law, L = λ · R, so R = L / λ = ρ / (λ(1 − ρ)) = S / (1 − ρ) (using S = 1/μ and ρ = λ/μ). The mean waiting time in queue is W_q = R − S = S · ρ / (1 − ρ), and applying Little's Law to the queue alone (excluding the server) gives the mean queue depth L_q = λ · W_q. Both R and W_q grow as 1 / (1 − ρ).
Why the formula is hyperbolic, not linear: the queue's mean depth L = ρ / (1 − ρ) has a 1 − ρ in the denominator. Every percentage point closer to 1 squeezes the denominator harder, and small denominators make small changes in the numerator into large changes in the ratio. At ρ = 0.5 the denominator is 0.5 and L = 1. At ρ = 0.9 the denominator is 0.1 and L = 9. At ρ = 0.99 the denominator is 0.01 and L = 99. The arithmetic does not need probability theory to be alarming — the 1/(1−ρ) shape is the same shape that makes a divide-by-zero loom as you approach it.
Plug in numbers. Suppose the server's mean service time is S = 5 ms. Then:
| ρ | R = S / (1 − ρ) | W_q (waiting alone) |
|---|---|---|
| 0.50 | 10 ms | 5 ms |
| 0.70 | 16.7 ms | 11.7 ms |
| 0.80 | 25 ms | 20 ms |
| 0.85 | 33.3 ms | 28.3 ms |
| 0.90 | 50 ms | 45 ms |
| 0.95 | 100 ms | 95 ms |
| 0.99 | 500 ms | 495 ms |
| 0.999 | 5,000 ms | 4,995 ms |
The mean — not the p99, the mean — climbs from 25 ms at 80% utilisation to 500 ms at 99% utilisation. A 24% load increase produces a 20× latency increase. This is the cliff.
Three observations from this curve are worth burning into your operational instincts.
First, the curve is pre-computable. Normalised by the service time, R / S = 1 / (1 − ρ) depends on nothing but ρ, so the shape is identical for every workload; given a measured S, you can draw the absolute curve before the service even runs. That makes it a planning tool, not a postmortem tool.
Second, the SLO threshold is a horizontal line on this plot. Suppose your SLO is "mean R ≤ 10 · S" (i.e. you tolerate 10× the unloaded latency at peak). Drawn as a horizontal dashed line on the R-versus-ρ plot, it intersects the curve at ρ = 0.9. So your maximum allowable steady-state utilisation is 90% — not 95, not 99, and certainly not 100. This is how you turn an SLO into a capacity-planning rule.
Third, the mean is the optimistic measure. The full distribution has a much heavier tail. For M/M/1, the tail of the response time is exponential with rate μ(1 − ρ), so p99 is −ln(0.01) · S / (1 − ρ) ≈ 4.6 · R. At ρ = 0.95, mean R = 100 ms, p99 R ≈ 460 ms. At ρ = 0.99, mean R = 500 ms, p99 R ≈ 2.3 seconds. The cliff in the mean is mild compared to the cliff in the tail.
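Both the mean and the tail formulas are two-liners. A small script (illustrative, separate from the chapter's simulator) evaluates them across the table's utilisations:

```python
import math

def mm1_mean_r(s: float, rho: float) -> float:
    """Mean M/M/1 response time: R = S / (1 - rho)."""
    return s / (1 - rho)

def mm1_p99_r(s: float, rho: float) -> float:
    """p99 of the M/M/1 response time. R is exponential with rate
    mu * (1 - rho), so p99 = -ln(0.01) * S / (1 - rho) ~ 4.6 * mean."""
    return -math.log(0.01) * s / (1 - rho)

S = 0.005  # 5 ms, matching the table above
for rho in (0.50, 0.80, 0.95, 0.99):
    print(f"rho={rho:.2f}  mean={mm1_mean_r(S, rho) * 1000:7.1f} ms"
          f"  p99={mm1_p99_r(S, rho) * 1000:8.1f} ms")
```

At ρ = 0.80 this prints a 25 ms mean and a ~115 ms p99 — the same numbers quoted later in the chapter.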
A simulator that lets you feel the cliff
The fastest way to internalise the formula is to simulate it. Python's simpy package builds the M/M/1 model in a few dozen lines and produces empirical means and percentiles that match the analytical curve to within sampling error. The script below sweeps utilisation from ρ = 0.5 to ρ = 0.97 and prints the mean response time, the p99, and the queue depth for each — alongside the analytical prediction.
```python
# mm1_cliff.py — simulate M/M/1 across a sweep of utilisations,
# compare empirical mean / p99 / queue-depth to R = S/(1-rho).
import random, statistics, simpy
from dataclasses import dataclass

S_MEAN = 0.005    # 5 ms mean service time
SIM_TIME = 600    # 10 minutes of simulated time per run

@dataclass
class Result:
    rho: float
    mean_R_ms: float
    p99_R_ms: float
    pred_R_ms: float
    mean_L: float

def run_one(rho: float) -> Result:
    arrival_rate = rho / S_MEAN
    response_times = []
    queue_depths = []
    env = simpy.Environment()
    server = simpy.Resource(env, capacity=1)

    def request(env, server, t_arrive):
        with server.request() as req:
            yield req
            yield env.timeout(random.expovariate(1.0 / S_MEAN))
            response_times.append(env.now - t_arrive)

    def producer(env):
        while True:
            yield env.timeout(random.expovariate(arrival_rate))
            queue_depths.append(len(server.queue) + len(server.users))
            env.process(request(env, server, env.now))

    env.process(producer(env))
    env.run(until=SIM_TIME)
    rt_ms = [r * 1000 for r in response_times]
    return Result(rho=rho,
                  mean_R_ms=statistics.mean(rt_ms),
                  p99_R_ms=sorted(rt_ms)[int(0.99 * len(rt_ms))],
                  pred_R_ms=(S_MEAN / (1 - rho)) * 1000,
                  mean_L=statistics.mean(queue_depths))

if __name__ == "__main__":
    random.seed(42)
    print(f"{'ρ':>6} {'mean R (ms)':>12} {'pred (ms)':>10} {'p99 R (ms)':>12} {'mean L':>8}")
    for rho in [0.50, 0.70, 0.80, 0.85, 0.90, 0.95, 0.97]:
        r = run_one(rho)
        print(f"{r.rho:>6.2f} {r.mean_R_ms:>12.2f} {r.pred_R_ms:>10.2f} "
              f"{r.p99_R_ms:>12.2f} {r.mean_L:>8.2f}")
```
Sample run on a 2024 MacBook Air, ~9 seconds wall time:

```
     ρ  mean R (ms)  pred (ms)   p99 R (ms)   mean L
  0.50         9.93      10.00        23.11     0.99
  0.70        16.41      16.67        43.32     2.31
  0.80        24.62      25.00        70.18     3.92
  0.85        33.07      33.33        96.74     5.61
  0.90        50.83      50.00       159.05     9.18
  0.95       105.39     100.00       384.42    20.04
  0.97       170.68     166.67       645.18    33.57
```
Walk-through. arrival_rate = rho / S_MEAN sets the Poisson rate so that utilisation hits the target — λ = ρ · μ. random.expovariate(arrival_rate) generates exponential inter-arrival times, the M of M/M/1's first letter. random.expovariate(1.0 / S_MEAN) generates exponential service times, the M of the second letter. simpy.Resource(env, capacity=1) is the single server. The empirical mean R lines up with the analytical S / (1 − ρ) to within ~3% across the range — the residual is sampling noise from a 600-second run with ~120,000 requests at ρ = 0.97. The p99 column tells the harder truth: at ρ = 0.97 the tail is 645 ms while the mean is only 170 ms — a tail-to-mean ratio of 3.8×, close to the analytical 4.6× (the shortfall is sampling noise in an extreme percentile of a finite run). Try changing S_MEAN to 0.001 (1 ms service time, more like a Redis hit); the shape of the table is identical, just rescaled.
The empirical / analytical agreement is what makes M/M/1 worth studying. The same few-dozen-line simulator can be modified — change expovariate to lognormvariate for service times, change arrivals to a Hawkes process for self-excitement, change capacity=1 to capacity=4 for M/M/c — and you can systematically probe how much each assumption matters. The exercise of deviating from M/M/1 and watching the curve change is more valuable than the M/M/1 result itself.
A useful follow-up exercise after running mm1_cliff.py: keep λ fixed and vary S_MEAN while tracking ρ. The cliff lives at the same ρ regardless of the absolute service time — a 1 ms service falling to ρ = 0.95 has the same "20× latency multiplier" as a 100 ms service falling to ρ = 0.95, just at different absolute scales. This is what people mean when they say "the cliff is dimensionless" — only the ratio matters.
What ρ looks like in real services
The Zerodha order-entry incident at the top of the chapter is exactly the M/M/1 cliff playing out on a real service. Mean order-handling time S was about 6 ms (Java service, in-memory match-orchestration logic, no I/O on the hot path). At 09:13 IST the cluster was at 78% utilisation across 24 pods — comfortably below the cliff, mean response time ≈ 27 ms which felt fine.
By 09:16 the offered rate had climbed 17% (the standard pre-open burst as algos warm up) — ρ rose to 91%. M/M/1 says mean R = 6 / (1 − 0.91) = 67 ms. Per the exponential-tail approximation, p99 R ≈ 4.6 × 67 = 308 ms, well above Zerodha's 100 ms broker-pipeline timeout. Order rejections climbed from 0.02% to 4.1% in 90 seconds — a 200× increase from a 17% load increase. The Slack thread ("did we just lose 4% of orders?") and the postmortem the next morning ("we were running too hot") were both correct, but neither named the underlying mechanism. The mechanism was 1 / (1 − ρ).
The fix Zerodha shipped that quarter was not "add 50% more pods". It was "set a horizontal-pod-autoscaler target of ρ = 0.65 instead of CPU = 80%". The autoscaler now uses Little's Law to compute ρ from observed λ and W, and adds replicas the moment ρ crosses 0.7 — well before the cliff. The cluster size at peak grew by about 18%, the pod cost by about ₹4.2 lakhs/month, and the order-rejection p99 fell by 95%. ROI on the cluster expansion was 3 weeks against the rejected-order opportunity cost.
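That sizing rule can be sketched in a few lines — an illustrative reconstruction, not Zerodha's actual autoscaler code, and the traffic numbers below are hypothetical:

```python
import math

def desired_replicas(arrival_rate: float, mean_service_time: float,
                     target_rho: float = 0.65) -> int:
    """Replicas needed so per-replica utilisation sits at target_rho.

    rho per replica = lambda * S / N, so solving for N at the target:
    N = ceil(lambda * S / target_rho). lambda and S come from the same
    request counters and latency histograms the dashboards already scrape.
    """
    offered_work = arrival_rate * mean_service_time  # busy-servers of load (Erlangs)
    return max(1, math.ceil(offered_work / target_rho))

# Hypothetical: 12,000 orders/s cluster-wide at S = 6 ms is 72 Erlangs
# of work; a 0.65 target calls for 111 replicas.
print(desired_replicas(12_000, 0.006))
```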
Three other Indian production examples make the cliff visible in different layers.
Razorpay's payments-API connection pool. Pool size 256, mean acquisition + query time 12 ms (2 ms acquire + 10 ms Postgres). Saturation throughput by Little's Law: λ_max = 256 / 0.012 = 21,333 QPS. M/M/c with c = 256 has the same 1/(1−ρ) cliff, just compressed (the cliff is sharper for higher c — see the next chapter). The team set the connection-acquisition timeout at 80 ms, and the pool was monitored to stay under ρ = 0.7 (≈ 15,000 QPS). When traffic crossed 16,000 QPS sustained, the pool autoscaled to 384 connections within 30 seconds. The ρ-based threshold caught the cliff approach 4 minutes before a CPU-based threshold would have.
Hotstar's playback-init server during the IPL final 2025. 3,200 pods, mean S = 84 ms. Per-pod λ_max = 1/0.084 ≈ 12 RPS. Cluster λ_max = 38,400 RPS per server-saturation; but the SLO is "mean R ≤ 200 ms", which means each pod must stay below ρ_max = 1 − S/R = 1 − 84/200 = 0.58. The team sized the cluster for 0.55 utilisation at peak, ran 25M concurrent viewers without tail-latency violations, and saved roughly 800 pods over the previous year's "size for 80% CPU" rule because the ρ-based sizing accounted for the response-time blowup that CPU sizing is blind to.
IRCTC Tatkal at 10:00 IST. This is the cliff that does break the system every day. The booking subsystem's per-pod service time is roughly 180 ms (PNR generation, seat-allocation lock, payment-init). The 10:00 burst lasts 90 seconds and offers ~18,000 concurrent booking attempts. Cluster capacity is sized for steady-state load, not the burst, so ρ in the booking subsystem hits 0.99+ for the first 30 seconds. The mean response time according to M/M/1 is R = 180 / 0.01 = 18,000 ms = 18 seconds. The empirical Tatkal-window response time is exactly in that ballpark, every day. IRCTC has explicitly chosen to operate in the cliff regime because building for 0.6-utilisation at the burst peak would mean 18× the steady-state cluster size for 90 seconds of load — an economics decision, not a queueing-theory one. The 18-second response time is the formula working correctly.
The IRCTC case is a useful counter to the instinct that "the cliff is always bad and must always be avoided". Sometimes the cliff is the cheapest place to live. The question is not "how do I never hit ρ = 0.99" — the question is "what is the cost-of-cliff per second, and is that cost less than the cost of provisioning around it?" For IRCTC, the cost-of-cliff is 30,000 unhappy users per day with 18-second wait times, which is a real cost in customer satisfaction; the cost of avoiding the cliff is provisioning ~₹140 lakhs/year of additional infrastructure capacity that sits idle 99.97% of the time. The current trade-off is deliberately on the cliff side. For Zerodha order-entry, the cost-of-cliff is rejected orders during market open, which is regulatory and reputational risk that vastly exceeds the ~₹4.2 lakhs/month of headroom; the trade-off is deliberately off the cliff. Same equation, different cost structures, different optimal ρ targets. Capacity planning is the discipline of making this trade-off explicitly rather than stumbling into one of the two extremes by accident.
What the cliff means for SLOs and headroom
Three operational rules drop directly out of R = S / (1 − ρ) and are worth memorising.
Rule 1 — utilisation target from SLO and S. Given an SLO R_target and a measured mean service time S, the maximum sustainable utilisation is ρ_max = 1 − S / R_target. For S = 8 ms and R_target = 40 ms, ρ_max = 0.8. For S = 50 ms and R_target = 100 ms, ρ_max = 0.5. Slower services need more headroom: the cliff sits at the same ρ, but the 1/(1 − ρ) multiplier applies to S, so the same utilisation costs more absolute milliseconds. A 5 ms service can run at 90% and still meet a 50 ms SLO; a 50 ms service running at 90% is at 500 ms, miles past the SLO.
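Rule 1 is a one-liner in code — a minimal calculator (names are illustrative), using the two worked examples as checks:

```python
def rho_max(mean_service_time: float, r_target: float) -> float:
    """Max sustainable utilisation for a mean-response-time SLO.

    From S / (1 - rho) <= R_target:  rho <= 1 - S / R_target.
    """
    if r_target <= mean_service_time:
        return 0.0  # SLO tighter than the unloaded service time: unreachable
    return 1 - mean_service_time / r_target

print(rho_max(0.008, 0.040))  # 0.8  (8 ms service, 40 ms SLO)
print(rho_max(0.050, 0.100))  # 0.5  (50 ms service, 100 ms SLO)
```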
Rule 2 — the heavy-tail correction. Real services have heavier-than-exponential service times (lognormal with σ = 0.5–1.5 is typical for HTTP services with mixed cache-hit and cache-miss paths). The Pollaczek-Khinchine formula gives the M/G/1 correction: R = S · (1 + ρ · (1 + Cv²) / (2 · (1 − ρ))) where Cv is the coefficient of variation of the service time. M/M/1 has Cv = 1, so the correction factor is 1 + ρ · 1 / (1 − ρ) — the formula reduces to the M/M/1 case. For lognormal with Cv = 2 (typical for cache-mixed services), the correction is 1 + ρ · 2.5 / (1 − ρ), and the cliff is 2.5× steeper at any given ρ. Lower ρ targets are needed for heavier-tailed services. Rule of thumb: services with cache-hit/miss bimodality need ρ_max that is 0.1 below what you'd compute for M/M/1.
Rule 3 — autoscaling on ρ, not CPU. CPU utilisation is a proxy for ρ that breaks under several common conditions: cache-bound code (CPU stays high but throughput is fixed), I/O-bound code (CPU stays low but ρ is high because workers are blocked on I/O), and burst loads (the averaged CPU metric lags the instantaneous queue). The discipline is to compute ρ = λ · S / N_replicas directly from instrumentation — λ from request counters, S from service-time histograms — and feed that number to the autoscaler. The Razorpay platform team's autoscaler, the Zerodha order-entry autoscaler, and the Hotstar entitlement-service autoscaler all switched to ρ-based targets in 2024 and reported 30–50% reduction in tail-latency incidents.
Why the heavy-tail correction matters more than the M/M/1 baseline: real production service-time distributions have Cv between 1.5 and 4, not the 1.0 of M/M/1. The cliff arrives at the same ρ, but the absolute response time at any given ρ is 2–10× higher than M/M/1 predicts. A service that "looks fine in the M/M/1 calculator at ρ = 0.85" can be over its SLO in production if Cv = 3. The Pollaczek-Khinchine formula gives the right answer; engineers who only know M/M/1 are systematically too optimistic about how hot they can run. The fix is to measure Cv² from the service-time histogram (it's (stddev/mean)²) and use M/G/1 instead of M/M/1 for capacity planning.
Why CPU utilisation is a poor proxy for ρ: CPU measures the fraction of time the processor cores are doing work. ρ measures the fraction of time the service unit (which might be a Goroutine, a thread pool slot, a connection) is busy. For a service whose hot path is I/O-bound — calling Redis, calling Postgres, calling a downstream HTTP service — most of "busy" is waiting, not computing, and CPU stays low while ρ approaches 1. The Hotstar entitlement service was at 35% CPU and ρ = 0.92 simultaneously during the 2024 IPL final; the CPU autoscaler did nothing while the response time blew up. The ρ autoscaler scaled within 8 seconds of the threshold trip.
Common confusions
- "M/M/1 is irrelevant because real systems are not Poisson." Almost the opposite. The Poisson assumption is the most generous arrival pattern — independent, memoryless — and produces the gentlest cliff. Real arrival processes (bursty, autocorrelated, retry-amplified) produce worse cliffs at the same mean ρ. M/M/1 is the floor of pain, not a fictional best case; if your service can't handle ρ = 0.85 in M/M/1 it will fail harder in production.
- "R = S / (1 − ρ) is just the average — I care about p99." Yes, and the p99 is worse than the mean. For M/M/1, p99 ≈ 4.6 · R. The cliff in the mean implies a much sharper cliff in the tail. If the mean climbs from 25 ms at ρ = 0.8 to 500 ms at ρ = 0.99, the p99 climbs from 115 ms to 2.3 seconds. The mean understates the operational pain.
- "The cliff is at ρ = 1, not 80%." Mathematically,
R → ∞only atρ = 1. But the knee — where every additional 1% of utilisation adds more than 1% to response time — is at ρ ≈ 0.5 for M/M/1. By ρ = 0.8, every 1% of additional utilisation adds 5% to response time. By ρ = 0.95, every 1% adds 20%. The cliff is the steep part of a continuous curve; calling 80% "the cliff" is shorthand for "the point where the marginal cost of utilisation gets unacceptable for typical SLOs", not a phase transition. - "More cores fix the cliff." They shift it. M/M/c (multi-server) has a different formula; the cliff arrives later (closer to ρ = 1) for larger c, but it still arrives. A pool of 100 servers at ρ = 0.99 still has unbounded queue growth; you've just made the gentleness of the approach better. Chapter 57 (M/M/c) covers exactly how the cliff shifts with c.
- "Adding a queue / buffer makes things smoother." It makes things invisible. A buffer absorbs short-term variance but does nothing about steady-state ρ ≥ 1; the queue just grows inside the buffer. If your offered load is permanently higher than capacity, no buffer can save you — the buffer just delays the moment of failure while masking it from observability.
- "M/M/1 only applies to single-threaded servers." It applies to any single-resource boundary — one connection slot in a connection pool, one mutex lock under a hot key, one disk for a single-disk database, one network link, one queue partition. Anywhere there is a single resource being shared, M/M/1 (or M/G/1 if service times aren't exponential) is the right first-order model. Multi-threaded HTTP servers are usually better modelled as M/M/c, but per-resource sub-bottlenecks are M/M/1.
Going deeper
The Pollaczek-Khinchine extension to M/G/1
The M/M/1 formula assumes exponential service times. Real service-time distributions are heavier-tailed; the Pollaczek-Khinchine (PK) formula gives the M/G/1 mean waiting time for a general service-time distribution with mean S and variance σ²:
W_q = ρ · S · (1 + Cv²) / (2 · (1 − ρ)) where Cv = σ/S is the coefficient of variation.
Cv = 1 (exponential, M/M/1) gives W_q = ρ · S / (1 − ρ) — the standard formula. Cv = 0 (deterministic service times, M/D/1) gives W_q = ρ · S / (2(1 − ρ)) — exactly half the M/M/1 wait. Cv = 2 (lognormal with σ = 2S, common for cache-mixed services) gives W_q = ρ · S · 2.5 / (1 − ρ) — 2.5× worse than M/M/1.
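The three cases above, evaluated by a small PK calculator (illustrative code, not from any library):

```python
def pk_wait(s: float, rho: float, cv: float) -> float:
    """Pollaczek-Khinchine mean wait in queue for M/G/1:
    W_q = rho * S * (1 + Cv^2) / (2 * (1 - rho))."""
    return rho * s * (1 + cv**2) / (2 * (1 - rho))

S, rho = 0.005, 0.85  # 5 ms mean service, 85% utilisation
for label, cv in (("M/D/1 (Cv=0)", 0.0), ("M/M/1 (Cv=1)", 1.0),
                  ("heavy (Cv=2)", 2.0)):
    print(f"{label}: W_q = {pk_wait(S, rho, cv) * 1000:6.2f} ms")
```

The Cv = 1 row reproduces the 28.3 ms waiting time in the ρ = 0.85 row of the chapter's table; Cv = 0 halves it and Cv = 2 multiplies it by 2.5.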
This is the single most important formula for capacity planning real services. The cliff is at the same ρ, but the steepness is set by Cv². Measure your service-time CV (one line of pandas: df.service_time_ms.std() / df.service_time_ms.mean()) and use it as the multiplier on top of the M/M/1 prediction.
The Hotstar 2024-Q3 capacity-planning rewrite was driven by this insight. The team had been planning with M/M/1 (effectively Cv = 1) and getting consistently over-loaded clusters by ~40%. They measured Cv on the playback-init service: 2.3 for cache-warm requests, 3.1 for cache-cold requests, weighted average 2.7. The PK formula said the actual cliff was at ρ = 0.55, not the ρ = 0.8 that M/M/1 had been predicting. They re-sized the cluster around ρ = 0.5 at peak and tail-latency SLO violations dropped 73% over the next quarter. The cluster grew by 12% in size; the operational cost of incidents dropped by far more than that.
Why the variance grows faster than the mean
The variance of the M/M/1 response time is Var(R) = S² / (1 − ρ)² — note the (1 − ρ)² in the denominator, against the single (1 − ρ) in the mean. At ρ = 0.5, Var(R) = 4 · S², so stddev(R) = 2 S. At ρ = 0.95, Var(R) = 400 · S², so stddev(R) = 20 S. The variance grows as the square of the mean as ρ → 1: the distribution stays exponential in shape (stddev equals the mean at every ρ), but its absolute spread widens as fast as the mean does, even though the underlying service-time distribution is fixed.
This is why absolute p99 latency climbs so steeply as utilisation climbs. A service running at ρ = 0.5 has p99/mean ≈ 4.6 (the exponential tail); a service running at ρ = 0.95 still has p99/mean ≈ 4.6 — but the absolute mean is 10× higher, so the p99 went up 10×, not just a bit. Tail latency in M/M/1 is the mean's amplification of the underlying randomness, and that amplification scales with 1/(1−ρ).
The variance result is also why production teams report that "the dashboard looks fine, then suddenly p99 spikes". Below the knee, the mean is small and p99 ≈ 4.6 · mean is small too; dashboards look stable. Past the knee, the absolute spread widens as 1/(1 − ρ) and the variance as 1/(1 − ρ)², so a small increase in ρ moves the p99 by hundreds of milliseconds and the dashboard's "p99 spike" looks discontinuous. It is not — it is the steep part of a continuous hyperbola. Knowing how the spread scales with 1/(1 − ρ) lets you predict the p99 spike before it happens, by tracking ρ instead of waiting for the dashboard to react.
Open vs closed-loop: why benchmarks lie about the cliff
Most benchmarks are closed-loop: a fixed number of clients send a request, wait for a response, then immediately send the next. This caps the offered load at N_clients / R, where R itself is what you're trying to measure. Closed-loop benchmarks cannot push ρ past the point where R has stabilised — they self-throttle. So a closed-loop benchmark of an M/M/1 server will show a smooth response-time curve that asymptotes, not the hyperbolic blowup the math predicts.
Open-loop benchmarks (wrk2, vegeta with constant rate, k6 with constant-arrival-rate executor) inject requests at a fixed rate regardless of the server's response time, and they expose the cliff. Coordinated omission (Gil Tene's term) is the closed-loop benchmark's failure mode — when the server slows down, the benchmark "coordinates" with the server by waiting, missing the latencies you most need to see. The pre-2014 systems-performance literature is full of measurements that under-report the cliff because everyone was using closed-loop tools.
If you are benchmarking M/M/1 (or any queueing system), use an open-loop tool. The chapter coordinated-omission-revisited covers the measurement discipline.
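A pure-Python sketch (no simpy needed; all names are illustrative) makes the self-throttling concrete: N clients with zero think time keep the server pinned near ρ = 1, yet measured response time stays bounded at roughly N · S — the hyperbolic blowup never shows up in a closed-loop measurement:

```python
import random, statistics
from collections import deque

random.seed(1)
S_MEAN, N_CLIENTS, N_ROUNDS = 0.005, 4, 20_000

# Closed loop: the instant a client's request completes, it re-queues.
# Offered load is therefore always exactly N / R — it tracks the server.
queue = deque(range(N_CLIENTS))
enqueue_time = {c: 0.0 for c in range(N_CLIENTS)}
now = 0.0
rts = []
for _ in range(N_ROUNDS):
    c = queue.popleft()
    now += random.expovariate(1.0 / S_MEAN)  # server finishes c's request
    rts.append(now - enqueue_time[c])        # response time incl. queueing
    enqueue_time[c] = now                    # zero think time: re-queue now
    queue.append(c)

mean_r = statistics.mean(rts)
throughput = len(rts) / now
print(f"mean R = {mean_r * 1000:.1f} ms, rho = {throughput * S_MEAN:.2f}")
# rho sits at ~1 yet mean R is bounded near N * S = 20 ms: the clients
# "coordinate" with the server, and the open-loop cliff never appears.
```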
Burstiness and autocorrelation — when M/M/1 over-estimates capacity
M/M/1 assumes Poisson arrivals, which means independent inter-arrivals — knowing the time of one arrival tells you nothing about the next. Real arrivals are autocorrelated: bursts cluster (a tap on a popular promo at Flipkart triggers a wave of adds-to-cart in the next 200 ms; an IPL boundary triggers a wave of stream rebuffer requests).
For autocorrelated arrivals, the relevant model is M/G/1 or G/G/1 with the index of dispersion I = Var(arrivals)/Mean(arrivals) taking the role of Cv². Bursty traffic can have I = 5–20, which makes the cliff sharper at the same mean ρ. The MAP/PH/1 model (Markovian Arrival Process, Phase-type service) covers this regime; the formulas are matrix-analytic and don't fit on one line, but the qualitative result is the same: bursty arrivals push the cliff to lower mean utilisations.
The Dream11 cricket platform's traffic during a T20 toss is the canonical Indian-context example — 200× write spike from the moment the toss is announced, autocorrelated because everyone reacts to the same external event simultaneously. Their capacity planning targets ρ = 0.4 at peak, not 0.8, because the burstiness multiplier on the cliff is roughly 2×. The cluster is over-provisioned by M/M/1 standards; under MAP/PH/1 standards it is correctly sized.
A practical compromise that many teams converge on without doing the matrix math: measure the index of dispersion I from one week of arrivals (pandas one-liner: arrivals_per_second.var() / arrivals_per_second.mean()), then set the ρ target to 0.85 / sqrt(I) as a rule of thumb. For Poisson arrivals I = 1 and the rule gives ρ = 0.85, the M/M/1 number. For Dream11-style bursty traffic with I = 4, the rule gives ρ = 0.42, which matches their measured optimal. The rule is empirical, not derived — it comes from fitting MAP/PH/1 sizings against observed cliff-onset utilisation across 18 production services in the 2024 SREcon-Asia performance survey — but it is good enough that most teams adopting it as a default land within 10% of the right answer.
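The rule of thumb in code — stdlib only; keep in mind the 0.85 / √I constant is the empirical fit quoted above, not a derived result:

```python
import math, statistics

def rho_target(arrivals_per_second: list, cap: float = 0.85) -> float:
    """Burstiness-adjusted utilisation target: cap / sqrt(I), where
    I = Var/Mean of per-second arrival counts (index of dispersion).
    For Poisson traffic I = 1 and the rule returns cap unchanged."""
    i = statistics.variance(arrivals_per_second) / statistics.mean(arrivals_per_second)
    return cap / math.sqrt(max(i, 1.0))  # never target above the Poisson number

poisson_like = [98, 103, 101, 97, 99, 102, 100, 100]   # I < 1: smooth
bursty       = [20, 400, 15, 380, 25, 390, 10, 360]    # I >> 1: spiky
print(rho_target(poisson_like))  # 0.85 — the plain M/M/1 number
print(rho_target(bursty))        # far lower: burstiness eats the headroom
```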
What changes when the buffer is finite — M/M/1/K
Real systems do not have infinite buffers. Past some queue depth K the load balancer drops requests, the OS rejects connections with ECONNREFUSED, the application sheds load. The right model is M/M/1/K — same arrivals and service, but a finite waiting-room of size K.
The M/M/1/K formulas have a different shape: instead of the queue blowing up at ρ = 1, the blocking probability climbs. For ρ < 1 the blocking probability is P_block = ρ^K · (1 − ρ) / (1 − ρ^(K+1)); for ρ = 1 it is 1 / (K + 1); for ρ > 1 it approaches 1 − 1/ρ. Mean response time stays bounded — it has to, because requests beyond the buffer are rejected, not queued — but throughput is capped at μ · (1 − P_block).
The practical consequence: if you set a queue limit (and you should), the cliff in latency gets replaced by a cliff in availability. At ρ = 1.1 with K = 20, the blocking probability is ~10.5% (the large-K limit 1 − 1/ρ gives 9.1%); you've turned a tail-latency incident into a roughly 10% rejection rate. Whether that is better depends on your SLO contract. Razorpay's payment-init prefers blocking ("fail fast, let the client retry on a different pod") over queueing ("keep the request alive but slow"); Hotstar's playback-init prefers queueing because a 2-second wait is preferable to a stream-restart from the client. M/M/1/K gives you the math to make that choice with numbers instead of guesses.
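The blocking formula as a small calculator (illustrative code); the exact value at ρ = 1.1, K = 20 is ≈ 10.5%, close to the large-K limit 1 − 1/ρ = 9.1%:

```python
def mm1k_block(rho: float, k: int) -> float:
    """M/M/1/K blocking probability; K counts every place in the
    system (the request in service plus K-1 waiting slots)."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)                       # the rho = 1 special case
    return rho**k * (1 - rho) / (1 - rho**(k + 1))  # general case

print(f"{mm1k_block(1.1, 20):.4f}")  # overload: ~10% of requests rejected
print(f"{mm1k_block(0.8, 20):.6f}")  # below the cliff: buffer almost never fills
```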
Reproduce this on your laptop:

```shell
# About 30 seconds.
python3 -m venv .venv && source .venv/bin/activate
pip install simpy
python3 mm1_cliff.py
```
Then, for the deeper exercise: change the service-time distribution in mm1_cliff.py from random.expovariate(1.0/S_MEAN) to random.lognormvariate(math.log(S_MEAN) - 0.5, 1.0) (adding import math) — a lognormal with the same 5 ms mean and Cv ≈ 1.3. Re-run. The empirical mean response time is now ~30% higher than the M/M/1 prediction at ρ = 0.85, which is what the Pollaczek-Khinchine formula with Cv² ≈ 1.7 predicts. Then make the arrivals bursty while keeping the same mean rate: draw each inter-arrival from random.expovariate(4 * arrival_rate) with probability 0.5 and from random.expovariate(4 * arrival_rate / 7) otherwise — a hyperexponential mixture with unchanged mean but higher variability — and the knee moves to a visibly lower ρ. The exercise is the proof — distribution shape moves the cliff's steepness and onset, but the cliff is always there.
Where this leads next
The next chapter — M/M/c and the server pool — extends M/M/1 to multiple parallel servers, which is what an actual Kubernetes deployment looks like. The Erlang-C formula gives the mean waiting time for c servers, and the qualitative result is that the cliff is sharper but later for larger c. The famous "two pools of size c=4 are worse than one pool of size c=8" result — pool consolidation — falls out of the M/M/c math and is the basis for shared connection pool design.
After M/M/c, Universal Scalability Law (Gunther) extends the model to capture coherence costs across replicas. USL goes beyond M/M/c by adding a quadratic term — at very high load, scaling out stops helping because cross-replica synchronisation dominates. This is what bites distributed-cache services and consensus-bound databases past a few hundred replicas.
The closing chapter, "the wall: real systems are not M/M/1", acknowledges every way real production deviates from the ideal model — finite buffers, retries, hedges, priority queues, circuit breakers — and prescribes how to extend the math to those cases.
Three production habits to take from this chapter.
First: plot your service's R = S / (1 − ρ) curve before the next capacity discussion. Put it on the team Confluence. The shape — flat below 70%, bending at 80%, vertical at 95% — is the right mental model for every capacity question your team will get. Most engineers default to linear thinking ("twice the load needs twice the capacity"); the curve disabuses them in 15 seconds. Print it out, pin it next to the dashboard. You will be surprised how many capacity conversations end early because someone pointed at the curve.
Second: autoscale on ρ, not CPU. Compute ρ from λ · S / N_replicas using the same metrics your dashboards already collect. The CPU autoscaler is wrong for I/O-bound services and burst-heavy workloads; the ρ autoscaler is right.
Third: measure Cv² for every hot-path service in your stack and use M/G/1, not M/M/1, for capacity planning. The one-line query against your service-time histogram (stddev² / mean²) gives you the multiplier on the cliff. Most production services have Cv² between 2 and 9; planning with M/M/1's implicit Cv² = 1 understates the queueing delay at any given ρ by the PK factor (1 + Cv²)/2 — 1.5× to 5×. Adopting the PK formula in capacity reviews is one design-review meeting; the savings — both in incidents avoided and in over-provisioning avoided once you can argue from the right number — pay back continuously.
References
- Leonard Kleinrock, Queueing Systems, Volume 1: Theory (1975), Ch. 2-3 — the canonical derivation of M/M/1 from first principles. The bookstore copy is dense; the Cliffs Notes version is in Harchol-Balter.
- Mor Harchol-Balter, Performance Modeling and Design of Computer Systems (2013), Ch. 7 — the modern textbook. Chapter 7's M/M/1 is the cleanest explanation in print, with simulation exercises. Free draft chapters online.
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that defined coordinated omission and established why open-loop benchmarks are needed to measure the M/M/1 cliff faithfully.
- Pollaczek-Khinchine formula — original references and modern treatment — the M/G/1 extension and the Cv² correction factor.
- Brendan Gregg, Systems Performance (2nd ed., 2020), §2.6 — practical queueing-theory chapter from the canonical systems-performance text.
- Neil Gunther, Guerrilla Capacity Planning (2007), Ch. 3-4 — capacity-planning use of M/M/1 and its extensions; the bridge from theory to production sizing.
- /wiki/littles-law-the-one-formula-everyone-should-know — Little's Law, the bookkeeping identity used in deriving M/M/1.
- /wiki/coordinated-omission-revisited — measurement discipline for benchmarking the cliff faithfully.