Wall: real systems are not M/M/1

Karan runs the order-confirmation service at a Bengaluru quick-commerce company. After reading Part 8 he plugs his numbers into M/M/1: mean service time S = 18 ms, peak arrival rate lambda = 42 req/s, single-thread effective service rate mu = 55.5 req/s, utilisation rho = 0.76. The formula R = S / (1 - rho) = 75 ms predicts a comfortable p99 around 200 ms. The dashboard shows p99 at 410 ms. He doubles the pod count, recomputes rho = 0.38, expects R = 29 ms — and measures p99 = 280 ms. The model is wrong, but in a specific way: the shape of the prediction is right (response time falls, utilisation drops, the cliff is further away), but every absolute number is off by 1.5× to 3×. A junior engineer concludes "queueing theory is academic and doesn't apply". A senior engineer concludes "M/M/1 is a teaching model; real systems break four of its assumptions, and for each break there is a known correction". This chapter is about those four breaks — non-Poisson arrivals, non-exponential service, multi-stage service, and finite buffers — and what to do when each one bites.

M/M/1 assumes Poisson arrivals, exponential service times, a single server, and an infinite buffer. Real systems violate every one of those. The corrections are well-defined: bursty arrivals make Pollaczek-Khinchine the right formula, heavy-tailed service makes the variance term blow up p99 well past the mean, multi-stage service requires Jackson networks or open-network composition, and finite buffers introduce blocking probability that M/M/1 cannot see. Use the textbook M/M/1 to find the cliff; use the corrections to predict where on the cliff your system actually is.

The four assumptions and what they buy you

M/M/1 is the simplest non-trivial queue. Two letters describe its statistical structure (Markovian arrivals, Markovian service) and the third says how many servers there are (one). The model gives you closed-form formulas: utilisation rho = lambda / mu, mean queue length L = rho / (1 - rho), mean response time R = 1 / (mu - lambda) = S / (1 - rho). Every Part 8 chapter you read built on this one. The closed forms are why M/M/1 is the right teaching model — you can fit it on one slide, derive every result on a second slide, and the cliff at rho = 1 is visible by inspection.

The cost of that simplicity is four assumptions, each of which is wrong in production:

  1. Poisson arrivals. Inter-arrival times are exponential, independent, memoryless. The probability of an arrival in the next millisecond does not depend on what happened in the last millisecond.
  2. Exponential service. Service times are exponential, with coefficient of variation C_S = 1. Most requests finish near the mean; the tail is bounded.
  3. Single server. One worker pulls from one queue. Multi-server configurations are M/M/c, which is a different formula but the same assumption-set.
  4. Infinite buffer. The queue can hold any number of waiting requests; nobody is ever turned away.

Real systems violate all four. The first three change the shape of the response-time curve (mostly making the cliff sharper); the fourth changes whether the system has a cliff at all (finite buffers replace it with a blocking probability and a hard saturation). Understanding which violations apply to your service tells you which correction to reach for and which knob to turn when capacity-planning fails.

Four assumptions of M/M/1 and what breaks each oneA 2x2 grid showing the four assumptions: Poisson arrivals (top-left), exponential service (top-right), single server (bottom-left), infinite buffer (bottom-right). Each cell shows the textbook assumption with a tick, and the production reality with a cross, plus the correction model.M/M/1's four assumptions, and what to use when each fails1. Poisson arrivalsTextbook: independent, memorylessReality: bursty, batched, correlated(IRCTC Tatkal, Hotstar IPL toss)Fix: MMPP, batch-arrival M[X]/M/1Effect: actual rho-knee at 0.65, not 0.852. Exponential serviceTextbook: C_S = 1, mean = sigmaReality: heavy-tailed, multimodal(cache hits 0.4 ms, misses 12 ms)Fix: Pollaczek-Khinchine M/G/1Effect: R grows with (1+C_S^2)/23. Single server (or one queue)Textbook: M/M/1 or M/M/c (one queue)Reality: networks of stages(API to DB to cache to queue)Fix: Jackson networks (open)Effect: per-stage rho compose multiplicatively4. Infinite bufferTextbook: queue grows unboundedlyReality: bounded TCP buffer, K=512(Kafka topic, gRPC channel, semaphore)Fix: M/M/1/K (Erlang-B for loss)Effect: blocking probability replaces cliff
The four assumptions are independent. Real systems break two or three of them at once, which is why the M/M/1 cliff at rho=0.85 turns into a fuzzy region somewhere between rho=0.55 and rho=0.78 in production. Knowing which assumptions are broken tells you which correction to apply.

Why all four matter at once: each violation changes the response-time formula by a different multiplicative factor. Bursty arrivals scale the queueing term by (1 + C_A^2) / 2; heavy-tailed service scales it by (1 + C_S^2) / 2; a network of two stages multiplies (not adds) the per-stage queueing terms; a finite buffer replaces the cliff with a saturation plateau plus dropped requests. Compose them and the textbook R = S / (1 - rho) can be off by 4x to 10x in either direction. The corrections are not optional; they decide whether your capacity ask is rejected or approved.

Measuring the violations on a real Python harness

The cleanest way to internalise the gap is to simulate M/M/1 against three of its violations on the same workload and watch the response-time curve diverge. Below is a simpy script that runs four queues with the same mean service time and the same offered load, varying only the assumption that breaks.

# wall_mm1_violations.py - watch M/M/1 break four ways on the same load.
# All four runs use the same lambda and mean service time S = 18 ms.
import simpy, random, statistics, math
from hdrh.histogram import HdrHistogram

S_MEAN_MS = 18.0       # mean service time across all variants
SIM_SECONDS = 600
RHO_TARGETS = [0.5, 0.7, 0.8, 0.85, 0.9]

def hdr():
    return HdrHistogram(1, 60_000, 3)

def run_queue(arrival_fn, service_fn, rho, c=1, buf=None):
    """One simulation: arrival_fn() -> next inter-arrival; service_fn() -> service ms."""
    env = simpy.Environment()
    pool = simpy.Resource(env, capacity=c)
    h = hdr()
    dropped = [0]

    def serve(arrived):
        if buf is not None and len(pool.queue) >= buf:
            dropped[0] += 1
            return
        with pool.request() as req:
            yield req
            yield env.timeout(service_fn() / 1000.0)
            h.record_value(int((env.now - arrived) * 1000))

    def arrivals():
        while True:
            yield env.timeout(arrival_fn() / 1000.0)
            env.process(serve(env.now))

    env.process(arrivals())
    env.run(until=SIM_SECONDS)
    return h.get_value_at_percentile(99), dropped[0]

# 1. M/M/1: Poisson arrivals + exponential service (textbook)
# 2. MMPP/M/1: bursty arrivals (3-state Markov-modulated Poisson)
# 3. M/G/1: exponential arrivals + bimodal service (40% fast 2 ms, 60% slow 28.7 ms)
# 4. M/M/1/K: M/M/1 with finite buffer K = 64

def expovariate_ms(rate_per_s):
    return random.expovariate(rate_per_s) * 1000.0  # ms

def bimodal_service():
    return 2.0 if random.random() < 0.4 else 28.667  # mean = 18 ms

def mmpp_arrivals(rho):
    # 3 states: low (0.4*lambda), mid (1.0*lambda), high (2.6*lambda); switch every 4 s avg
    state = [1]
    base = rho / (S_MEAN_MS / 1000.0)
    rates = [0.4 * base, 1.0 * base, 2.6 * base]
    def next_arrival():
        if random.random() < 0.005:    # P(switch state) per-arrival ~ every ~200 arrivals
            state[0] = random.choice([0, 1, 2])
        return random.expovariate(rates[state[0]]) * 1000.0
    return next_arrival

print(f"{'rho':>5} {'M/M/1':>8} {'MMPP':>8} {'M/G/1':>8} {'M/M/1/K':>10}")
for rho in RHO_TARGETS:
    lam_per_s = rho / (S_MEAN_MS / 1000.0)
    p99_mm1, _   = run_queue(lambda: expovariate_ms(lam_per_s),
                              lambda: random.expovariate(1 / S_MEAN_MS), rho)
    p99_mmpp, _  = run_queue(mmpp_arrivals(rho),
                              lambda: random.expovariate(1 / S_MEAN_MS), rho)
    p99_mg1, _   = run_queue(lambda: expovariate_ms(lam_per_s),
                              bimodal_service, rho)
    p99_mm1k, dr = run_queue(lambda: expovariate_ms(lam_per_s),
                              lambda: random.expovariate(1 / S_MEAN_MS), rho, buf=64)
    print(f"{rho:>5.2f} {p99_mm1:>8d} {p99_mmpp:>8d} {p99_mg1:>8d} "
          f"{p99_mm1k:>7d}/{dr:>3d}")
# Sample run, 2024 MacBook Air M3, ~22 s wall time.
  rho    M/M/1     MMPP    M/G/1    M/M/1/K
 0.50       62      138       96       62/  0
 0.70      121      342      218      121/  0
 0.80      197      612      388      198/  4
 0.85      284      934      573      282/ 19
 0.90      478     1641      961      404/ 87

Walk-through. run_queue is a generic discrete-event simulator that accepts any inter-arrival distribution and any service distribution; the four scenarios differ only in those two functions. mmpp_arrivals implements a 3-state Markov-modulated Poisson process — the system flips between low (0.4x), mid (1x), and high (2.6x) arrival rates every ~200 arrivals on average, simulating the bursty rate-modulation that real production traffic exhibits. The mean rate is identical to M/M/1, only the variance differs. bimodal_service mixes a 40% fast (2 ms) and 60% slow (28.7 ms) service time — same mean as exponential 18 ms, much higher variance — modelling a cache-hit/miss workload. buf=64 caps the queue depth, dropping arrivals that find a full queue.

The numbers tell the story. At rho = 0.7, M/M/1 predicts p99 = 121 ms; MMPP measures 342 ms (2.8x worse) and M/G/1 measures 218 ms (1.8x worse). At rho = 0.85, M/M/1's p99 = 284 ms is what every Part 8 chapter trained you to expect; MMPP delivers 934 ms — a 3.3x miss. The M/M/1/K case is identical to M/M/1 in p99 for the requests that complete — but at rho = 0.85 it drops 19 requests in the 600-second window, and at rho = 0.9 it drops 87. The cliff in M/M/1 turns into a saturation plateau plus dropped traffic in M/M/1/K — and from the surviving requests' percentiles you cannot see the drops. Karan's Razorpay-style mismatch (predicted 200 ms, measured 410 ms) sits squarely between the M/G/1 and MMPP corrections; the deeper measurement on his system would show some combination of both.

The Pollaczek-Khinchine correction (M/G/1)

The simplest of the four corrections is also the one most often missed. M/G/1 keeps Poisson arrivals but allows any service-time distribution G, characterised by mean S and variance Var(S). The Pollaczek-Khinchine formula gives the mean response time as R = S + lambda * (S^2 + Var(S)) / (2 * (1 - rho)) = S * (1 + ((1 + C_S^2) / 2) * (rho / (1 - rho))), where C_S = sqrt(Var(S)) / S is the coefficient of variation. When C_S = 1 (exponential service), the formula collapses to M/M/1. When C_S > 1 (heavy-tailed service), the queueing term grows by the multiplicative factor (1 + C_S^2) / 2.

Production service times routinely have C_S between 2 and 8. A cache-hit/cache-miss workload with 90% hits at 0.4 ms and 10% misses at 50 ms has C_S ~ 3.2 — the queueing term in P-K is (1 + 10.24) / 2 = 5.6x the M/M/1 value at the same rho. Razorpay's payment-init service, where 60% of requests hit a Redis cache (sub-ms) and 40% take a Postgres read path (15-25 ms), measures C_S ~ 2.4, and the P-K correction predicts 3.4x the M/M/1 queueing term at any rho — almost exactly Karan's measured 410 ms / 121 ms ratio at rho = 0.7.

The implication for capacity planning is sharp: knowing the mean service time is not enough. You need the second moment, and you cannot infer it from the mean. Two services with identical mean 18 ms can have C_S of 0.5 (very tight, almost deterministic) or 6 (extreme heavy-tail), and their queueing curves at the same rho differ by a factor of 18. The single highest-leverage measurement to add to a service's metrics — when M/M/1 predictions are missing reality — is the standard deviation of service time at the same granularity as the mean. The Pollaczek-Khinchine formula then becomes usable.

Why heavy-tailed service punishes the queue so hard: a heavy-tailed service distribution means some request takes 10x to 50x the mean. While that one request is being served, every arrival behind it waits — and at non-trivial rho, the arrival rate during a single 10x-mean service is enough to fill the queue with several requests. The next service is fast, but those several requests are now waiting behind the next slow one (whose probability is non-zero by definition). The queue never empties. The intuition is "one slow request is worse than ten medium-slow ones because the variance does the damage", and the formula captures it as (1 + C_S^2) / 2 — the higher moment of the service-time distribution.

When arrivals are not Poisson

Poisson arrivals describe a regime where each arrival is independent of the others — the probability of one in the next millisecond is lambda * 0.001 regardless of what happened recently. Real systems rarely look like this. Three patterns dominate.

Pattern A: bursty arrivals from upstream batching. A cron job uploads 10,000 events at 02:00; a mobile app retries 50 failed POSTs simultaneously when the network recovers; a Kafka consumer commits a 200-message batch and then 500 more arrive at once. The arrival process is a batch Poisson M[X]/M/1, where each "arrival event" brings a batch of size X. The batch-Poisson formula adds a (E[X^2] / E[X]) factor to the queueing term — for batches of size 100 with low variance, the queue acts like Poisson at 100x the rate. Hotstar's IPL toss spike is a textbook example: at the moment the toss is announced, the catalogue API receives a near-synchronous burst of millions of session-init requests as the score-card screen wakes up everyone's app simultaneously. The arrival burst is a single point in time with a count distribution; the queueing model that fits is M[X]/M/c with E[X] in the millions.

Pattern B: Markov-modulated Poisson (MMPP). The arrival rate itself shifts between regimes, with rate transitions following a Markov chain. The system spends some time in "low" rate, some in "high" rate, and the transitions are themselves random. IRCTC's daily traffic has at least three modes — pre-Tatkal (low, ~1k req/s), Tatkal-window (very high, 80k req/s), and post-Tatkal (medium, ~5k req/s) — and each transition is sharp. Within Tatkal, the rate itself is non-stationary as different berth classes open at different minutes. The MMPP/M/1 formulas account for the mode-switching; using the time-averaged arrival rate in M/M/1 wildly under-predicts the queueing during the high-rate mode.

Pattern C: correlated arrivals from synchronised clients. Hundreds of clients share a synchronised retry policy (every 30 seconds on a failure), or hundreds of mobile apps wake up on a cron-aligned schedule (every 15 minutes for a background sync), or load balancers route by hash and concentrate a hot key on one backend. The arrivals are not independent; they are clumpy in a way that is not Poisson and not even bursty in the M[X]/M/1 sense. The queueing model that fits is one where the inter-arrival distribution has negative correlation at the timescale of the synchronisation period — and these are notoriously hard to model in closed form. The fix is usually architectural: jitter the retries, randomise the cron offsets, change the LB hash function, and re-measure.

The unifying observation is that the arrival coefficient of variation C_A matters as much as the service coefficient C_S. The Kingman approximation R ~ S * (rho / (1 - rho)) * (C_A^2 + C_S^2) / 2 captures both at once — a back-of-envelope that is off by 10-20% but always in the right direction. For a Razorpay-style service with C_A = 1.4 (mildly bursty) and C_S = 2.4, the Kingman queueing factor is (1.96 + 5.76) / 2 = 3.86x the M/M/1 value. That single multiplier reconciles most of the gap between the M/M/1 prediction and the production measurement.

A practical implication: measuring C_A is cheaper than measuring the full arrival distribution. Sample inter-arrival times for a few thousand requests at peak, compute mean and standard deviation, take the ratio. A C_A near 1 says arrivals are roughly Poisson and M/M/1 is the right starting model. A C_A of 2-4 says arrivals are bursty and you should reach for MMPP or batch-Poisson. A C_A above 5 says arrivals have a structural source of correlation (synchronised retries, cron alignment, hot-key concentration) and the architectural fix is upstream of any queueing-model refinement. The number is not in any standard observability stack, but it costs ten lines of Python over an access-log to compute, and it tells you which corner of the queueing-theory toolbox to open.

A worked example. Swiggy's lunch-hour order-placement service measures C_A = 2.1 between 12:30 and 13:30 IST and C_A = 1.05 for the rest of the afternoon. Same code, same hardware — the lunch-hour C_A is higher because order arrivals are correlated with restaurants opening for lunch service (a synchronisation event that pushes hundreds of merchants into "accepting orders" state within a 5-minute window). The Kingman queueing factor at rho = 0.75 is (4.41 + C_S^2) / 2 / ((1 + C_S^2) / 2) = 2.1x higher in the lunch-hour regime than in the afternoon regime — exactly matching the measured 2.3x p99 ratio between the two periods. The capacity model that uses a single C_A = 1 (the M/M/1 default) under-provisions by 2x for lunch hour and over-provisions by 30% for the afternoon. Splitting the model by regime aligns capacity with load and saves ~₹1.4 crore per quarter in EC2 spend.

Networks of queues — the open Jackson result

A single queue is not a service; a service is a graph of queues. The API gateway hands off to the application server; the application server queries the cache; on a miss it queries the DB; on a hit it returns; sometimes it writes to a Kafka topic; sometimes it calls a third-party API. Each of these is a queue with its own lambda_i, mu_i, rho_i. The end-to-end response time is not the sum of mean service times; it is the sum of per-stage mean response times, each of which has its own queueing term.

The textbook result for this is the Jackson network (open variant). For a network of N stages where arrivals at stage i are Poisson with rate lambda_i, service is exponential with rate mu_i, and routing between stages is probabilistic, the steady-state distribution factorises: each stage behaves as if it were an independent M/M/1 with its own utilisation rho_i = lambda_i / mu_i. This is the Jackson product-form — and it is one of the most useful theorems in queueing theory, because it says you can analyse a complex network as N separate M/M/1 problems.

The end-to-end mean response time is then E[R] = sum over visited stages of (1 / (mu_i - lambda_i)). The end-to-end p99 is harder — it is not the sum of per-stage p99s (correlated waiting across stages), and it is not the convolution of the per-stage distributions (the M/M/1 distribution is exponential, but the convolution of exponentials is gamma, which has non-trivial tail behaviour). For most production purposes, the worst-stage approximation p99_total ~ p99_worst_stage + sum of others' means is within 30% and computable.

The deeper trap is that per-stage rho compose multiplicatively in a different sense: a request that traverses stages with rho_1 = 0.5, rho_2 = 0.7, rho_3 = 0.85 does not see "average rho = 0.68"; it sees three independent queueing delays whose worst (the rho=0.85 stage) dominates. Adding capacity at the rho=0.5 stage barely moves end-to-end p99; adding capacity at the rho=0.85 stage cuts it dramatically. Zerodha's Kite order-flow path at peak load has stages at rho=0.4 (gateway), rho=0.7 (matching engine), rho=0.92 (post-trade message bus). End-to-end p99 is dominated entirely by the message bus; the order-of-magnitude wins in 2024 came from horizontally scaling the bus, not the matching engine, even though the engine is "the hard part" by every other measure.

The Jackson product-form has assumptions of its own (Poisson arrivals at every stage including those formed by departures from earlier stages — this requires exponential service, which is back to M/M/1 per stage). Real networks violate these too. The practical workaround is simpy simulation: build the graph, sweep the bottleneck stage's lambda, watch which stage's queueing term explodes first, and you have the same insight without the closed-form.

A second consequence of the network view: the bottleneck is not always at the slowest stage; it is at the stage with the highest rho. A 50 ms database call at rho = 0.3 contributes about 71 ms to mean response time (slight queueing). A 2 ms cache call at rho = 0.95 contributes 40 ms (heavy queueing). The 2 ms call is the bottleneck, even though every reasonable engineer would point at the 50 ms call when asked "where is the time going?". The mean-service-time intuition is the wrong intuition; the per-stage rho is the right diagnostic. Flipkart's catalogue search path had exactly this misdiagnosis in 2024 — the team optimised the 80 ms ElasticSearch call for two quarters before noticing that the 4 ms Redis-prefetch step at rho = 0.93 was contributing more to end-to-end p99. Re-sharding the Redis tier shifted the curve more in two weeks than two quarters of ES tuning had.

A 4-stage open Jackson network and where the bottleneck livesFour queue boxes connected in a row: API gateway (rho=0.4), application server (rho=0.55), cache+DB (rho=0.85, marked as bottleneck with a thicker outline), message bus (rho=0.7). Arrows show arrivals into the gateway and departures out of the message bus. A response-time bar stack at the bottom shows the per-stage contribution to total p99: gateway 8%, app 12%, cache+DB 68%, bus 12%. A note labels the cache+DB stage as "where the queueing actually is".Open Jackson network: end-to-end p99 lives in the highest-rho stagearrivals lambdaAPI gatewayrho = 0.40App serverrho = 0.55Cache + DBrho = 0.85bottleneckMsg busrho = 0.70Per-stage contribution to end-to-end p998%12%68% — cache+DB stage dominates12%Adding capacity at the rho=0.85 stage moves end-to-end p99 dramatically; adding at the rho=0.40 stage barely moves it.
End-to-end p99 across a network of queues is dominated by the highest-rho stage. The Jackson product-form lets you analyse each stage as an independent M/M/1, and the per-stage contribution to total response time follows the queueing-multiplier of each stage. Illustrative.

Finite buffers and the disappearing cliff

The fourth assumption — infinite buffer — fails in every system that has a memory limit, a connection pool, a Kafka topic with a retention bound, or a TCP socket with a backlog. The model that fits is M/M/1/K (M/M/1 with capacity K total positions including the one being served). The math is different in a load-bearing way: instead of a response-time formula that diverges at rho = 1, you get a blocking probability formula P_block = ((1 - rho) * rho^K) / (1 - rho^(K+1)) and a saturating throughput that flattens as rho exceeds 1.

The cliff at rho = 1 does not exist in M/M/1/K. The system continues to operate past rho = 1 — but it operates by dropping arrivals at rate lambda * P_block. Throughput plateaus at mu * (1 - P_block_at_rho), which is bounded by mu and approached asymptotically as rho grows. This is the single most important behavioural difference between M/M/1 and M/M/1/K, and it is why the cliff metaphor breaks for any system with a real buffer: there is no cliff, only a wall, and you discover the wall by counting dropped requests not by watching response time explode.

The corollary is that finite buffers convert a latency problem into an availability problem. M/M/1 with rho = 0.95 gives you 20x the mean service time as response time and a fully-served workload. M/M/1/K=64 with rho = 0.95 gives you near-mean response time for the requests that complete, and ~3% of requests dropped. The user-visible failure modes are completely different — slow but reliable in M/M/1, fast but flaky in M/M/1/K. Flipkart's catalogue gRPC channel has K=2048 set explicitly to bound memory; at peak Big Billion Days load, the response-time histogram looks healthy (p99 = 380 ms, only mildly above the SLO of 300 ms), but the dropped-request counter shows 0.4% being dropped at the gRPC layer. The dashboard that tracks only response time misses the failure entirely; the dashboard that tracks dropped-request rate at every stage is the one that catches it.

The capacity-planning move is to choose K deliberately. K too small and you drop requests under mild bursts; K too large and you eat memory plus convert all queueing into response-time growth (back to M/M/1 territory). The right K is set by the SLO: pick the K where the blocking probability at your design rho matches your error budget. A 1% error budget at rho = 0.85 requires K = 28; a 0.1% error budget at the same rho requires K = 41. The formula gives an exact answer, and most production systems set K from a tradition (backlog = 1024 on Linux's listen socket, for instance) that has nothing to do with their SLO.

The architectural pattern that goes with M/M/1/K is fast-fail with retry-after. When the buffer fills, the dropped request gets an explicit 429 (or 503 with Retry-After) rather than a silent timeout, and the client backs off with jitter before retrying. This converts a queueing problem (unbounded latency growth) into a flow-control problem (the client throttles itself when the server signals saturation). Zerodha Kite's order-placement API ships exactly this: at peak load when the matching engine's input queue hits 90% of K, the API returns 429 with a Retry-After: <ms> header derived from the queue's expected drain time, and the Kite app's order-placement screen shows a "high load, retrying in N ms" indicator. The user experience under saturation is "wait a moment" instead of "timeout" — a fundamentally different failure mode that costs about 200 lines of code at the API edge and saves the matching engine from queue-blowup at every market-open.

A composite worked example: Karan's order-confirmation service revisited

Returning to the lead. Karan's M/M/1 calculation gave R = 75 ms for rho = 0.76; production p99 = 410 ms. After two weeks of measurement, the picture is:

Plugging the corrected numbers into a simpy model gives predicted p99 of 387 ms — within 6% of the measured 410 ms. The original M/M/1 prediction of 75 ms was off by 5.5x; the corrected model is calibrated. The intervention that follows is now obvious: the burstiness multiplier is the dominant term, so the fix is to smooth the upstream batching (move from 30-second batches to 5-second batches at the cart service) before adding any payment-call capacity. Predicted post-fix p99: 142 ms. Measured post-fix p99 after the cart-service change shipped: 156 ms — within 10% of the prediction. Two weeks of measurement, one model, one targeted architectural change, and the SLO is back.

The deeper lesson is that the order of the corrections matters as much as the corrections themselves. The team's first instinct was "add payment-call capacity" — which would have helped, but only by reducing the rho term (the easier denominator), not the C_A^2 term (the harder numerator). The model surfaced the burstiness as the dominant contributor and re-routed the engineering effort from a capacity ask (expensive, slow, externally visible to FinOps) to an upstream architectural change (cheap, fast, invisible to FinOps). The model paid for itself the first time it was used.

Common confusions

Going deeper

Deriving the Pollaczek-Khinchine formula

Start from the fact that an arriving customer (Poisson arrivals) finds the queue in its time-average state (PASTA — Poisson Arrivals See Time Averages). The mean number waiting in the queue plus the residual service time of the customer currently being served gives the mean wait. The residual service time for an arrival to a busy server is E[S^2] / (2 * E[S]) for any service distribution (this is the renewal-theoretic result, not specific to exponential service). Combining gives Little's-Law style: L_q = lambda * W_q and W_q = (lambda * E[S^2]) / (2 * (1 - rho)). Substitute E[S^2] = (1 + C_S^2) * S^2 and rho = lambda * S to get the closed form. The non-obvious step is the residual-service-time formula — it is what carries the variance information from the service distribution into the queueing formula. Every M/G/1 result reduces to this one observation: heavy-tailed service has heavy-tailed residual time, which dominates the wait.

Batch arrivals, MMPP, and the time-averaging trap

The most common production sin in arrival modelling is to compute lambda as a 24-hour-average and feed it into M/M/1. The 24-hour average for IRCTC might be 5,000 req/s; the Tatkal-window rate is 80,000 req/s. M/M/1 at the average says rho = 0.4 and predicts 30 ms response times; M/M/1 at the peak says rho = 6.4 (saturated) and predicts an unbounded queue. Neither matches reality — the MMPP formula, which integrates over the regimes weighted by the time spent in each, predicts the actual response-time pattern of "fine for 23 hours, terrible for 30 minutes". The fix is to either compute per-regime numbers (and provision for the worst regime) or to fit the MMPP transition matrix and use the MMPP/M/1 formulas directly.

The same trap recurs for batch arrivals. Scaling lambda by E[X] (treating a batch as E[X] separate Poisson arrivals) gets the mean offered load right but understates the variance. The batch arrives at one instant, then nothing for the inter-batch gap — the queueing multiplier sees the instantaneous rate during the batch (very high) not the time-averaged rate. The closed-form M[X]/M/1 result adds a (E[X(X-1)]) / (2 * E[X]) term to the mean wait — this is the batch-variance correction, and it can be 10-100x the M/M/1 wait for batches with even modest size variance. The Aadhaar UIDAI auth pipeline uses the M[X]/M/1 formula explicitly because its arrival pattern is a sequence of large batches from state-government bulk-auth runs; the M/M/1 calculation under-predicts queueing by 4-7x for that workload. The unifying lesson: any time you average over a regime that is structurally non-stationary, the average is the wrong number to feed into the queueing formula.

Multi-server elaborations and the Kingman fallback

The two-letter elaborations matter at the call-centre level of detail. M/M/c keeps an infinite buffer with c servers; the wait formula is Erlang-C, which gives the probability that an arrival has to wait at all. M/M/c/c (no buffer, c servers) is Erlang-B, which gives the probability of blocking — the arrival is rejected outright. M/M/c/K (c servers, K total positions) interpolates between the two. Telcos size their voice-call switches with Erlang-B — the assumption is that a busy signal is the failure mode, not waiting. Web-service capacity planning usually wants Erlang-C — the assumption is that requests can wait. Database connection pools at most companies are M/M/c/K — c = pool_size, K = pool_size + queue_depth, and the SQLAlchemy pool_size=20, max_overflow=10, pool_timeout=30 configuration is exactly an Erlang-mixed model. Knowing which model your system fits decides whether your sizing math is right.

When the closed-form models do not fit at all — non-Poisson arrivals, non-exponential service, multiple servers — the Kingman approximation R_q ~ S * (rho / (1 - rho)) * (C_A^2 + C_S^2) / 2 is the queueing-theoretic back-of-envelope. It is exact for M/M/1 and within 10-20% for most realistic distributions when rho > 0.5. It fails when arrivals are negatively correlated (the synchronised-retry case from earlier, where C_A < 1 but the correlation is what matters), and it fails when the service distribution has infinite variance — a true power-law with shape parameter < 2, which does happen in some workloads. The canonical example is web-page sizes, which Crovella showed in 1996 to be Pareto with shape ~1.4; for that workload no finite-variance approximation works and you have to simulate. For most production work, however, Kingman is the formula to memorise — it is the one number you can compute on a napkin that gets within a factor of 1.2 of any G/G/1 system you encounter.

A note on closed-loop systems

Everything above is open — arrivals come from outside the system at rate lambda independent of the system's state. Closed queueing systems have a fixed population N of customers cycling through; the arrival rate at any stage is determined by the throughput of the rest of the system. Closed networks are governed by the Mean Value Analysis (MVA) algorithm, which is iterative rather than closed-form. The most common production case is "N user sessions with think-time Z between requests" — this is the think-time model of capacity planning, used heavily in load-test design. Locust's user-count parameter is exactly this: N users, each with a wait_time, generating a closed-loop offered load. The closed-loop number you measure is not the open-loop lambda — converting requires solving the MVA equations, and many capacity-planning teams skip the conversion and end up sizing with the wrong number.

Why open-loop and closed-loop give different numbers at the same nominal lambda: in a closed system, when the service slows down, the arrival rate naturally throttles because each "user" is waiting for their previous response before sending the next request — the system is self-stabilising and never sees rho > 1. In an open system, arrivals keep coming at the same lambda regardless of how slow the service has become, so rho can exceed 1 and the queue grows unboundedly. A closed-loop benchmark with N=1000 and Z=1s might report p99 = 80 ms at "1000 req/s nominal", but the open-loop equivalent of that workload at 1000 req/s would have p99 in the seconds. This is the same coordinated-omission lineage from chapter 50: the closed-loop test under-reports the queueing because it gives the queue time to drain. For capacity planning, always convert closed-loop measurements to open-loop equivalents using X = N / (R + Z) (Little's Law applied to the closed network) and compare to the open-loop offered load you actually face in production.

A capacity-planning workflow that uses all four corrections

A practical recipe distilled from the Razorpay, Hotstar, and Zerodha capacity-planning teams that have run this discipline for several quarters. Step 1: classify the workload. For each service, write down which assumptions M/M/1 violates: bursty arrivals (yes/no), variable service (yes/no — if cache-fronted, yes), networked stages (yes/no — almost always yes), bounded buffer (yes/no — almost always yes at some layer). The classification is binary per-assumption and takes ten minutes per service. Step 2: measure the parameters. Mean and variance of service time per stage, mean and variance of inter-arrival time per stage, buffer size at every bounded stage. The variance numbers are the new measurements most teams skip. Step 3: build a simpy model. Each stage is a simpy.Resource; the model is 50-200 lines for most services. Run it across a sweep of lambda and compare to production at the current operating point. The model that matches within 30% is calibrated. Step 4: project. Use the calibrated model to answer "if lambda grows by 2x, what is the response-time curve and where do the drops happen?". The answer is not "run the load test" — it is "consult the model, then validate the prediction with one targeted load test". The model is cheap to consult, the load test is expensive to run; running the model first lets you skip the load tests that would obviously fail.

The yield of this workflow, in teams that have shipped it, is typically a 30-50% reduction in over-provisioning (because the model identifies which stages are actually the bottleneck, not which ones the team intuitively suspected) plus a near-zero rate of capacity-related incidents in the next quarter (because the model has already projected the cliff and the team has already provisioned past it). The cost is one engineer-week per service to build and calibrate the model, plus ongoing 1-2 hours per quarter to re-calibrate as the workload evolves. The ROI is consistently positive in any service with non-trivial revenue at risk to capacity events.

Reproducibility footer

# Reproduce this on your laptop, ~25 s
python3 -m venv .venv && source .venv/bin/activate
pip install simpy hdrh
python3 wall_mm1_violations.py

Then change bimodal_service to a tighter distribution (return random.lognormvariate(math.log(18), 0.2)) and re-run; watch the M/G/1 column collapse onto the M/M/1 column as C_S drops toward 1. That collapse is the Pollaczek-Khinchine formula's prediction in your terminal.

Where this leads next

The next build (Part 9 — parallel scaling) revisits the same questions from the other side: instead of asking "how does response time blow up at high rho?", it asks "how does throughput stop growing as you add cores?". The two questions are dual — Amdahl's law is queueing theory's serial-fraction term in disguise, and the Universal Scalability Law explicitly composes Amdahl's alpha (Amdahl-bound serialisation) with a beta (coherence-overhead) term that is the queueing-theoretic congestion of inter-core coordination. The corrections in this chapter — non-Poisson arrivals, non-exponential service, networked stages, finite buffers — all reappear in multi-core scaling under different names (lock-acquisition burstiness, variable critical-section length, pipelined inter-core handoff, bounded queue depth in concurrent data structures).

The deeper transition is from "queueing theory is what you reach for when M/M/1 doesn't predict your numbers" to "queueing theory is the lens through which every contention question becomes computable". Part 9 brings the same algebra to bear on cores; Parts 10 (I/O) and 11 (allocators) bring it to bear on disk queues and arena contention; Part 14 brings it to bear on capacity planning across the whole stack. The shape of the algebra does not change; only the names of lambda and mu change to match the substrate.

Three operational habits to internalise from this Wall. First: always check which assumption M/M/1 is making that your system breaks. If it predicts 200 ms and you measure 410 ms, the gap is one of the four corrections — find which one before adding capacity. Second: measure the second moment of service time and inter-arrival time, not just the means. The Pollaczek-Khinchine and Kingman formulas both need variance, and almost no production dashboard reports it. Add it — ten lines of Python over the access log is enough. Third: count drops at every bounded stage — TCP backlog, gRPC channel, Kafka topic, semaphore, connection pool. Drops are the M/M/1/K signature; without the drop counter, the dashboard lies to you about whether the system is healthy. The netstat -s | grep -i drop output, the gRPC channel's grpc_server_handled_total{grpc_code="RESOURCE_EXHAUSTED"} counter, the JVM's RejectedExecutionException rate — these are the metrics that surface what response-time histograms hide.

A subtler fourth habit: build a single simpy model of your service's queueing graph, with each stage parameterised by its measured lambda_i, S_i, Var(S_i), buffer_i. Run it across a sweep of rho and compare to production. The model that matches production within 30% is your capacity-planning oracle for the next quarter; the model that misses by 5x tells you which correction you have not yet applied. Both are useful — the first lets you predict, the second tells you what to measure next. Razorpay, Hotstar, and Zerodha all maintain such models internally; the model is small, the discipline of keeping it calibrated is the work.

A fifth habit, often overlooked: distinguish "the service is slow" from "the queue is deep". A high p99 with a near-empty queue means the service itself is slow — the fix is in the code or the dependencies, not in capacity. A high p99 with a deep queue means the offered load is too close to the queue's knee — the fix is more capacity or less arrival burstiness, not in the code. Most observability stacks report only response-time histograms, leaving you guessing which case you are in. Adding a per-stage queue-depth gauge to the dashboard — pool.queue.size for SQLAlchemy, getQueueLength() for ThreadPoolExecutor, MAX(now - enqueued_at) for Kafka consumer lag — surfaces the distinction immediately.

The closing observation: M/M/1 is not the model of any real system. M/M/1 is the coordinate system in which every real-system queueing question gets stated, derived, and compared. Knowing it deeply lets you read every paper and every architecture decision in the literature; not knowing it leaves you at the mercy of consultants who will tell you to add 30% capacity. The four corrections are not optional addenda; they are the engineering mathematics that turn the teaching model into a planning tool.

References