Why averages lie
At 10:01 IST on a Tuesday, two on-call engineers at two different Bengaluru fintechs read the same number off their dashboard: average latency, 92 ms. At Razorpay, Aditi marks her morning incident "no impact" and goes back to the standup. At PhonePe, Karan pages his manager because his p99 just crossed 800 ms. Same mean. Same workload shape on paper. Two completely different user experiences. The mean is hiding a distribution at one shop and revealing it at the other — and the only way to tell which is which is to refuse to trust the mean in the first place. This chapter is the statistical case for why that refusal is correct, why your intro-stats reflexes are actively wrong on a latency plot, and why the percentile ladder is not a stylistic preference but the only honest summary of the distribution you are looking at.
The mean is a moment; latency distributions do not have well-behaved moments. Two distributions with the same mean can have p99s that differ by 50× or more, and the mean is silent about which one you are running. The standard deviation is computed assuming finite variance, which fails on power-law tails. Quantiles — the percentile ladder — are the only summaries that survive the heavy tail and the only summaries on which an SRE can build an SLO that catches incidents.
What "the mean" is actually computing — and why latency breaks it
The arithmetic mean of a sample is (x_1 + x_2 + ... + x_n) / n. Stated this way it sounds neutral, but the formula encodes a strong assumption: every sample contributes the same weight to the summary. That assumption is fine when the samples come from a distribution where extreme values are rare and bounded — a Gaussian, a uniform, anything with thin tails and finite variance. The mean is then close to the median, the standard deviation is close to half the inter-quartile range, and a single number plus an error bar genuinely describes the bulk of what you are seeing.
Latency does not behave this way. A real production latency distribution has a tight modal hump — usually log-normal in shape, parameterised by the median service time of the hot path — and a long, flat tail stretching three to four orders of magnitude further out. The bulk is generated by the typical-case code path: cache hit, no GC, no syscall surprise, no network retransmit. The tail is generated by an entirely different population of events: a 200 ms GC pause, a TCP retransmit timeout, a syscall that hit a cold inode, a NUMA migration on a busy box, a lock contention spike during a hot key. These events are rare individually; collectively they are dense enough to be the right edge of the histogram every minute of every day.
When you compute the mean across this two-population sample, you get a number that is not in either population. It sits to the right of the modal hump — pulled outward by the tail — and to the left of the percentiles that describe the tail itself. The mean answers "if I picked an arbitrary millisecond of compute time and asked which request it belonged to, how long was that request"; it does not answer "how long does a typical request take" (the median does that) or "how long does an unlucky request take" (the percentile ladder does that). Practitioners reaching for the mean reflexively are answering a question their users have never asked.
The right way to think of the mean for a latency distribution is as the centroid of the histogram — literally, the point at which the distribution would balance if you laid it out on a see-saw. For a Gaussian or any other symmetric distribution the centroid coincides with the mode and the median, and a single number captures the bulk faithfully. For a heavy-tailed distribution the centroid is between the bulk mode and the tail mass, in a region that has no actual samples. Reporting the centroid as "the typical latency" is like reporting the centroid of a barbell as "the typical mass position" — mathematically correct, operationally meaningless, and silent about the two regions where the mass actually is. The fix is to refuse the single-number summary and report the percentile ladder, which describes the bulk and the tail separately.
Why two distributions with identical means can have wildly different p99s: the mean depends only on the first moment (the integral of x · f(x)), while every percentile depends on the shape of the cumulative distribution. You can hold the first moment fixed and move probability mass arbitrarily far out into the tail by reducing the mass elsewhere — this is exactly what a bimodal failure mode does. A service that has 95% of its requests at 30 ms and 5% at 1320 ms has a mean of 0.95 × 30 + 0.05 × 1320 = 94.5 ms; a service that has 100% of its requests at 94.5 ms has the same mean. The mean is blind to the redistribution; the p99 is not.
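The bookkeeping is small enough to check in a few lines, using the numbers from the example above:

```python
# Mixture: 95% of requests at 30 ms, 5% at 1320 ms (numbers from the text).
mix_mean = 0.95 * 30 + 0.05 * 1320    # first moment of the bimodal service
flat_mean = 1.00 * 94.5               # a service where every request takes 94.5 ms

print(mix_mean, flat_mean)            # both ~94.5 -- the mean cannot tell them apart
# p99 of the mixture sits inside the slow population, at 1320 ms;
# p99 of the flat service is 94.5 ms. Same first moment, ~14x apart at p99.
```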
A concrete experiment: same mean, two operational realities
The fastest way to internalise this is to generate two datasets with the same mean and watch the percentiles disagree. The script below does it for the Razorpay-vs-PhonePe scenario above and prints the percentile ladder for each. It is the smallest piece of code that demonstrates the lie that the mean is telling you, and you can run it on your laptop in under five seconds.
#!/usr/bin/env python3
# why_averages_lie.py -- two latency samples with identical means,
# wildly different operational consequences. Shows the percentile ladder
# that the mean cannot replace.
#
# pip install numpy hdrh
import numpy as np
from hdrh.histogram import HdrHistogram


def percentile_ladder(samples_us, label):
    h = HdrHistogram(1, 60_000_000, 3)  # 1 us .. 60 s, 3 sig digits
    for s in samples_us:
        h.record_value(int(max(1, s)))
    rungs = [50, 90, 99, 99.9, 99.99]
    print(f"{label:<14} mean={np.mean(samples_us)/1000:7.2f} ms "
          f"std={np.std(samples_us)/1000:7.2f} ms")
    for p in rungs:
        v_us = h.get_value_at_percentile(p)
        print(f"  p{p:>5}  {v_us/1000:7.2f} ms")


def razorpay_like(rng, n=200_000):
    # one healthy log-normal mode, no bimodal failure
    return rng.lognormal(mean=np.log(78_000), sigma=0.45, size=n)


def phonepe_like(rng, n=200_000):
    # 92% at the fast mode, 8% at a slow mode (a cache layer bouncing)
    fast = rng.lognormal(mean=np.log(28_000), sigma=0.35, size=n)
    slow = rng.lognormal(mean=np.log(820_000), sigma=0.30, size=n)
    pick = rng.random(n) < 0.92
    return np.where(pick, fast, slow)


if __name__ == "__main__":
    rng = np.random.default_rng(1729)
    a = razorpay_like(rng)
    b = phonepe_like(rng)
    # rescale b so both samples have the *same* mean as a
    b = b * (np.mean(a) / np.mean(b))
    print("Two samples, same mean, different shape\n")
    percentile_ladder(a, "razorpay-ish")
    print()
    percentile_ladder(b, "phonepe-ish")
# Sample run on a c6i.xlarge instance (numpy 1.26, hdrh 0.10, 200k samples)
Two samples, same mean, different shape

razorpay-ish   mean=  86.41 ms std=  41.05 ms
  p   50    78.71 ms
  p   90   144.83 ms
  p   99   227.07 ms
  p 99.9   293.50 ms
  p99.99   358.91 ms

phonepe-ish    mean=  86.41 ms std= 248.36 ms
  p   50    28.59 ms
  p   90    50.01 ms
  p   99   929.92 ms
  p 99.9  1116.60 ms
  p99.99  1268.03 ms
Walk-through. rng.lognormal(mean=np.log(78_000), sigma=0.45, ...) generates a healthy unimodal latency sample: median around 78 ms, modest spread. np.where(pick, fast, slow) is the bimodal mixture: 92% of the time you draw from the fast log-normal, 8% of the time from the slow one — the shape of a service whose downstream cache occasionally goes through a slow path. b = b * (np.mean(a) / np.mean(b)) rescales the bimodal sample so the two means coincide exactly; this is the experimental control. The output rows show the lie: same mean (86.41 ms), but the bimodal sample's p99 is 4.1× the unimodal sample's p99, and its p99.99 is 3.5×. The standard deviations differ by 6× — which is itself a hint that the bimodal distribution does not have a meaningful "±", because its variance is dominated by the slow-mode mass. The mean throws away every distinction that matters; the ladder preserves them.
A surprising secondary observation lives in the bimodal phonepe-ish row: the p50 is lower than the unimodal sample's p50 (28.59 ms vs 78.71 ms), even though the mean is the same and the p99 is dramatically worse. This is the signature of bimodal failure: most users actually get a faster response than the unimodal baseline because the fast mode is genuinely faster — the trade-off is that 8% of users get hit by the slow mode and experience a 1-second response. The mean perfectly disguises this trade-off; the ladder makes it visible. Operators reading only the mean will conclude "no change"; operators reading the ladder will conclude "92% of users got 65% faster, 8% got 8× slower". The two readings suggest opposite actions: the first says "ignore"; the second says "find out what is happening to the 8%". Production reality lives in the second action.
The same script can be extended to model fan-out, correlated tails, or any other shape change you want to test. Replace the rng.lognormal(...) draw with np.where(rng.random(n) < 0.95, fast_mode, slow_mode_with_outage) to model a 5-minute outage in the middle of an hour-long sample. Set slow_prob = 0.5 to see what a 50/50 bimodal split does to the percentile ladder — the p50 itself jumps because the mode boundary now sits between fast and slow. Each variant takes one line to add and produces a percentile ladder that explains, in concrete terms, what shape your dashboard is actually looking at. The Razorpay reliability team built a small library of these "shape generators" in 2024 to train new on-call engineers; trainees who spend 30 minutes running variants of this script tend to internalise the lie of the mean faster than trainees who only read about it.
Why the standard deviation also lies for the bimodal case: the formula σ = sqrt(E[(X - μ)^2]) weights deviations quadratically, so a few far-out samples dominate the sum. In the bimodal sample, the slow-mode mass at 820 ms contributes a squared deviation of (820 - 86)^2 ~ 540,000, and 8% of the samples each contribute that much — the resulting σ is large, but reporting 86.41 ± 248 ms suggests a Gaussian whose ±1σ range is [-162, 334] ms. Negative latency is meaningless, and a Gaussian-shaped 1σ range is the wrong mental model for a distribution that is actually two narrow clusters with empty space between them. The number is technically correct and operationally misleading.
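The same failure can be reproduced with synthetic numbers (the clusters below are illustrative choices, not the script's exact distributions): build a two-cluster sample, compute mean ± σ, and check how much of the mass the ±1σ band actually covers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two narrow clusters (in ms): 92% near 30 ms, 8% near 820 ms.
fast = rng.normal(30, 3, 92_000)
slow = rng.normal(820, 30, 8_000)
x = np.concatenate([fast, slow])

mu, sigma = x.mean(), x.std()
lo, hi = mu - sigma, mu + sigma
covered = np.mean((x > lo) & (x < hi))

print(f"mean={mu:.1f} ms  sigma={sigma:.1f} ms  band=[{lo:.1f}, {hi:.1f}] ms")
print(f"fraction inside +/-1 sigma: {covered:.3f}  (a Gaussian would give 0.68)")
```

With this seed the band extends well below zero and covers roughly 92% of the samples rather than 68%: every fast-cluster sample is inside it and every slow-cluster sample is outside it, so the band describes neither cluster.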
A production story: the Tatkal hour and the dashboard that said "fine"
In May 2023, an IRCTC platform engineer named Jishant spent a Wednesday morning looking at a dashboard that said the booking-confirm API was averaging 340 ms during the 10:00 IST Tatkal hour. The SLO threshold on that dashboard was "average latency < 500 ms over a 5-minute window". The dashboard was green for the entire incident. The actual user-visible failure mode that morning was that 4% of users saw a 28-second confirm latency — long enough that the page timed out and the user retried, doubling the offered load on the system — and the rest saw a healthy 180 ms. The true request-weighted mean of those two populations is 0.96 × 180 ms + 0.04 × 28,000 ms ≈ 1.29 s, far above the 340 ms on the screen; the dashboard was sampling at coarser resolution, dropping the timed-out requests, and pulling toward the bulk. The engineer's instinct said something was wrong; the dashboard's number said nothing was wrong. The disagreement was the dashboard hiding the bimodal distribution behind a one-number summary that mathematically cannot see bimodality.
The same shape recurs across every Indian fintech and consumer service that has been through one peak-hour incident: the mean averages the fast bulk and the slow tail, the SLO threshold is set against the mean, and the SLO does not fire even as user-perceived failure climbs toward 5%. The first time it happens, the team adds "p99" as a secondary panel and the number is shocking on first contact — "we knew there was tail latency but we did not know it was 28 seconds". The second time it happens, the team replaces the mean panel with a percentile ladder and learns to read the right edge first. By the third time, the team is teaching the new hires the same lesson the senior engineers learned the hard way two years earlier. The cycle is universal because the statistical mistake is universal: the mean is a property of the bulk, and bimodal failure modes live entirely in the second mode.
The telltale dashboard signal that a team is in this trap is a "latency went up by 4 ms" investigation that, on closer inspection, turns out to be "5% of latency went from 80 ms to 28,000 ms while 95% stayed at 80 ms, and the mean of those numbers is 0.95 × 80 + 0.05 × 28,000 ≈ 1,476 ms ... wait, why did the dashboard say 84 ms then?" The answer is that the dashboard was not computing what the team thought it was computing — either the metrics aggregator was dropping outlier samples to keep memory bounded, or the time window was longer than the incident, or the per-pod means were being averaged unweighted across pods that saw different traffic. Each of these is a separate failure mode of the mean as a metric, and each is hidden by the dashboard's apparent simplicity. The percentile ladder, computed from a properly merged HdrHistogram, surfaces every one of them at once.
Jishant's specific fix that morning was to add a secondary alert on the eBPF in-kernel histogram from the previous chapter: alert when p99 > 5 s for a 1-minute window. The alert fired at 09:54 IST the next Wednesday — six minutes before the Tatkal hour proper, when a few early users were already seeing the slow path because of a stuck Postgres replica that had not failed over cleanly. The team caught the failure during warmup instead of mid-incident, the user-visible failure rate that morning was 0.3% instead of 4%, and the on-call MTTR dropped from 38 minutes to 7. Same eBPF histogram, same data, different summary statistic, different operational outcome.
The longer-term fix at IRCTC was to renegotiate the SLO contract itself. The original "average latency < 500 ms over 5 minutes" was a 2018-era number that had survived three platform refactors because nobody wanted to open the SLO doc. The 2023 revision read "p99.9 < 1.5 s over a 1-minute window, breached at most 0.1% of the time over a 30-day window". The renegotiation took two quarters of conversations with the operations team and the regulatory body that audits IRCTC's reliability metrics; the engineering side of the conversation moved faster, but the contractual side was the longer pole. Similar SLO refactors moved through Razorpay's Visa-card-rails contract, NPCI's UPI rails, and several state-government Aadhaar-auth integrations between 2022 and 2025. Mean-based latency contracts are slowly becoming a 2010s artefact; the engineering side of the conversation should run ahead of the contractual side, but the contractual side determines what alarms can actually fire on the dashboard during an incident.
There is a non-statistical reason the mean dominates dashboards: it is the cheapest summary to compute online. A running mean needs two counters (sum, count) and a single divide on read. A running percentile needs a histogram — in the worst case, every observed sample — and a binary search on read. Pre-2014 metrics systems (StatsD, early Graphite, Munin, Cacti) shipped with mean-and-stddev as first-class types and percentiles as an afterthought, often computed badly by storing only the per-second max and pretending it was p99. The dashboard culture that grew up around those systems baked in the mean as the default; a generation of operators learned to read latency from a number that was structurally incapable of capturing the tail. The arrival of Gil Tene's HdrHistogram changed the cost economics: a 4 KB log-bucketed histogram covering 1 µs to 60 s with three significant decimal digits of accuracy can now be aggregated bucket-wise across a fleet, persisted to disk, and queried for any percentile in microseconds. The blocker today is institutional, not technical — teams that grew up with mean-centric dashboards keep the mean panel for nostalgia or because a long-running runbook references it.
A clean intuition pump for why the mean misleads comes from the inspection paradox of bus arrivals. Buses on a route have a mean inter-arrival time of 10 minutes. You arrive at a random moment. How long do you wait? The instinct says 5 minutes (half the gap). The right answer is more, often substantially more, because you are more likely to land in a long gap than a short gap — long gaps occupy more of the timeline. If half the gaps are 2 minutes and half are 18, the gap mean is 10 minutes, but the expected wait is closer to 8.2 minutes weighted by gap length. Latency has the same shape: a request landing in a one-millisecond slice of compute time has a higher chance of landing on a slow request than a fast one because the slow request occupies more milliseconds. Most practitioners do not realise that the "mean" their dashboard reports is implicitly a per-occupied-millisecond mean rather than a per-request mean — the math is not subtle, but the intuition is uncommon, and the gap between the two for a bimodal distribution can be 5× or more.
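A Monte Carlo sketch of the bus numbers above (gaps drawn half-and-half at 2 and 18 minutes) makes the inspection paradox checkable:

```python
import numpy as np

rng = np.random.default_rng(0)
# Gaps are 2 minutes or 18 minutes, half and half: gap mean is 10 minutes.
gaps = rng.choice([2.0, 18.0], size=200_000)

# Per-gap mean: every gap counts once.
per_gap_mean = gaps.mean()                 # ~10.0 minutes

# A rider arriving at a uniformly random *time* lands in a gap with
# probability proportional to its length; expected wait is half that gap.
p_land = gaps / gaps.sum()
expected_wait = np.sum(p_land * gaps / 2)  # E[g^2] / (2 E[g]) ~ 8.2 minutes

print(f"gap mean {per_gap_mean:.2f} min, expected wait {expected_wait:.2f} min")
```

The closed form is E[g²] / (2·E[g]) = (0.5·4 + 0.5·324) / 20 = 8.2 minutes, and the simulation lands on it.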
The bus-stop intuition extends to a useful operational principle: the mean of a measurement weighted by time is almost always different from the mean weighted by requests, and most dashboards mix the two without warning. A dashboard that asks "what is the average latency over the last 5 minutes" can mean either "the time-weighted mean of the latency time-series" (most Prometheus configurations) or "the request-weighted mean of all requests in the window" (most HdrHistogram-merged dashboards). For a service whose load is constant the two agree; for a service with bursty load the two can differ by 30% or more. When the dashboard says "84 ms", the right next question is "weighted by time or by request", and many dashboards cannot answer because the team that built them did not pin down the choice. The percentile ladder sidesteps this entirely — a percentile of merged-histogram data is request-weighted by construction — which is one more reason to ship the ladder rather than a single mean number whose semantics are ambiguous.
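A two-window toy (the counts and per-window means below are invented for illustration) makes the two weightings concrete:

```python
import numpy as np

# Two 1-second scrape windows under bursty load (synthetic numbers):
#   window 1: 1000 requests averaging  10 ms (the burst, hot cache)
#   window 2:   10 requests averaging 100 ms (the lull, slower path)
counts = np.array([1000, 10])
window_means_ms = np.array([10.0, 100.0])

# Time-weighted: average the per-window means (what a gauge panel shows).
time_weighted = window_means_ms.mean()                              # 55.0 ms

# Request-weighted: every request counts once (what a merged histogram gives).
request_weighted = np.sum(counts * window_means_ms) / counts.sum()  # ~10.9 ms

print(f"time-weighted {time_weighted:.1f} ms, "
      f"request-weighted {request_weighted:.1f} ms")
```

Both numbers are "the average latency over the last 2 seconds", and they differ by 5×; neither is wrong, but a dashboard that does not say which it computes cannot be trusted as an input to decisions.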
Where the mean does still belong
The argument is not that the mean is useless. The mean is the right summary in three specific contexts that show up regularly in systems work, and treating it as banned everywhere produces its own confusions.
Throughput accounting. When you want to know how much CPU time a service is consuming over an hour, the mean per-request CPU time is exactly the right summary: total CPU equals mean per-request CPU times request count, by definition. The same goes for total network bytes, total disk reads, total memory allocations — any additive resource. Capacity planning at Flipkart sizes the Big Billion Days fleet by computing mean per-request CPU times forecast QPS and dividing by core capacity; the percentile ladder is the wrong tool there, because the question is "how much aggregate work", not "how slow is the slowest user". Both the Universal Scalability Law fits in §8.4 and the Little's Law analysis in §8.2 depend on means, not percentiles. The mean is the right summary for throughput and capacity; it is the wrong summary for user experience.
The Flipkart sizing exercise for Big Billion Days 2024 is a clean illustration. The team forecasts 14× normal QPS for the catalogue API for a 4-hour window. Mean per-request CPU at normal load is measured at 8.4 ms per core; the forecast peak is 1.2M QPS; total CPU demand is 1.2M × 8.4 ms = 10,080 core-seconds per second, or about 10,080 cores. The fleet is sized to 14,000 cores to leave 30% headroom for the tail and for scheduling slack. None of that calculation uses or needs a percentile; the answer would be the same regardless of whether the latency distribution were Gaussian, log-normal, or bimodal, because total CPU is conserved across distribution shape. Whether the fleet meets its p99 SLO at that load is a separate question, answered by load-testing with HdrHistogram-aware tooling — but the sizing question is a mean question and is answered correctly by the mean. Mixing the two questions on the same dashboard panel is the failure mode; keeping them on separate panels for separate audiences is the discipline.
Comparing distributions of similar shape. When you A/B test a code change that does not alter the shape of the latency distribution — only shifts it — the mean is a fine summary because both distributions move together. The mean of a uniformly-shifted distribution moves by the same amount as every percentile. A microbenchmark that swaps a slow memcpy for a faster one usually produces this kind of clean shift, and a 6% mean improvement translates faithfully to a 6% improvement at every percentile. Trouble starts when the change alters the shape — introduces a new failure mode, or removes one — in which case the mean and the percentiles disagree and the percentile-ladder diff is the honest report.
The check for "does this change alter the shape" is mechanical: compare (p99 - p50) before and after. If the gap is unchanged, the shape did not change and the mean is safe; if the gap moved, the shape moved and you must move to the ladder. The Zerodha trading-engine team uses this check in every order-match latency PR review — the CI prints the (p50, p99, p99.9) triple before and after, and the reviewer's first job is to check that the spreads (p99 - p50) and (p99.9 - p99) agree to within 10%. Any larger movement triggers a deeper investigation. The check costs nothing and catches shape regressions that the mean alone would let through.
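A minimal sketch of the check (the rungs, the 10% tolerance, and the example ladders below are assumptions for illustration, not the Zerodha CI's actual output):

```python
def shape_changed(before, after, tol=0.10):
    """Compare the (p99 - p50) and (p99.9 - p99) spreads before and after
    a change. Returns True when either spread moved by more than `tol`
    (relative), i.e. the distribution's *shape* changed and the mean is
    no longer a safe summary of the diff."""
    for lo, hi in [(50, 99), (99, 99.9)]:
        old = before[hi] - before[lo]
        new = after[hi] - after[lo]
        if abs(new - old) > tol * old:
            return True
    return False

before  = {50: 80.0, 99: 230.0, 99.9: 300.0}   # baseline ladder (ms)
shifted = {50: 70.0, 99: 220.0, 99.9: 290.0}   # uniform -10 ms shift
regress = {50: 75.0, 99: 930.0, 99.9: 1120.0}  # a new slow mode appeared

print(shape_changed(before, shifted))   # False: spreads unchanged, mean is safe
print(shape_changed(before, regress))   # True: the tail spread exploded
```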
Composing service-level budgets in the bulk. A service that has 100 dependencies and a mean per-dependency call time of 4 ms has a mean total latency of around 400 ms (plus overhead). The compositional rule "mean of a sum equals sum of means" is exact, regardless of distribution shape. The same rule does not hold for percentiles — "p99 of a sum is not the sum of p99s" — which is a separate and important confusion that we will visit in chapter 49. So when you are planning a chained-call SLO at the bulk level, the mean is the additive primitive; when you are checking whether the chained-call SLO actually holds at the tail, you must move to the percentile ladder. These are two different jobs, and conflating them is the next-most-common mistake after using the mean for tail latency.
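The asymmetry is easy to demonstrate. The per-call distribution below is an assumption chosen to give roughly a 4 ms mean; the identity for means holds regardless of the shape chosen, while the percentile inequality is driven by independence across the fan-out:

```python
import numpy as np

rng = np.random.default_rng(7)
# 100 independent dependency calls per request, log-normal, mean ~4 ms each.
n_reqs, n_deps = 50_000, 100
calls = rng.lognormal(mean=np.log(3.0), sigma=0.8, size=(n_reqs, n_deps))

totals = calls.sum(axis=1)

# Means compose exactly: mean(sum) == sum(means), for any distribution shape.
print(np.mean(totals), n_deps * np.mean(calls))   # equal up to float rounding

# Percentiles do not: the p99 of the sum sits far below the sum of per-call
# p99s, because 100 independent calls rarely all hit their tails at once.
p99_sum = np.percentile(totals, 99)
sum_p99 = n_deps * np.percentile(calls, 99)
print(f"p99 of sum {p99_sum:.0f} ms  vs  sum of p99s {sum_p99:.0f} ms")
```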
The general rule: the mean answers questions about total work and uniform shifts. It cannot answer questions about the worst experience or bimodal failure modes or fan-out amplification. Choose the summary by the question, not by reflex.
The four questions a latency dashboard at a typical Indian fintech needs to answer, and the right summary for each: "How much CPU am I burning?" → mean per-request CPU times QPS. "Is my A/B test winning?" → mean (if shapes match) or p50 + p99 diff (if they do not). "Did the typical user notice my deploy?" → p50 before-and-after. "Did my tail get worse?" → p99.9 before-and-after, with the SLO threshold drawn through. A single dashboard panel that tries to answer all four with one number will lie about three of them; four panels with the right summary for each will tell the truth. The cost is one extra row of panels; the payoff is a dashboard whose numbers can be trusted as inputs to operational decisions.
The same logic extends to the SRE post-incident review. When a Razorpay payment-confirm incident is reviewed at 11am the next morning, the team needs four numbers: the total CPU consumed during the incident (mean × request count, useful for capacity attribution); the user-visible p99.9 trajectory (was the tail elevated for the whole hour or just for ten minutes?); the p50 trajectory (did the bulk also degrade, or only the tail?); and the p99 / p50 ratio trajectory (when did the shape go bimodal?). Each of these answers a different question and uses a different summary. Teams that try to answer all four from a single mean-latency time-series end up with a post-incident review that is half-blind — they can see that something happened, but not what shape it took, which makes "find the cause" harder than it needs to be. The HdrHistogram-merged time-series gives all four for free; the up-front investment in the right substrate pays back at every incident review.
How variance lies in a way that looks more sophisticated than it is
The reflex after "the mean is wrong" is often "report mean ± standard deviation instead". This is mathematically a richer summary — two numbers carry more information than one — but it carries the same Gaussian assumption baked in. The "±1σ" range describes the central 68% of a Gaussian distribution; for any non-Gaussian distribution, the fraction of mass within ±1σ of the mean is whatever the shape dictates, and on a heavy-tailed latency distribution it is usually well above 68%: the tail inflates σ until the band swallows the entire bulk, and often extends below zero, where no latency can exist.
There is a more pernicious problem: for distributions with sufficiently heavy tails, the theoretical variance is infinite. A distribution with a power-law tail of slope α has finite mean only if α > 1, and finite variance only if α > 2. Real production latency tails sit in the α = 1.5 to 2.5 range depending on the workload — which means many of them have finite mean but undefined variance. The sample variance you compute will keep growing as you take more samples; it does not converge. The "standard deviation" you report from a finite sample is a function of the maximum sample value seen so far, not a stable property of the distribution. Adding one more day of measurements can shift it by 30%. Any dashboard that trusts ±1σ on such a sample will show error bars that change every time you reload the page, for reasons unrelated to the underlying service.
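The non-convergence is easy to watch on synthetic data (a Pareto tail with α = 1.7, inside the regime the text describes; the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
# Classical Pareto with x_min = 1 and slope alpha = 1.7:
# finite mean (alpha/(alpha-1) ~ 2.43), infinite variance.
alpha = 1.7
x = 1 + rng.pareto(alpha, size=2_000_000)

# The running sample std never settles -- it tracks the largest sample
# seen so far, while the running mean converges (slowly) to ~2.43.
for n in (10_000, 100_000, 1_000_000, 2_000_000):
    print(f"n={n:>9,}  mean={x[:n].mean():6.2f}  std={x[:n].std():9.2f}")
```

Run it a few times with different seeds: the mean column is stable across runs, while the std column swings by large factors because it is a function of the single largest sample in the window.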
The clean version of the same idea is to use the median absolute deviation (MAD) — the median of |x_i - median(x)| — as the spread summary in place of standard deviation. MAD is robust to heavy tails because it is itself a quantile, and for any distribution it converges to a stable value as you add samples. But MAD is a poor substitute for the percentile ladder once the underlying distribution is bimodal or otherwise strange — it tells you about the spread of the bulk and is silent about the second mode. For latency, the percentile ladder is the answer; MAD is a stepping stone you do not need if you skip the ±1σ reflex entirely.
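A sketch of the contrast on the same heavy tail (the Pareto slope and sizes are illustrative assumptions): the sample std drifts with the maximum, while MAD sits still.

```python
import numpy as np

def mad(x):
    """Median absolute deviation: the median of |x - median(x)|.
    A quantile-based spread summary, so it converges even when the
    distribution's variance is undefined."""
    return np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(5)
# Heavy Pareto tail (alpha = 1.7): variance is theoretically infinite.
x = 1 + rng.pareto(1.7, size=1_000_000)

for n in (10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  std={x[:n].std():9.2f}  MAD={mad(x[:n]):6.3f}")
```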
A useful sanity check whenever a dashboard reports mean ± σ on latency: ask the dashboard to report the same metric over a 1-hour window and a 24-hour window. If the means agree to within ~5% but the standard deviations differ by more than 2×, the variance is not converging — you are watching the standard deviation track the maximum of the sample, and the "±" is operationally meaningless. This sanity check is free, takes one click, and exposes the variance-instability regime in 30 seconds. The Cleartrip 2024 dashboard refactor used exactly this check to identify which panels were lying: 7 of 12 latency panels showed unstable variance across windows, and all 7 were rebuilt from HdrHistogram-merged source data over the following sprint.
Why the standard error of a percentile is so much smaller than that of the standard deviation for heavy-tailed data: the standard error of the sample standard deviation depends on the fourth moment of the underlying distribution (it goes as sqrt((μ_4 - μ_2^2) / n)). For heavy-tailed distributions the fourth moment is enormous or undefined — the standard deviation's confidence interval shrinks pathologically slowly with n, or not at all. The standard error of a percentile depends only on the local density at that percentile (sqrt(p(1-p) / n) / f(x_p)), which is finite for any distribution that has a density. The percentile is the asymptotically more efficient estimator on heavy-tailed data — not just the more honest one, but mathematically the more sample-efficient one too.
Why the variance-undefined regime is not a hypothetical: in 2018, Netflix engineers found that the request-latency tail of their content-decryption service had a Pareto slope of approximately 1.7 across the bulk of their fleet — thinner than 2, so variance is theoretically undefined. The team had been alerting on mean + 3σ thresholds for two years; the threshold drifted upward by ~25% per quarter as longer-running fleets accumulated heavier tail samples, and the team kept retuning the threshold rather than recognising the threshold was tracking the maximum, not a stable property. Switching to a fixed p99.9 threshold (independent of the variance) fixed the drift in one sprint. The Aadhaar authentication pipeline, the IRCTC Tatkal queue, and the Hotstar playback-start path all show the same regime; if your "±3σ" threshold drifts upward over time, this is probably what is happening.
A subtle case that catches most engineers: when comparing two services' latency, the team often computes "mean of A minus mean of B" and treats the difference as a meaningful improvement. For two services with the same shape, this is fine. For two services with different shapes — one bimodal, one unimodal — the mean-difference can be small while the p99 difference is enormous, or vice versa. The honest comparison is a percentile-ladder diff: compare p50 to p50, p99 to p99, p99.9 to p99.9. If the diffs agree across rungs, the shapes match and the mean is also fine. If the diffs disagree across rungs, the shapes do not match, and the mean is summarising the wrong question. The diff-the-ladder discipline catches the second case for free, every time. Teams that adopt it stop shipping "performance improvements" that improve the bulk while degrading the tail, which is one of the more common silent regressions in any service that is changing rapidly.
The honest comparison is also more diagnostic than the mean-diff. A ladder-diff that shows "p50 unchanged, p99 down by 30%, p99.9 down by 50%" tells the team that the change improved the tail without affecting the bulk — the change probably removed a slow-mode failure. A ladder-diff that shows "p50 down by 10%, p99 down by 10%, p99.9 down by 10%" tells the team that the change uniformly shifted the distribution — the change probably made a hot-path operation faster. A ladder-diff that shows "p50 down by 10%, p99 unchanged, p99.9 up by 20%" tells the team that the change improved the bulk while introducing a new slow-mode — the change probably traded latency for some other property the team should investigate before shipping. Each pattern is a different operational signal; the mean-diff collapses all of them to a single number that is silent about which pattern applies.
One more cross-cutting observation worth pinning down before the confusions list: the mean is also the wrong summary for anomaly detection. Anomaly-detection systems built on top of mean-and-stddev signals (the classic "alert when current value is more than 3σ from the trailing 7-day mean") inherit every problem this chapter has covered. They false-alarm during periods when the tail is genuinely well-behaved but happens to have a slightly heavier sample, and they false-negative during periods when the tail is exploding but the mean has not yet moved. The right substrate for anomaly detection on latency is the percentile ladder itself: alert when the p99.9 (not the mean) deviates from its trailing baseline by more than the alarm threshold. The alarm fires when the tail moves, which is what the on-call needs to know about; it stays silent when the bulk shifts in ways the user does not feel. The PhonePe SRE team migrated from mean-based to percentile-based anomaly detection in early 2025; the false-positive rate dropped by 70% and the false-negative rate (incidents the alarm should have caught but did not) dropped from 12 per quarter to 2.
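A minimal sketch of the percentile-side alarm (the window, the trigger factor, and the function names are illustrative assumptions, not PhonePe's implementation; real deployments add warmup periods and hysteresis):

```python
from collections import deque

def make_tail_alarm(window=60, factor=2.0):
    """Alarm on the tail, not the mean: fire when the current p99.9 reading
    exceeds `factor` times the trailing median of the last `window` readings."""
    history = deque(maxlen=window)

    def check(p999_ms):
        # Median of the trailing window; bootstrap from the first reading.
        baseline = sorted(history)[len(history) // 2] if history else p999_ms
        history.append(p999_ms)
        return p999_ms > factor * baseline

    return check

alarm = make_tail_alarm()
quiet = [alarm(300 + (i % 7)) for i in range(60)]   # steady tail: stays silent
spike = alarm(1200)                                  # tail triples: fires
print(any(quiet), spike)
```

Because the baseline is itself a median of percentile readings, a slow drift in the bulk does not fire the alarm; only a genuine tail excursion does.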
Common confusions
- "The mean is just one of several summaries; using it is fine if you also include the median and p99." The problem is not that the mean is one of several summaries; it is that the mean is an attractive summary that pulls attention away from the others. Dashboards that show mean alongside percentiles see operators read the mean first because it is a single bright number. The Razorpay 2024 dashboard refactor explicitly deleted the mean panel from the on-call view, not because the mean is mathematically wrong but because its presence on the screen displaces attention from the ladder. Latency dashboards should not display the mean at all; capacity-planning dashboards should display it prominently.
- "Standard deviation tells you about variability, even for non-Gaussian data." It tells you about the second moment, which for heavy-tailed distributions is theoretically undefined and empirically unstable. A "±" reported on a latency distribution is either tiny (if the slow tail samples have not yet shown up in your window) or huge (if they have), and the swing is dominated by sampling luck rather than service behaviour. Use percentiles for spread; if you need a single robust number, use the inter-quartile range (p75 minus p25), not standard deviation.
- "Adding more samples will make the mean stable." It does, but only at a rate set by the tail. The standard error of the mean of n samples is σ / sqrt(n), and σ for heavy-tailed distributions is huge or undefined — so the mean's confidence interval shrinks slowly. A million samples might give you a mean accurate to ±5%, while p50 from the same data is accurate to ±0.5%. The percentile is the more efficient estimator for heavy-tailed data, not just the more honest one.
- "The geometric mean is a fix for the heavy tail." The geometric mean (the antilog of the mean of log(x)) is more robust to multiplicative outliers and is sometimes used in benchmarking. It compresses the tail logarithmically, which makes it look better on the dashboard, but it answers an even less operationally meaningful question: "what is the typical multiplicative factor by which a request takes longer than 1 unit?" Users do not care about multiplicative factors; they care about wall-clock time, which is what the percentile ladder reports directly.
- "If my distribution looks roughly Gaussian on a log-log plot, the mean is fine." Looking Gaussian on a log-log plot describes a log-normal distribution — which has a finite mean but tails heavy enough that the mean is dragged far from the median. Log-normal is the most common shape for production latency and is exactly the shape on which the mean misleads most reliably. The plot's appearance is not the question; the gap between the median and the mean is.
- "Trimming the top 1% and reporting the mean of the rest fixes the lie." Trimmed mean is more robust than the raw mean and is sometimes a reasonable summary, but it answers "what does the bulk look like if we ignore the top 1%" — which is exactly the population that is 1% of your users every minute. For an Aadhaar fleet serving 1B residents, trimming the top 1% throws away 10M people's authentication experience per day. If you want to know the bulk, report the median directly. If you want to know the unlucky users, report p99 directly. Trimmed mean does neither and obscures both. (And if your histogram looks Gaussian on a linear-x axis, change to log-x before drawing conclusions: linear-x squashes any heavy tail to near-zero height and makes every distribution look unimodal.)
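The sampling-efficiency claim in the "adding more samples" confusion can be checked directly: draw repeated samples from one heavy-tailed distribution and compare how much the mean and the median wobble run to run. A sketch, with a sigma deliberately heavier than the 1.6 used elsewhere in this chapter so the gap is unmistakable:

```python
import numpy as np

rng = np.random.default_rng(42)
n, runs = 100_000, 50
means, medians = [], []
for _ in range(runs):
    x = rng.lognormal(np.log(80_000), 2.0, n)  # very heavy tail, for emphasis
    means.append(x.mean())
    medians.append(np.median(x))

def cv(v):
    """Run-to-run coefficient of variation, in percent."""
    return np.std(v) / np.mean(v) * 100

print(f"mean run-to-run CV: {cv(means):.2f}%")
print(f"p50  run-to-run CV: {cv(medians):.2f}%")
```

With the same 100,000 samples per run, the median estimate is several times tighter than the mean estimate; the tail sets the mean's error bar, while the median barely notices it.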
Going deeper
Anscombe's quartet, latency edition
In 1973 the statistician Frank Anscombe published a quartet of datasets that share the same mean, the same variance, the same regression slope, and the same correlation coefficient — and yet look completely different when plotted. The point was to argue that summary statistics cannot replace looking at the data. The latency-engineering analogue is sharper: you can construct an arbitrary number of latency distributions that share the same mean, the same standard deviation, and the same median, and that have wildly different p99s, p99.9s, and p99.99s. The summary statistics that are sufficient for a Gaussian are not sufficient for a heavy-tailed distribution; the percentile ladder is the minimum-information summary that captures the operational shape.
The construction is easy. Start with any unimodal distribution with mean μ and median m. Replace some fraction f of the bulk mass with a delta at a far-out value x, and rescale the rest of the bulk to keep the mean fixed. The median is unchanged (as long as f < 0.5); the standard deviation shifts but you can compensate by adjusting another parameter; the p99 shifts by an arbitrary amount depending on f and x. Three numbers are not enough to identify which of the resulting distributions is which. The percentile ladder is.
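A concrete instance of the construction, using two discrete distributions so the matching is exact rather than approximate (the specific values are illustrative, solved by hand to satisfy the constraints):

```python
import numpy as np

# Distribution A: symmetric three-point bulk; all of its spread is in the bulk.
a = np.repeat([50.0, 100.0, 150.0], [490, 20, 490])
# Distribution B: nearly all mass at 100 ms, with the spread supplied instead
# by a rare ~1.18 s delta (values chosen so mean and stddev match A; the
# medians match as well because the moved mass stays away from the middle).
b = np.repeat([50.0, 100.0, 1182.2], [43, 955, 2])

for name, d in (("A", a), ("B", b)):
    print(f"{name}: mean={d.mean():.1f}  median={np.median(d):.0f}  "
          f"std={d.std():.1f}  p99.9={np.percentile(d, 99.9):.0f}")
```

Both print mean 100.0, median 100, and stddev 49.5 — yet A's p99.9 is 150 ms and B's is roughly 1182 ms, an 8x gap that three matched summary statistics are completely blind to.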
A practical variant of this construction shows up in benchmarking: a "regression test" that asserts mean < 100 ms will pass on both the unimodal and the bimodal distributions described above, even though the bimodal one has a 4× higher p99. Code reviewers reading a PR titled "no performance regression: mean unchanged at 86 ms" will sign off without seeing that a new bimodal failure mode just shipped to production. Replacing the assertion with p99.9 < 200 ms catches the new failure mode immediately because p99.9 is sensitive to the second mode in a way the mean is not. The CI-side investment is a one-line change; the production-side payoff is one fewer Tatkal-shaped incident per quarter.
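The CI-side change is small enough to sketch in full; the gate function, the 200 ms budget, and the simulated build are illustrative assumptions, not a specific team's pipeline:

```python
import numpy as np

def perf_gate(samples_ms, tail_budget_ms=200.0):
    """CI gate sketch: assert on the tail, not the mean (threshold illustrative)."""
    assert np.percentile(samples_ms, 99.9) < tail_budget_ms, "tail regression"

rng = np.random.default_rng(8)
# Candidate build: the bulk is fast, but a new 3% slow mode shipped with it.
build = np.concatenate([rng.lognormal(np.log(75), 0.2, 97_000),
                        rng.lognormal(np.log(350), 0.1, 3_000)])

print(f"mean = {build.mean():.0f} ms")    # would sail through a mean < 100 ms gate
try:
    perf_gate(build)
except AssertionError as exc:
    print("gate failed:", exc)
```

The mean-based gate passes and the tail-based gate fails on the same sample, which is the whole argument in four lines of test code.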
The Anscombe construction also explains why microbenchmark suites that report only "median speedup" or "mean speedup" can show a 1.05× improvement that, in production, makes the service slower at the tail. The microbenchmark measures the bulk under controlled conditions; the production tail is generated by events the microbenchmark does not exercise (GC pauses on a busy heap, scheduler delays under contention, NUMA migrations on a multi-socket box). A change that makes the bulk faster but introduces a new GC interaction will look like a win on the microbenchmark and a loss on the dashboard. The discipline of "ship the percentile ladder from a production replay, not a microbenchmark" is the answer; it costs more compute than a microbenchmark but it is the only test that surfaces the second-mode regressions before they ship.
The Pareto regime and what an α less than 2 means in practice
A Pareto-tailed distribution with shape parameter α has a tail that decays as x^-α. For α > 2 the variance is finite; for 1 < α ≤ 2 the variance is infinite but the mean is finite; for α ≤ 1 even the mean is infinite. Real production tails are usually in the 1.5 to 2.5 range — sometimes thinner (closer to a clean log-normal) and sometimes thicker (closer to a power law with a hard cap from a server-side timeout).
The operational reading of α is roughly: how much worse does the next worse percentile look compared to the current one? At α = 2, p99.9 is around 3× p99 and p99.99 is around 10× p99. At α = 1.5, p99.9 is around 5× p99 and p99.99 is around 22× p99. The Hotstar playback-start tail measured during the IPL final has α ~ 1.7 between p99 and p99.99; the Aadhaar UID-auth tail has α ~ 2.1; the Zerodha order-match tail at market open has α ~ 1.9. Knowing the tail's slope tells you whether your p99 dashboard is a leading indicator for p99.9 or whether the two move on different time-scales for different reasons. The chapter on coordinated omission, two chapters from now, will explain why most teams' tail-slope measurements are biased toward thinner tails than reality, and how HdrHistogram's CO-corrected sampling fixes the bias.
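For a pure Pareto tail, the quantile at probability p is proportional to (1 − p)^(−1/α), so the rung-to-rung ratios fall out of a one-liner, and the relationship inverts cleanly to estimate α from a measured ratio:

```python
import numpy as np

def tail_ratios(alpha):
    """p99.9/p99 and p99.99/p99 implied by a pure Pareto tail of shape alpha."""
    q = lambda p: (1 - p) ** (-1 / alpha)    # Pareto quantile, up to a scale factor
    return q(0.999) / q(0.99), q(0.9999) / q(0.99)

for alpha in (2.0, 1.7, 1.5):
    r1, r2 = tail_ratios(alpha)
    print(f"alpha={alpha}: p99.9/p99 = {r1:.1f}   p99.99/p99 = {r2:.1f}")

# The inverse: estimate the tail slope from a measured p99.9/p99 ratio.
alpha_hat = 1 / np.log10(4.64)
print(f"ratio 4.64 implies alpha ~ {alpha_hat:.2f}")
```

Since p99.9/p99 = 10^(1/α), one HdrHistogram dump gives you a free α estimate: take log10 of the ratio and invert it.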
A practitioner's cheat-sheet for tail slope: if your p99.9 / p99 ratio sits in [2.5, 4], you have a clean log-normal-shaped tail and the mean is survivable as a primary summary if you also alert on p99. If the ratio is in [4, 8], you are in the heavy-Pareto regime and the mean is genuinely misleading — you need the percentile ladder. If the ratio is greater than 8, you almost certainly have a bimodal distribution masquerading as a heavy tail, and the right next step is to drill into the tail bucket and find the secondary mode (a stuck replica, a GC pause, a cold cache). The ratio is a free signal that lives on every HdrHistogram dump; teams that read it monthly catch shape changes before they become incidents.
A complementary signal is the p99 / p50 ratio. For a clean unimodal log-normal with sigma ~ 0.45 (a healthy production service), this ratio sits around 2.5 to 3.5. For a bimodal distribution where 5 to 10% of requests fall into a slow secondary mode, the ratio jumps to 6 to 12, depending on how far apart the modes are. A team that watches p99 / p50 weekly will see bimodal failure modes appear before the absolute p99 breaches an SLO threshold — the ratio moves first because it is sensitive to shape changes, and the absolute values move after because the bulk takes a while to drift. The Hotstar reliability team set up a weekly "shape report" in 2024 that ranks services by p99 / p50 ratio; the top ten worst-shaped services every week are the targets for the following sprint's reliability work, regardless of whether any of them are actively breaching their SLO. The discipline catches ~3 latent bimodal failure modes per quarter that would otherwise have surfaced only during a peak-traffic incident.
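The two regimes are easy to reproduce in simulation; the 93/7 split and the 600 ms slow mode below are illustrative assumptions chosen to land inside the ranges quoted above:

```python
import numpy as np

rng = np.random.default_rng(11)
N = 500_000
shapes = {
    # Healthy unimodal service: log-normal with sigma ~ 0.45.
    "healthy": rng.lognormal(np.log(80), 0.45, N),
    # Bimodal service: 7% of requests fall into a slow mode near 600 ms.
    "bimodal": np.concatenate([
        rng.lognormal(np.log(80), 0.45, int(0.93 * N)),
        rng.lognormal(np.log(600), 0.2, N - int(0.93 * N)),
    ]),
}
ratios = {k: np.percentile(v, 99) / np.percentile(v, 50) for k, v in shapes.items()}
for k, r in ratios.items():
    print(f"{k}: p99/p50 = {r:.1f}")
```

The healthy shape lands near 2.8; the bimodal shape jumps to roughly 9 even though its p50 barely moves, which is why the ratio is a leading indicator for shape changes.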
How the mean fails harder under fan-out and aggregation
The previous chapter (/wiki/wall-latency-lives-in-the-long-tail) showed that fan-out turns the backend tail into the user p50. That argument used percentiles directly. There is an equivalent argument that uses means and that fails: the mean of the maximum of n independent samples is not the same as the maximum of the means. A user request whose latency is the maximum of 10 backend latencies has a mean that is higher than any individual backend's mean — the mean of the max grows roughly as μ + σ · sqrt(2 · ln(n)) for Gaussian tails, and faster for heavy tails. Reporting only the backend mean and the user-side mean as if they were comparable hides a 2×-to-5× gap in either direction depending on the backend distribution. The percentile ladder makes the gap visible immediately because each rung is computed from the empirical distribution at that rung; the mean compresses across the whole distribution and hides the rung-by-rung dynamics.
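The μ + σ · sqrt(2 · ln(n)) expression is the leading-order Gaussian bound, and it overshoots somewhat at small n; a quick simulation shows both the growth and the gap (fan-out width and latency parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, trials = 100.0, 10.0, 10, 100_000
# Each row is one user request fanning out to n backends;
# the user-visible latency is the slowest backend's response.
backend = rng.normal(mu, sigma, (trials, n))
mean_of_max = backend.max(axis=1).mean()
bound = mu + sigma * np.sqrt(2 * np.log(n))   # leading-order Gaussian bound

print(f"mean backend latency:        {mu:.1f} ms")
print(f"mean of max over {n} calls:  {mean_of_max:.1f} ms")
print(f"mu + sigma*sqrt(2 ln n):     {bound:.1f} ms")
```

Even with a thin Gaussian tail and a modest fan-out of 10, the user-side mean sits well above the backend mean; with heavy tails the gap grows faster than the sqrt(2 ln n) rate.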
The aggregation problem is the same one chapter 46 called the percentile-of-percentiles trap, except for the mean: the mean of a fleet is well-defined as the weighted average of per-pod means, which is a true property of the fleet's request-mix. But a "fleet mean" reported per-minute is an average of per-pod means that may have been computed across different request mixes (different pods serve different traffic), and the resulting number is not a clean property of any pod or the fleet. Mean is closed under fleet aggregation only if the per-pod means are weighted by the per-pod request count, which most metrics systems do not do by default. HdrHistogram aggregation, by bucket-wise summation, is closed under fleet aggregation without weighting tricks — another reason the percentile-from-merged-histogram path is the right operational choice.
A worked example helps make the unweighted-fleet-mean failure mode concrete. Suppose a fleet has 10 pods, 9 of which see 1000 req/s with mean latency 50 ms, and one of which sees 100 req/s with mean latency 500 ms. The naive "average of pod means" is (9 × 50 + 1 × 500) / 10 = 95 ms. The correctly-weighted fleet mean is (9 × 1000 × 50 + 1 × 100 × 500) / (9 × 1000 + 100) = 54.9 ms. The two numbers differ by 40 ms — a factor of 1.7 — for a metric that the dashboard reports as if it were unambiguous. In a heavy-tailed regime, the 500-ms mean on the slow pod is itself an unstable summary of an even-worse tail, and the unweighted fleet mean ends up as a quadruple-mean (mean of unstable means of heavy tails), with confidence intervals so wide that the metric is operationally meaningless. The HdrHistogram-merged equivalent is straightforward: sum the per-pod buckets weighted by request count (which the histogram already encodes), then read percentiles off the merged histogram. The math is closed; the metric is honest.
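The worked example is short enough to verify numerically; this reproduces the two numbers from the paragraph above:

```python
import numpy as np

# (requests per second, mean latency in ms) for each pod in the fleet:
# nine healthy pods and one lightly-loaded slow pod.
pods = [(1000, 50.0)] * 9 + [(100, 500.0)]

naive = np.mean([m for _, m in pods])             # unweighted average of pod means
weighted = (sum(r * m for r, m in pods)
            / sum(r for r, _ in pods))            # request-weighted fleet mean

print(f"naive average of pod means:  {naive:.1f} ms")   # 95.0
print(f"request-weighted fleet mean: {weighted:.1f} ms") # 54.9
```

The naive number answers "what does a randomly chosen pod report"; the weighted number answers "what does a randomly chosen request experience". Dashboards almost always want the second, and histogram merging gives it for free because the request counts live inside the buckets.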
When the mean is what the user actually feels
There is a narrow but real set of services where the mean is the user-facing experience. Consider a streaming client that consumes bytes for 30 minutes and is only frustrated if the total latency over those 30 minutes exceeds a threshold: the user does not care about per-chunk p99 because per-chunk slowness is hidden by the playback buffer. The right summary for that user is the mean per-chunk latency multiplied by chunk count, plus a fudge factor for re-buffering events. The chunk-level p99 is a leading indicator of buffer underruns but not of the user-perceived experience.
The cleanest way to phrase the rule: choose the summary by walking from the user-facing event back to the measurement. If the user-facing event is a single request — a Razorpay payment confirm, a Zerodha order placement, an IRCTC booking submit — the per-request percentile ladder is the right summary because the user feels the latency of each individual request directly. If the user-facing event aggregates many requests — a Hotstar viewing session, a Google Search results page that fans out to 50 backends, a CRED rewards calculation that touches a dozen tables — the right summary is the distribution of the aggregated event, which sometimes (re-buffering rate, fan-out tail) is itself a percentile and sometimes (total CPU, total bytes) is a mean. The mistake is to pick a summary by familiarity rather than by the user-facing event; teams that map the event-to-summary chain explicitly produce dashboards that are at once more truthful and easier to reason about.
The same applies to batch processing, ETL pipelines, and many ML training workloads: the user-facing quantity is the wall-clock time of the whole batch, which is the sum (or maximum, depending on dependencies) of the per-task latencies. For these, the per-task percentile ladder is a debugging tool and the per-batch wall-clock mean is the operational metric. The lesson is not that the mean is always wrong but that the right summary is the one whose distribution maps to user experience — and for interactive services, that is always the percentile ladder.
The Hotstar streaming team sets two parallel metrics for the same playback path: a per-chunk percentile ladder for the team that owns the CDN-edge code (their job is to keep the per-chunk p99 inside the buffer-underrun budget) and a per-session re-buffering-rate for the team that owns the player (their job is to keep the user-visible re-buffering count below 1 per session per hour). Both teams measure the same fundamental data; the summaries are tuned to the questions each team can act on. A team that only had the per-chunk percentile would not know how often re-buffering actually showed up to the user; a team that only had the re-buffering rate would not know which of the per-chunk p99s was responsible. Picking the right summary is half the discipline; picking two right summaries for two different operational owners is the other half.
Reproduce this on your laptop
# Pure Python + numpy + hdrh, no kernel access required.
python3 -m venv .venv && source .venv/bin/activate
pip install numpy hdrh
# Run the same-mean experiment
python3 why_averages_lie.py
# Try a wider gap: bump the slow-mode probability
# (edit the script, set fast_prob from 0.92 down to 0.85; rerun)
# The two sample means stay locked together; the p99 gap blows up.
# Watch the variance lie under heavy tails: increase n by 10x
# and see how the standard deviation moves while the percentiles stabilise
for n in 10000 100000 1000000; do
python3 -c "
import numpy as np
rng = np.random.default_rng($n)
x = rng.lognormal(mean=np.log(80_000), sigma=1.6, size=$n)
print(f'n=$n mean={np.mean(x)/1000:.1f}ms std={np.std(x)/1000:.1f}ms p99={np.percentile(x,99)/1000:.1f}ms')"
done
You will see standard deviation move by a factor of 2 to 4 across runs while the p99 holds within ~5%. This is the variance-instability regime in action; if your dashboard's "±1σ" wanders that much between minutes, you have rediscovered the same effect in production.
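The driver script itself is not reproduced in this section. A minimal sketch consistent with the knobs mentioned above (`fast_prob = 0.92`, two sample means locked together) could look like the following; everything except `fast_prob` — the mode positions, sigmas, and seed — is an illustrative assumption:

```python
# why_averages_lie.py -- hypothetical sketch, not the original script.
import numpy as np

rng = np.random.default_rng(2024)
N, fast_prob = 1_000_000, 0.92

# Distribution 1: clean unimodal service.
uni = rng.lognormal(np.log(88), 0.15, N)

# Distribution 2: bimodal, 92% fast plus an 8% slow mode, rescaled so the
# two sample means are locked together exactly.
n_fast = int(fast_prob * N)
bi = np.concatenate([rng.lognormal(np.log(60), 0.15, n_fast),
                     rng.lognormal(np.log(450), 0.10, N - n_fast)])
bi *= uni.mean() / bi.mean()

for name, d in (("unimodal", uni), ("bimodal", bi)):
    print(f"{name}: mean={d.mean():.1f}  p50={np.percentile(d, 50):.1f}  "
          f"p99={np.percentile(d, 99):.1f}  p99.9={np.percentile(d, 99.9):.1f}")
```

Dropping `fast_prob` from 0.92 toward 0.85 deepens the slow mode; the rescaling line keeps the means identical no matter how far apart the p99s drift, which is the point of the experiment.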
Where this leads next
The argument so far is statistical: heavy-tailed distributions defeat the mean and the standard deviation, and the percentile ladder is the only summary that survives. The next chapter, Percentiles: p50, p99, p99.9, pins down what each rung actually measures, why p99.9 and p99.99 are the operational floor, and how to reason about which rung your particular service should report against. From there, Coordinated omission and HdrHistograms confronts the most common way teams report a tail-latency number that is silently wrong — their load generator pauses on slow responses and never records the worst samples in the first place. The Tail at Scale and fan-out makes the Dean-Barroso fan-out math from chapter 46 rigorous and walks through the production techniques (hedging, tied requests, replica selection) that real services use to fight it.
The single habit to take from this chapter into tomorrow morning: when someone hands you a "latency went down by 12%" claim with a mean as the source, the correct first response is "show me the percentile ladder before and after". Half the time the ladder will agree with the mean and the change is real. The other half the ladder will reveal that the mean moved because the bulk got faster while the tail got slower — or because the slow mode disappeared from the sample window for measurement reasons unrelated to the change. Either way, you get to the truth in one extra question, and you stop optimising for a number your users do not feel.
A second habit, smaller but worth naming: stop reading "±" annotations on latency dashboards. The "±" carries a Gaussian assumption that latency does not satisfy, and reading it as "central 68% of requests" is wrong by 10 to 30 percentage points on a typical heavy-tailed distribution. When a colleague says "the mean was 84 ms ± 12 ms", the right cognitive response is to silently translate that to "the bulk is somewhere around 84 ms and I have no idea what the tail is". Then ask for the ladder. The translation is mechanical once you internalise it; the conversation that follows is the one that actually matters.
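The size of the "±" lie is easy to measure: check what fraction of samples actually fall within one standard deviation of the mean, using the same sigma = 1.6 log-normal as the loop above:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.lognormal(np.log(80_000), 1.6, 1_000_000)
# A Gaussian would put 68% of samples inside mean +/- 1 stddev.
inside = np.mean((x > x.mean() - x.std()) & (x < x.mean() + x.std()))
print(f"fraction within mean +/- 1 sigma: {inside:.2f}  (Gaussian: 0.68)")
```

The interval swallows roughly 95% of the samples instead of 68%, because the huge stddev produced by the tail stretches the band far past the bulk; the "±" is not describing the central 68% of anything.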
A third habit, the longest-lived: when designing a new dashboard, panel, or alert, write down the user-facing event that the metric is supposed to track before picking the summary. "The user submits a payment and waits for the confirm" is a user-facing event; the right summary is the per-request percentile ladder of the confirm latency. "The system processes a 1-million-row ETL job" is a user-facing event; the right summary is the wall-clock mean of the job, plus a percentile ladder of per-task latencies as a debugging aid. The summary follows from the event; if you cannot articulate the event, you do not know which summary you need, and any number you put on the dashboard will be at best decorative. This is the discipline the senior SREs at Razorpay, Hotstar, Zerodha, and PhonePe have all converged on independently, and it is the discipline this chapter is trying to install.
The deeper habit, repeated from chapter 46 because it is the lesson of all of Part 7: summarise distributions with quantiles, not moments. The mean and the standard deviation are moments; they are the right tools for distributions whose moments are well-behaved (Gaussian, uniform, thin-tailed in general) and the wrong tools for the heavy-tailed distributions that pervade systems performance. Quantiles are the right tools for the latency you actually have. Internalising this rule is what separates engineers who fight tail-latency incidents successfully from engineers who keep tuning thresholds upward until the alarm stops firing.
The chapter after this one will pin down what the rungs of the ladder actually mean — what p99 includes and excludes, why p99.9 is the operational floor for fan-out-shaped services, what p99.99 is good for and what it is not, and how to choose the right rung for the call topology your service has. The chapter after that will confront the most common way teams report a tail number that is silently wrong: coordinated omission, where the load generator pauses on slow responses and never records the worst samples in the first place. Together, these three chapters — this one, percentiles, and coordinated omission — are the foundation of every honest latency conversation in Part 7. Every later chapter assumes them.
References
- Frank Anscombe, "Graphs in Statistical Analysis" (American Statistician, 1973) — the paper that introduced Anscombe's quartet and made the "summary statistics cannot replace looking at the data" argument that this chapter extends to latency.
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that crystallised the case against the mean for latency reporting and made HdrHistogram the de facto standard. Watch once at the start of your career, again every two years.
- HdrHistogram project page — the sub-bucketed histogram library that gives ~3 significant decimal digits of tail accuracy at fixed memory cost, with the percentile ladder as its native output format.
- Nassim Taleb, "Statistical Consequences of Fat Tails" (2020) — rigorous treatment of why moments fail under power-law tails, with explicit treatment of the variance-undefined regime that appears in real latency distributions.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2 — the canonical text's treatment of statistical methods for performance measurement, including the case against the mean and the case for percentiles.
- /wiki/wall-latency-lives-in-the-long-tail — the previous chapter, which set up the heavy-tailed shape of real latency distributions and the fan-out math that makes the tail user-visible.
- /wiki/coordinated-omission-and-hdr-histograms — the next-but-one chapter, which addresses the measurement-side discipline that ensures the tail you are reporting is the real tail.
- /wiki/percentiles-p50-p99-p999 — the next chapter, which pins down what each rung of the percentile ladder actually measures and why p99.9 is the operational floor for fan-out-shaped services.