Serving p99 latency under ingest pressure
Karan runs the growth-analytics platform at Meesho. The "GMV by city by hour" tile on the merchant dashboard answers in 70 ms during the morning. By 7 p.m. — when 4.2 million Indian merchants are simultaneously checking yesterday's earnings and Meesho's order-events stream is pushing 180k events/sec into the same ClickHouse cluster — the same tile takes 9 seconds at p99. The p50 has barely moved (90 ms); the p99 has blown out by two orders of magnitude. Nobody added more data; nobody changed the query; nothing in the code path is broken. The cluster CPU graph shows 65 percent utilisation, well below the 80 percent he was warned about. And yet the dashboard is unusable for one user in a hundred at exactly the time the merchants need it.
A single OLAP cluster runs three workloads — ingestion, refresh of materialised views, and dashboard reads — that compete for CPU, memory, and IO bandwidth. Average utilisation hides the contention; queue depth at the contended resource is what explains p99. The fix is workload isolation (separate compute, scheduler classes, or admission control), not a bigger cluster.
The three workloads that share one cluster
A real-time OLAP system at Indian fintech or commerce scale runs three concurrent workloads on the same physical hardware. They look unrelated when you draw the architecture diagram; they fight for the same resources at the millisecond level.
The first is ingestion: a Kafka consumer pulling events from orders.cdc at 180k records/sec, parsing JSON, decoding Iceberg snapshots, writing to columnar storage. On Karan's cluster this runs as a separate process per shard, but the disk it writes to and the page cache it pollutes are shared with the readers. Every second of ingest writes ~14 MB/sec/shard of compressed data and dirties ~40 MB/sec/shard of page cache.
The second is materialised view refresh: the 2-minute MV that pre-aggregates daily-by-city GMV. Each refresh scans the partitions touched by the last 2 minutes of ingestion (~21M rows on a busy evening), computes HLL sketches and sums, writes the result back. The refresh job is a reader of the fact table and a writer of the MV — it competes with both ingestion (for write IO) and dashboards (for scan throughput).
The third is dashboard serving: 200 concurrent merchant users plus 50 internal analytics queries plus a fraud-detection cron firing every 30 seconds. These are short queries — 50 ms to 2 seconds each — but there are thousands of them per minute, and they all want page-cache pages that the ingestion job is busy invalidating.
The three rhythms are mismatched on purpose: ingestion is a smooth river, MV refresh is a 30-second pulse every 2 minutes, and dashboard traffic is a Poisson process that spikes at 7 p.m. Their interaction is what kills p99.

Why average utilisation is the wrong number: a CPU graph showing 65 percent average reads as healthy. But if the CPU sits at 95 percent for 30 seconds out of every 120 (during the MV refresh), then 25 percent of dashboard queries arrive during the burst. Inside the burst, the queue depth at the CPU scheduler can be 10× longer than at idle — multiplying those queries' wait time by 10×. Average utilisation aggregates this away; p99 latency exposes it.
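The burst arithmetic above can be checked in a few lines, using the figures from Karan's cluster (a sketch, not a monitoring tool):

```python
# Average vs burst utilisation over one 120-second cycle, using the
# figures above: a 30 s refresh burst at 95% CPU and a 65% 2-minute average.
burst_s, cycle_s = 30.0, 120.0
burst_util, avg_util = 0.95, 0.65

# Baseline utilisation outside the burst, solved from
#   avg = (burst_s * burst_util + (cycle_s - burst_s) * baseline) / cycle_s
baseline = (avg_util * cycle_s - burst_s * burst_util) / (cycle_s - burst_s)

# Poisson arrivals land uniformly in time, so the fraction of dashboard
# queries arriving inside the burst is simply its duty cycle.
frac_in_burst = burst_s / cycle_s

print(f"baseline outside burst: {baseline:.0%}")       # 55%
print(f"queries hitting burst:  {frac_in_burst:.0%}")  # 25%
```

The graph shows 65 percent; the queries that set the p99 see either 55 percent or 95 percent, never 65.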
Why p99 collapses while p50 stays calm
The mathematical reason p99 behaves so differently from p50 is queueing theory. When a resource (CPU core, disk IO, lock) is shared by Poisson-arriving requests, the expected wait time is governed by the M/M/1 formula — but the tail of the wait-time distribution explodes super-linearly as utilisation approaches 100 percent.
# Queueing-theory model of dashboard latency under ingest pressure
# M/M/1 queue at a single shared resource (e.g. one disk channel).
# Wait time has CDF P(W <= t) = 1 - rho * exp(-(mu - lam) * t), so the
# q-th quantile has a closed form:
#   W_q = 0 if q <= 1 - rho, else -ln((1 - q) / rho) / (mu - lam)
import math

mu = 1000.0  # ops per second the resource can serve (1 ms per op average)
lambdas = [400, 600, 700, 800, 850, 900, 950]  # arrival rates to sweep

def wait_quantile_ms(lam, q):
    """q-th quantile of the M/M/1 waiting time, in milliseconds."""
    rho = lam / mu
    if q <= 1 - rho:  # with probability 1 - rho the resource is idle on arrival
        return 0.0
    return -math.log((1 - q) / rho) / (mu - lam) * 1000

print(f"{'load %':>8} | {'p50 (ms)':>10} | {'p99 (ms)':>10} | {'p99/p50':>8}")
print("-" * 48)
for lam in lambdas:
    p50 = wait_quantile_ms(lam, 0.50)
    p99 = wait_quantile_ms(lam, 0.99)
    print(f"{lam / mu * 100:>7.0f}% | {p50:>10.2f} | {p99:>10.2f} | {p99 / max(p50, 0.01):>8.0f}x")
# Output
  load % |   p50 (ms) |   p99 (ms) |  p99/p50
------------------------------------------------
     40% |       0.00 |       6.15 |     615x
     60% |       0.00 |      10.24 |    1024x
     70% |       0.00 |      14.16 |    1416x
     80% |       2.35 |      21.91 |       9x
     85% |       3.54 |      29.62 |       8x
     90% |       5.88 |      45.00 |       8x
     95% |      12.84 |      91.08 |       7x
Walk through what happened. At 40 percent utilisation the median dashboard query waits zero milliseconds — most of the time the resource is idle when the query arrives. But the p99 query (one in a hundred) waits about 6 ms because it had the bad luck to arrive during a busy stretch. Why p99 is non-zero even at 40 percent load: even at low utilisation, the inter-arrival times of requests are exponentially distributed, so occasionally three requests arrive within the time it takes to serve one. The third request waits for the second, which waits for the first. Tail latency is the cost of arrival-rate variance, not average utilisation. At 80 percent load the p50 is still a modest 2 ms: the typical query sees little or no queue. But the p99 has climbed to 22 ms. Push to 95 percent and p99 hits 91 ms while p50 is 13 ms, a 7× gap.
This is the shape Karan sees on his dashboard. The cluster CPU graph at 65 percent average looks fine. But during the 30-second MV-refresh burst the CPU is at 95 percent and his queries hit the steep part of this curve. The average over 2 minutes is 65 percent, well below the danger zone; the instantaneous utilisation during the burst is what determines the p99 of queries that arrive in that window.
The same story repeats at every shared resource. Disk IO bandwidth is 8 GB/sec on the cluster; ingestion uses 1.5 GB/sec steady, MV refresh adds 7 GB/sec during its 30-second bursts, and dashboard reads add 0.8 GB/sec at peak. Time-averaged over the 2-minute cycle that is about 4 GB/sec, comfortably under the limit; but when the refresh burst overlaps the 7 p.m. read peak, instantaneous demand hits 9.3 GB/sec, queries queue up at the device, and p99 explodes.
Where contention actually happens
The CPU graph is the wrong place to look. Five other resources contend more harmfully and show up as p99 spikes before CPU does.
Page cache eviction. ClickHouse and StarRocks rely heavily on the OS page cache to serve repeat reads. A dashboard query reading the last 7 days of orders_daily_by_city MV touches ~50 MB; if the data is in cache, the query is 30 ms. If ingestion has written 40 MB/sec for the last 30 minutes, it has cycled 72 GB through the page cache — long enough to evict the MV's hot pages. The next dashboard query is a cold read, 800 ms.

Why ingestion evicts read-side cache: the Linux page-cache replacement policy is a variant of LRU. Newly written pages are inserted at the head of the LRU list with full priority. If the write rate exceeds the read rate, write pages push read pages off the tail. There is no separation between "pages I'm reading repeatedly for dashboards" and "pages I just wrote and won't read until the next merge." The kernel cannot tell.
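The eviction arithmetic is worth making explicit (figures from the text; decimal units for simplicity):

```python
# How much data has ingest pushed through the page cache, vs the hot set?
dirty_mb_per_sec = 40     # ingest dirty rate from the text
window_min = 30           # look-back window
hot_set_mb = 50           # pages one dashboard query re-reads

cycled_gb = dirty_mb_per_sec * window_min * 60 / 1000  # decimal GB
print(f"cycled through cache in {window_min} min: {cycled_gb:.0f} GB")  # 72 GB
print(f"dashboard hot set: {hot_set_mb} MB")
# 72 GB of fresh write pages vs a 50 MB hot set: unless the hot pages are
# re-touched constantly, LRU pressure has long since pushed them off the tail.
```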
Disk IO queue depth at the device. An NVMe SSD can sustain ~600k IOPS at low queue depth and ~1M IOPS at queue depth 32 — but each individual operation's latency climbs from 100 µs to 800 µs as the queue fills. Ingestion's writes typically arrive at queue depth 8–16; MV refresh adds another queue-depth-8 burst; concurrent dashboard reads now see a 24-deep queue and pay the cost. The disk's throughput graph still looks healthy (it is happily doing 800k IOPS); the dashboard's latency graph is on fire.
Memory pressure and the OOM killer. ClickHouse's per-query memory limit is set per workload class. If ingestion is using 40 GB of memory for in-flight Kafka batches and the MV refresh kicks off a 30 GB hash aggregation, the cluster has only 26 GB left for dashboard queries on a 96 GB node. Queries that need more get aborted with Memory limit exceeded. p99 isn't just slow — it's failure. Each failed query is a 500 in the API gateway and a "Try again later" toast in the merchant's browser.
Distributed-query coordination locks. Both ClickHouse's Distributed table engine and StarRocks' coordinating FE serialize parts of plan distribution. A long-running MV refresh holding a metadata read-lock blocks dashboard queries that need to enumerate the same partitions. The query thread is not CPU-busy, the disk is fine, but the engine's lock-introspection view (the analogue of Postgres's pg_locks) shows 200 sessions waiting on one lock holder.
ZooKeeper / Keeper / metastore round-trips. ClickHouse on Keeper makes a round-trip per part-merge to record metadata. Under heavy ingestion the merge rate climbs to 50/sec and the metadata service becomes the bottleneck — every dashboard query that needs to list parts now waits behind 50 metadata operations.
# Where contention actually showed up at Meesho during 7 pm peak
Resource | Avg load | p99 wait | Workload causing it
----------------------|----------|----------|----------------------
CPU cores | 65% | 8 ms | MV refresh burst
Page cache | 92% full | 600 ms | Ingest evicts read pages
Disk IO queue depth | 14 | 280 ms | All three converge at 7 pm
Memory | 78% | - | OOM on 0.4% of queries
ZK round-trip | 12 ms | 350 ms | Merge rate during ingest
The p99 chase is a debugging puzzle: which queue is full right now? The CPU graph is rarely the answer; it lags by 30 seconds and is averaged over a window long enough to hide the bursts.
Three real fixes
The wrong fix is "add more nodes." Doubling the cluster halves the average load on each resource but doesn't help the p99 if the burst pattern remains. A cluster at 32 percent average that still bursts to 95 percent has much the same in-burst p99 as one at 65 percent average bursting to 95 percent. The fixes that work change the interaction between workloads.
Fix 1: workload isolation (the hardware separation)
Run ingestion on one set of nodes and serving on another. ClickHouse, StarRocks, and Pinot all support cluster topologies where the data is replicated to multiple replica groups and queries can be routed to a "serving" replica that is read-only for the dashboard. Ingestion writes to the "ingest" replica; the data replicates over the network. The dashboard cluster never sees a write IO.
The cost is duplicate hardware (~1.7× for full read-replica isolation) and an extra hop of replication latency (~30 sec for ClickHouse's ReplicatedMergeTree, faster for cloud-native engines like StarRocks-on-S3). For Meesho this means an extra ₹3.4 lakh/month in infrastructure to drop dashboard p99 from 9 seconds to 180 ms. For a merchant-facing product where 7 p.m. is the most valuable hour of the day, the trade is obvious. For a back-office team where the dashboard is loaded twice a day, it is wasteful.
Fix 2: workload management (the scheduler-class fix)
Both ClickHouse 23+ and StarRocks 3.0+ implement workload classes — declarative quotas on CPU, memory, and concurrency that the engine's scheduler enforces. A dashboard query is tagged workload = 'serving'; an MV refresh is tagged workload = 'maintenance'; the scheduler ensures serving always has at least 50 percent of CPU even during a refresh burst.
-- StarRocks workload group with priority and quota
CREATE RESOURCE GROUP serving_high
TO (ROLE = 'dashboard_user')
WITH ('cpu_core_limit' = '48',
      'mem_limit' = '40%',
      'concurrency_limit' = '200',
      'priority' = '8');

CREATE RESOURCE GROUP maintenance
TO (USER = 'mv_refresh_job')
WITH ('cpu_core_limit' = '24',
      'mem_limit' = '30%',
      'priority' = '2');
This drops p99 from 9 sec to 1.2 sec without adding hardware. The cost is bounded: the MV refresh is now slower (it gets at most 24 CPU cores instead of all 96), so the staleness window grows from 2 minutes to 3.5 minutes. For most workloads this is acceptable.
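Where the engine lacks resource groups, the same idea can be approximated in the application tier with admission control, the third isolation mechanism named at the top of the chapter. A minimal sketch (the class and limits are illustrative, not any engine's API):

```python
import threading

class AdmissionController:
    """Caps in-flight queries per workload class; excess callers block."""
    def __init__(self, limits):
        self._sems = {wl: threading.BoundedSemaphore(n) for wl, n in limits.items()}

    def run(self, workload, query_fn):
        with self._sems[workload]:   # blocks while the class is at its cap
            return query_fn()

# Serving keeps generous concurrency; maintenance is throttled hard so an
# MV refresh can never hold more than 2 execution slots at once.
ctl = AdmissionController({"serving": 200, "maintenance": 2})
print(ctl.run("serving", lambda: "GMV tile, 70 ms"))   # GMV tile, 70 ms
```

The point of doing it above the engine is that the cap is enforced before the query consumes any cluster resource; the engine-side resource group only takes over once the query is already scheduled.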
Fix 3: bound the burst (the soft-knee solution)
Instead of running MV refresh as a 30-second burst every 2 minutes, run it as a continuous 5 percent CPU drain. The same total work is done; the burst is flattened. ClickHouse's insert-triggered materialised views (and the experimental LIVE VIEW), StarRocks' incremental MVs, and Materialize's differential dataflow all push in this direction — process every record once on arrival, with bounded per-record work, instead of batch-recomputing.
The trade is implementation complexity: the engine has to maintain incremental state correctly under inserts, updates, deletes, and out-of-order events. Watermarks (Build 10) become load-bearing. The latency win is that dashboard p99 no longer sees a 30-second burst because the burst no longer exists.
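The difference between the two refresh styles fits in a dozen lines. This is a toy model of the incremental side (the event tuples and on_event helper are illustrative; a real engine must also handle updates, deletes, and the late events the watermark machinery exists for):

```python
from collections import defaultdict

# Incremental MV state: running GMV per (city, day), updated per event.
gmv = defaultdict(float)

def on_event(city, day, amount):
    """O(1) work per arriving order event -- no periodic recompute burst."""
    gmv[(city, day)] += amount

for city, day, amt in [("Pune", "2026-02-03", 499.0),
                       ("Pune", "2026-02-03", 251.0),
                       ("Surat", "2026-02-03", 120.0)]:
    on_event(city, day, amt)

print(gmv[("Pune", "2026-02-03")])   # 750.0
```

The batch version would instead rescan every event for the window each refresh; the work per cycle grows with ingest volume, and all of it lands inside the burst.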
When the cluster is the wrong abstraction
Three workload shapes where putting ingestion and serving on one cluster is fundamentally wrong, regardless of how clever the workload management is.
Mixed-tenant SaaS where one tenant's burst hurts everyone. If Razorpay's settlement-summary dashboard runs on the same cluster as a third-party analytics tenant that decides to run a SELECT * on a 4 TB table, isolation by tenant is more important than isolation by workload. The fix is multi-tenant routing (Build 16) — different tenants pinned to different node pools, hard limits on per-tenant CPU.
Cold-tier scans on a hot cluster. A user runs a 90-day-window query on a 7-day-cache cluster; the missing 83 days are pulled from S3 at 200 MB/sec. The single query saturates the network for 5 minutes, p99 across the whole cluster goes to 30 sec. Cold scans must run on a separate query path with its own bandwidth budget — Snowflake's compute warehouses, Databricks' SQL warehouses, and StarRocks' storage-compute split all enforce this.
Ingestion bursts that exceed the per-node write budget. During Flipkart Big Billion Days the order-events stream goes from 200k events/sec to 4M events/sec for 4 hours. No clever workload management on a fixed cluster will absorb this; the fix is autoscaling the ingestion path independently of the serving path. Cloud-native architectures (StarRocks on Iceberg, Pinot on cloud-tier) make this possible because the storage is shared and compute can scale per-workload.
Common confusions
- "p99 is just three p50 outliers per 100 queries." No. p99 is the 99th-percentile of the latency distribution, which under contention is heavy-tailed. Doubling the load doesn't double p99 — it can multiply it by 10×. The relationship is non-linear and dictated by queueing theory.
- "If average CPU is 65 percent, the cluster has plenty of headroom." Average is a 60-second window; the workload that hurts p99 is a 200-ms burst. A cluster at 65 percent average can be at 100 percent for 30 seconds out of every 120 — exactly the window in which the dashboard's p99 query arrives.
- "Adding more nodes always reduces p99." It reduces average load on each node. If the burst pattern is unchanged (refresh still runs every 2 minutes, ingestion still spikes at 7 p.m.), each node still goes through the same percentage of busy bursts. p99 inside a burst is unchanged; p99 averaged across a quarter goes down only because more queries land in idle windows.
- "Workload isolation is a premium feature for big customers." It is the load-bearing fix at any scale where ingestion and serving share hardware. ClickHouse 23+, StarRocks 3.0+, and Trino's resource groups all ship workload management free; the cost is configuring it right.
- "Page cache hit rate is the only memory metric that matters." Cache hit rate is an aggregate over time; the dashboard's p99 is determined by what was in cache 50 ms ago, not what was in cache on average over the last hour. Tracking cold reads per minute is a better proxy for the dashboard's tail latency than overall hit ratio.
- "Tail latency is just bad luck — there's no fix." Tail latency is bad luck amplified by queueing. Reduce the queue depth (by isolating workloads, capping concurrency, flattening bursts) and the tail compresses. The Google paper "The Tail at Scale" is the canonical statement; every OLAP team rediscovers it.
Going deeper
Little's Law and the queue-depth diagnostic
Little's Law (L = λW) connects the three observable quantities of any queueing system: the average number of items in the system L, the arrival rate λ, and the average time-in-system W. For a dashboard cluster, L is the number of queries currently executing or waiting, λ is queries per second arriving, and W is the average end-to-end latency. The relationship holds independent of arrival distribution, service distribution, and number of servers — making it the most useful diagnostic in the engineer's toolkit.
The practical use: if the dashboard is serving 200 queries per second and the average query takes 0.5 sec, L = 100 — a hundred queries should always be in flight. If the metrics show 300 in flight, something is wedged: queries are arriving faster than they leave, the queue is growing, p99 is about to collapse. Conversely, if L < 100 and latency is bad, the cluster isn't queueing — the slow queries are all individually slow, which points at a different problem (cold reads, lock contention, GC).
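As code, the wedged-cluster check from the paragraph above is one multiplication and a comparison (the 1.3× slack factor is an illustrative threshold, not a standard):

```python
def diagnose(arrival_rate_qps, avg_latency_s, observed_in_flight, slack=1.3):
    """Compare observed in-flight queries against Little's Law L = lambda * W."""
    expected = arrival_rate_qps * avg_latency_s
    if observed_in_flight > expected * slack:
        return "queueing: arrivals outpace departures; p99 is about to collapse"
    if observed_in_flight < expected / slack:
        return "not queueing: queries are individually slow (cold reads, locks, GC)"
    return "steady state"

# Karan's dashboard: 200 qps at 0.5 s average => L should hover near 100.
print(diagnose(200, 0.5, 300))
```

With 300 observed in flight against an expected 100, the first branch fires: the queue is growing and the tail is about to follow.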
The Dynamo paper and tail-tolerance patterns
Amazon's Dynamo paper (DeCandia et al., 2007) popularised the tail-tolerance move of sending a read to multiple replicas and returning as soon as enough respond. Google's "Tail at Scale" paper (Dean & Barroso, 2013) sharpened this into the hedged request: send the same query to a second replica after a short delay, return whichever responds first, cancel the loser. It then generalised the pattern with tied requests, micro-partitioning, and selective replication. Hedging exploits the fact that p99 latency is much greater than p50 (at 95 percent load the p99/p50 ratio in our model was 7×), so even one extra replica request cuts the worst case dramatically.
The Indian fintech operational pattern: every cross-zone read at Razorpay is hedged. A read that should take 8 ms p50 across the AZ boundary occasionally takes 200 ms when the network has microbursts; the hedge to a same-AZ replica completes in 12 ms and the cross-AZ read is cancelled. The cost is 1.05× extra reads (because most reads return before the hedge fires); the benefit is p99 drops from 200 ms to 18 ms.
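A hedged read takes a dozen lines. This sketch (replica functions and the 10 ms hedge delay are illustrative) fires the hedge only if the primary hasn't answered within the delay, so the common fast case costs one request:

```python
import concurrent.futures as cf
import time

def hedged_read(primary, fallback, hedge_after_s=0.010):
    """Send the primary read; if it hasn't answered within hedge_after_s,
    also send the fallback and return whichever replica answers first.
    (Python threads cannot be cancelled, so the loser is ignored rather
    than cancelled as in the paper.)"""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(primary)]
    done, _ = cf.wait(futures, timeout=hedge_after_s)
    if not done:                                  # primary is slow: hedge
        futures.append(pool.submit(fallback))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)                     # don't block on the loser
    return next(iter(done)).result()

cross_az = lambda: (time.sleep(0.200), "cross-AZ, 200 ms")[1]  # microburst
same_az = lambda: (time.sleep(0.005), "same-AZ, 12 ms")[1]
print(hedged_read(cross_az, same_az))   # same-AZ, 12 ms
```

The delayed hedge is why the overhead is ~1.05× rather than 2×: most primaries answer before the timer fires and the second request is never sent.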
Concurrent merges, vacuum, and the LSM stall
LSM-tree storage engines (RocksDB, ClickHouse's MergeTree) periodically rewrite levels to compact small files. The merge job is an internal workload that doesn't show up in the engine's "ingest" or "serving" buckets — it consumes disk IO and CPU on its own schedule. During a heavy merge, p99 spikes for the same queueing-theory reasons as any other burst. The classical fix is rate-limiting the merge ("compact at most 200 MB/sec/node"); the modern fix is tiered storage where the merging happens on a separate compute pool.
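The classical rate-limiting fix amounts to a token bucket on merge IO. A sketch under the 200 MB/sec budget quoted above (the class is illustrative, not a RocksDB or ClickHouse API):

```python
import time

class MergeRateLimiter:
    """Token bucket: merges may draw at most rate_mb_s of disk bandwidth."""
    def __init__(self, rate_mb_s, burst_mb):
        self.rate = rate_mb_s
        self.capacity = burst_mb
        self.tokens = burst_mb
        self.last = time.monotonic()

    def throttle(self, chunk_mb):
        """Account for chunk_mb of merge IO; return seconds the merge
        thread must sleep before issuing its next chunk."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= chunk_mb
        return max(0.0, -self.tokens / self.rate)  # time to refill to zero

limiter = MergeRateLimiter(rate_mb_s=200, burst_mb=200)
print(limiter.throttle(100))   # 0.0 -- within the budget
# A following 500 MB chunk overdraws the bucket by ~400 MB; throttle()
# returns ~2 s and the merge sleeps, leaving bandwidth for dashboard reads.
```

The trade is the same as every throttle in this chapter: merges fall behind during heavy ingest, the part count grows, and read amplification rises until the backlog drains.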
Iceberg's metadata-only optimisation (the Build 14 cost-attribution chapter touches this) avoids LSM-style merges by treating compaction as a separate scheduled job that the team can pin to off-peak hours. Trino, Spark, and StarRocks read Iceberg without compaction interference. The cost is more S3 list-prefix calls per query (because there are more small files to enumerate); the benefit is no merge-induced p99 spikes.
What the chaos-engineering test actually measures
Meesho's reliability team runs a quarterly "ingest storm" drill: artificially push the order-events stream from 180k/sec to 600k/sec for 10 minutes while monitoring dashboard p99. The drill reveals which queues fill up first — usually the page cache, occasionally ZooKeeper. The team then tunes the most-stressed resource: more memory for cache, sharded Keeper for metadata. Without the drill, the failure mode shows up only during Big Billion Days when the cost of debugging is highest. With it, the team has practised the exact failure mode at a time of their choosing.
Where this leads next
- /wiki/cost-attribution-which-team-paid-for-that-query — Build 16; how to charge the dashboard's p99 cost back to the team that owns the dashboard, so the trade between hardware and SLA is visible to whoever sets the budget.
- /wiki/multi-tenant-routing-and-noisy-neighbours — Build 16; the per-tenant version of the workload-isolation problem covered here at the per-workload level.
- /wiki/pre-aggregation-materialized-views-and-their-costs — the previous chapter; the refresh job is the most common source of MV-induced p99 spikes.
The pattern in this chapter is older than OLAP: any system where multiple workloads share a finite resource will have p99 problems that average utilisation hides. Mainframe operators in the 1970s tuned IBM's MVS scheduler classes with the same trade-offs that ClickHouse engineers tune today; the queueing-theory math has not changed in fifty years. What has changed is the cost: a 9-second p99 on a leadership dashboard is, in 2026, a CEO-visible problem within an hour. The 1970s sysadmin had until Tuesday's batch report to fix it.
References
- The Tail at Scale (Dean & Barroso, 2013) — the canonical paper on why tail latency dominates user-perceived performance and how to hedge against it.
- Dynamo: Amazon's Highly Available Key-value Store (DeCandia et al., 2007) — origin of hedged requests and tied requests as a tail-latency mitigation.
- Queueing Systems, Volume 1: Theory (Kleinrock, 1975) — the foundational text; it derives the M/M/1 wait-time distribution used in the Python model above.
- ClickHouse workload management documentation — the engine-side reference for resource groups and per-query quotas.
- StarRocks resource group documentation — declarative workload classes and how the scheduler enforces them.
- Linux page cache replacement policy (LWN) — why ingestion writes evict dashboard reads from the page cache, and what the kernel can and cannot do about it.
- /wiki/pre-aggregation-materialized-views-and-their-costs — chapter 109; the refresh-job behaviour that creates the contention this chapter is about.
- /wiki/clickhouse-columnar-for-real-time — chapter 105; the engine-side context for the cluster topology and ingestion path described here.