Serving p99 latency under ingest pressure

Karan runs the growth-analytics platform at Meesho. The "GMV by city by hour" tile on the merchant dashboard answers in 70 ms during the morning. By 7 p.m. — when 4.2 million Indian merchants are simultaneously checking yesterday's earnings and Meesho's order-events stream is pushing 180k events/sec into the same ClickHouse cluster — the same tile takes 9 seconds at p99. The p50 has barely moved (90 ms); the p99 has collapsed by two orders of magnitude. Nobody added more data; nobody changed the query; nothing in the code path is broken. The cluster CPU graph shows 65 percent utilisation, well below the 80 percent he was warned about. And yet the dashboard is unusable for one user in a hundred at exactly the time the merchants need it.

A single OLAP cluster runs three workloads — ingestion, refresh of materialised views, and dashboard reads — that compete for CPU, memory, and IO bandwidth. Average utilisation hides the contention; queue depth at the contended resource is what explains p99. The fix is workload isolation (separate compute, scheduler classes, or admission control), not a bigger cluster.

The three workloads that share one cluster

A real-time OLAP system at Indian fintech or commerce scale runs three concurrent workloads on the same physical hardware. They look unrelated when you draw the architecture diagram; they fight for the same resources at the millisecond level.

The first is ingestion: a Kafka consumer pulling events from orders.cdc at 180k records/sec, parsing JSON, decoding Iceberg snapshots, writing to columnar storage. On Karan's cluster this runs as a separate process per shard, but the disk it writes to and the page cache it pollutes are shared with the readers. Every second of ingest writes ~14 MB/sec/shard of compressed data and dirties ~40 MB/sec/shard of page cache.

The second is materialised view refresh: the 2-minute MV that pre-aggregates daily-by-city GMV. Each refresh scans the partitions touched by the last 2 minutes of ingestion (~21M rows on a busy evening), computes HLL sketches and sums, writes the result back. The refresh job is a reader of the fact table and a writer of the MV — it competes with both ingestion (for write IO) and dashboards (for scan throughput).

The third is dashboard serving: 200 concurrent merchant users plus 50 internal analytics queries plus a fraud-detection cron firing every 30 seconds. These are short queries — 50 ms to 2 seconds each — but there are thousands of them per minute, and they all want page-cache pages that the ingestion job is busy invalidating.

[Figure: Three workloads sharing one cluster — one ClickHouse cluster with three colour-coded workload streams hitting it simultaneously. Kafka ingestion pushes 180k events/sec continuously (write-heavy); the MV refresh runs every 2 minutes as a 30–60 s scan-and-write burst; dashboard serving issues 12,000 queries/hour of tiny latency-bound scans, peaking at 7 p.m. All three converge on the shared cluster resources — 96 vCPU, 240 GB page cache, 8 GB/sec aggregate disk IO, 25 Gbps network — where the p99 SLA is decided.]
Three workloads with three different rhythms hit one set of physical resources. The dashboard's p99 SLA is shaped by whichever queue at those resources is deepest at the moment a query arrives.

The three rhythms are inherently mismatched: ingestion is a smooth river, MV refresh is a 30-second pulse every 2 minutes, and dashboard traffic is a Poisson process that spikes at 7 p.m. Their interaction is what kills p99. This is also why average utilisation is the wrong number: a CPU graph showing 65 percent average reads as healthy. But if the CPU sits at 95 percent for 30 seconds out of every 120 (during MV refresh), then 25 percent of dashboard queries arrive during the burst. Inside the burst, the queue at the CPU scheduler can be 10× deeper than at idle — multiplying those queries' wait time by 10×. Average utilisation aggregates this away; p99 latency exposes it.
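
The burst arithmetic above can be checked in three lines. A minimal sketch, assuming the burst runs the CPU at 95 percent and the floor between bursts sits at 55 percent (the 55 percent figure is an assumption chosen to match the observed 65 percent average):

```python
# Fraction of dashboard queries that land inside the MV-refresh burst,
# assuming Poisson (uniform-in-time) arrivals over the refresh cycle.
burst_s, cycle_s = 30, 120                 # 30 s burst every 2 minutes
p_burst = burst_s / cycle_s                # probability a query arrives mid-burst

# Assumed utilisations: 95% during the burst, 55% between bursts.
avg_util = p_burst * 0.95 + (1 - p_burst) * 0.55

print(f"queries arriving during burst: {p_burst:.0%}")   # 25%
print(f"time-averaged utilisation:     {avg_util:.0%}")  # 65%
```

A quarter of all queries see a machine at 95 percent, yet the graph Karan stares at reports 65.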

Why p99 collapses while p50 stays calm

The mathematical reason p99 behaves so differently from p50 is queueing theory. When a resource (a CPU core, a disk channel, a lock) is shared by Poisson-arriving requests, the wait-time distribution is described by the M/M/1 model — and its tail explodes super-linearly as utilisation approaches 100 percent.

# Queueing-theory model of dashboard latency under ingest pressure
import numpy as np

np.random.seed(42)

# Model: M/M/1 queue at a single shared resource (e.g. one disk channel)
# Service rate mu = 1000 ops/sec  (1 ms per op average)
# Arrival rate varies; we measure p50 and p99 wait time

mu = 1000.0  # ops per second the resource can serve
lambdas = [400, 600, 700, 800, 850, 900, 950]
n_samples = 100_000

print(f"{'load %':>8} | {'p50 (ms)':>10} | {'p99 (ms)':>10} | {'p99/p50':>8}")
print("-" * 48)
for lam in lambdas:
    rho = lam / mu
    # Wait time in M/M/1 has CDF: P(W <= t) = 1 - rho * exp(-(mu - lam) * t)
    u = np.random.uniform(0, 1, n_samples)
    # invert: W = 0 with prob (1-rho), else exponential with rate (mu - lam)
    waiting = (u > (1 - rho))
    expo = np.random.exponential(1.0 / (mu - lam), n_samples)
    w = np.where(waiting, expo, 0.0) * 1000  # to ms
    p50 = np.percentile(w, 50)
    p99 = np.percentile(w, 99)
    print(f"{rho*100:>7.0f}% | {p50:>10.2f} | {p99:>10.2f} | {p99/max(p50,0.01):>8.0f}x")
# Output (a finite sample wobbles slightly around these analytic quantiles)

  load % |   p50 (ms) |   p99 (ms) |  p99/p50
------------------------------------------------
     40% |       0.00 |       6.15 |     615x
     60% |       0.46 |      10.24 |      22x
     70% |       1.12 |      14.16 |      13x
     80% |       2.35 |      21.91 |       9x
     85% |       3.54 |      29.62 |       8x
     90% |       5.88 |      45.00 |       8x
     95% |      12.84 |      91.08 |       7x

Walk through what happened. At 40 percent utilisation the median dashboard query waits zero milliseconds — more often than not the resource is idle when the query arrives. But the p99 query (one in a hundred) waits about 6 ms because it had the bad luck to arrive during a busy stretch. Why p99 is non-zero even at 40 percent load: even at low utilisation, the inter-arrival times of requests are exponentially distributed, so occasionally three requests arrive within the time it takes to serve one. The third request waits for the second, which waits for the first. Tail latency is the cost of arrival-rate variance, not average utilisation. At 80 percent load the p50 is still only ~2 ms — most queries see little or no queue — but the p99 has climbed to ~22 ms. Push to 95 percent and p99 hits ~91 ms against a p50 of ~13 ms, and every further point of utilisation steepens the curve.

This is the shape Karan sees on his dashboard. The cluster CPU graph at 65 percent average looks fine. But during the 30-second MV-refresh burst the CPU is at 95 percent and his queries hit the steep part of this curve. The average over 2 minutes is 65 percent, well below the danger zone; the instantaneous utilisation during the burst is what determines the p99 of queries that arrive in that window.
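
The burst-versus-average distinction can be made concrete by extending the M/M/1 model above. A sketch, assuming Karan's cluster alternates between a 95 percent burst (25 percent of the time) and a 55 percent floor — the same 65 percent average as a steady load, but a very different tail:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, n = 1000.0, 200_000          # service rate (ops/sec), sample count

def mm1_wait(rho, size):
    """Sample M/M/1 waiting times (seconds) at utilisation rho."""
    lam = rho * mu
    waits = rng.exponential(1.0 / (mu - lam), size)
    return np.where(rng.uniform(size=size) < rho, waits, 0.0)

# Steady 65 percent vs a 25/75 mixture of 95 percent burst and 55 percent floor.
steady = mm1_wait(0.65, n)
in_burst = rng.uniform(size=n) < 0.25
bursty = np.where(in_burst, mm1_wait(0.95, n), mm1_wait(0.55, n))

for name, w in [("steady 65%", steady), ("bursty 55/95%", bursty)]:
    print(f"{name:>14}: p50 {np.percentile(w, 50)*1000:6.2f} ms, "
          f"p99 {np.percentile(w, 99)*1000:6.2f} ms")
```

Both traces average 65 percent utilisation; the bursty one has a p99 several times worse, because one query in four samples the steep end of the curve.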

The same story repeats at every shared resource. Disk IO bandwidth is 8 GB/sec on the cluster; ingestion uses 1.5 GB/sec steady, MV refresh adds 4 GB/sec for 30 seconds out of every 120, dashboard reads add 0.8 GB/sec at peak. The time-averaged total is about 3.3 GB/sec — comfortably under the limit. But during the refresh × peak overlap the instantaneous demand is 6.3 GB/sec, and the background part-merges triggered by heavy ingestion can push it past the device limit; queries queue up at the device and p99 explodes.
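
The gap between the averaged and instantaneous numbers is just a duty-cycle calculation, using the figures above:

```python
# Time-averaged vs instantaneous disk-bandwidth demand (GB/sec),
# with the refresh burst active 30 s out of every 120 s.
ingest, refresh_burst, reads_peak = 1.5, 4.0, 0.8
duty = 30 / 120

avg_demand  = ingest + refresh_burst * duty + reads_peak
peak_demand = ingest + refresh_burst + reads_peak

print(f"time-averaged demand: {avg_demand:.1f} GB/sec")   # 3.3
print(f"during the burst:     {peak_demand:.1f} GB/sec")  # 6.3
```

A dashboard that only ever looks at `avg_demand` will never see the 6.3 GB/sec moments when queries queue at the device.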

Where contention actually happens

The CPU graph is the wrong place to look. Five other resources contend more harmfully and show up as p99 spikes before CPU does.

Page cache eviction. ClickHouse and StarRocks rely heavily on the OS page cache to serve repeat reads. A dashboard query reading the last 7 days of the orders_daily_by_city MV touches ~50 MB; if the data is in cache, the query is 30 ms. If ingestion has written 40 MB/sec for the last 30 minutes, it has cycled 72 GB through the page cache — enough to evict the MV's hot pages. The next dashboard query is a cold read: 800 ms. Why ingestion evicts read-side cache: Linux's page-cache replacement is an LRU approximation (two lists, active and inactive). A sustained write stream floods the inactive list, so when the write rate exceeds the read rate, freshly written pages push read pages off the tail faster than the dashboards can re-promote them. There is no separation between "pages I'm reading repeatedly for dashboards" and "pages I just wrote and won't read until the next merge." The kernel cannot tell.
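
The eviction window is worth computing for your own cluster. A back-of-envelope sketch, not a model of the kernel's actual LRU — the 8-shards-per-node figure is an assumption:

```python
# How long a fully cached working set survives under a given ingest rate.
cache_gb = 240            # page cache available on the node
write_mb_s = 40           # dirty-page rate per shard from ingestion
shards = 8                # assumed shards per node

churn_mb_s = write_mb_s * shards
survival_min = (cache_gb * 1024) / churn_mb_s / 60
print(f"page cache fully cycled every ~{survival_min:.0f} min")  # ~13 min
```

Any MV page not re-read within that window is gone, and the next dashboard query that wants it pays the cold-read price.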

Disk IO queue depth at the device. An NVMe SSD can sustain ~600k IOPS at low queue depth and ~1M IOPS at queue depth 32 — but each individual operation's latency climbs from 100 µs to 800 µs as the queue fills. Ingestion's writes typically arrive at queue depth 8–16; MV refresh adds another queue-depth-8 burst; concurrent dashboard reads now see a 24-deep queue and pay the cost. The disk's throughput graph still looks healthy (it is happily doing 800k IOPS); the dashboard's latency graph is on fire.

Memory pressure and the OOM killer. ClickHouse's per-query memory limit is set per workload class. If ingestion is using 40 GB of memory for in-flight Kafka batches and the MV refresh kicks off a 30 GB hash aggregation, the cluster has only 26 GB left for dashboard queries on a 96 GB node. Queries that need more get aborted with Memory limit exceeded. p99 isn't just slow — it's failure. Each failed query is a 500 in the API gateway and a "Try again later" toast in the merchant's browser.
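
The alternative to letting queries die mid-flight is admission control: check the serving pool's memory budget before a query starts, and queue or shed it if the budget would be blown. A minimal sketch — the class, names, and limits are illustrative, not any engine's API:

```python
# Minimal memory admission control: refuse a query up front rather than
# abort it mid-flight with "Memory limit exceeded".
class MemoryAdmission:
    def __init__(self, budget_gb):
        self.budget = budget_gb
        self.in_use = 0.0

    def try_admit(self, estimate_gb):
        """Admit iff the planner's memory estimate fits the remaining budget."""
        if self.in_use + estimate_gb > self.budget:
            return False          # caller should queue or shed the query
        self.in_use += estimate_gb
        return True

    def release(self, estimate_gb):
        self.in_use -= estimate_gb

pool = MemoryAdmission(budget_gb=26)   # 96 GB node minus ingest + refresh
print(pool.try_admit(20))   # True  -- fits
print(pool.try_admit(10))   # False -- would exceed the 26 GB budget
```

A refused query can return "busy, retry" in 1 ms; an OOM-killed one burns seconds of work and surfaces as a 500.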

Distributed-query coordination locks. Both ClickHouse's Distributed table engine and StarRocks' coordinator BE serialize parts of plan distribution. A long-running MV refresh holding a metadata lock blocks dashboard queries that need to enumerate the same partitions. The query thread is not CPU-busy and the disk is fine, but the engine's lock-introspection view (the analogue of Postgres's pg_locks) shows 200 sessions waiting on one lock holder.

ZooKeeper / Keeper / metastore round-trips. ClickHouse on Keeper makes a round-trip per part-merge to record metadata. Under heavy ingestion the merge rate climbs to 50/sec and the metadata service becomes the bottleneck — every dashboard query that needs to list parts now waits behind 50 metadata operations.

# Where contention actually showed up at Meesho during 7 pm peak

Resource              | Avg load | p99 wait | Workload causing it
----------------------|----------|----------|----------------------
CPU cores             | 65%      | 8 ms     | MV refresh burst
Page cache            | 92% full | 600 ms   | Ingest evicts read pages
Disk IO queue depth   | 14       | 280 ms   | All three converge at 7 pm
Memory                | 78%      | -        | OOM on 0.4% of queries
ZK round-trip         | 12 ms    | 350 ms   | Merge rate during ingest

The p99 chase is a debugging puzzle: which queue is full right now? The CPU graph is rarely the answer; it lags by 30 seconds and is averaged over a window long enough to hide the bursts.

Three real fixes

The wrong fix is "add more nodes." Doubling the cluster halves the average load on each resource, but if the refresh job simply finishes its burst faster at the same peak intensity, the bursts still hit 95 percent and the p99 of queries arriving inside them is unchanged. A cluster at 32 percent average with 95 percent bursts has the same tail as one at 65 percent average with 95 percent bursts. The fixes that work change the interaction between the workloads.

Fix 1: workload isolation (the hardware separation)

Run ingestion on one set of nodes and serving on another. ClickHouse, StarRocks, and Pinot all support cluster topologies where the data is replicated to multiple replica groups and queries can be routed to a "serving" replica that is read-only for the dashboard. Ingestion writes to the "ingest" replica; the data replicates over the network. The dashboard cluster never sees a write IO.

The cost is duplicate hardware (~1.7× for full read-replica isolation) and an extra hop of replication latency (~30 sec for ClickHouse's ReplicatedMergeTree, faster for cloud-native engines like StarRocks-on-S3). For Meesho this means an extra ₹3.4 lakh/month in infrastructure to drop dashboard p99 from 9 seconds to 180 ms. For a merchant-facing product where 7 p.m. is the most valuable hour of the day, the trade is obvious. For a back-office team where the dashboard is loaded twice a day, it is wasteful.
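
With replica groups in place, the routing itself can live in the client. A sketch of workload-aware routing — the DSN scheme, pool names, and query-kind labels are placeholders, not a specific driver's API:

```python
# Client-side routing for replica-group isolation: mutations go to the
# ingest pool, dashboard reads to the read-only serving pool.
INGEST_DSN  = "clickhouse://ingest-pool:9000"
SERVING_DSN = "clickhouse://serving-pool:9000"

WRITE_KINDS = ("insert", "merge", "mv_refresh")

def dsn_for(query_kind: str) -> str:
    """Route by workload class, not by table."""
    return INGEST_DSN if query_kind in WRITE_KINDS else SERVING_DSN

print(dsn_for("insert"))     # clickhouse://ingest-pool:9000
print(dsn_for("dashboard"))  # clickhouse://serving-pool:9000
```

The point of doing this in the client is that the serving pool can then be sized, cached, and autoscaled purely for the read workload.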

Fix 2: workload management (the scheduler-class fix)

Both ClickHouse 23+ and StarRocks 3.0+ implement workload classes — declarative quotas on CPU, memory, and concurrency that the engine's scheduler enforces. A dashboard query is tagged workload = 'serving'; an MV refresh is tagged workload = 'maintenance'; the scheduler ensures serving always has at least 50 percent of CPU even during a refresh burst.

-- StarRocks-style resource group with quota and priority (illustrative;
-- property names and classifier syntax vary across StarRocks versions)
CREATE RESOURCE GROUP serving_high
   TO (role = 'dashboard_user')
   WITH ('cpu_core_limit' = '48',
         'mem_limit'      = '40%',
         'concurrency_limit' = '200',
         'priority' = '8');

CREATE RESOURCE GROUP maintenance
   TO (user = 'mv_refresh_job')
   WITH ('cpu_core_limit' = '24',
         'mem_limit'      = '30%',
         'priority' = '2');

This drops p99 from 9 sec to 1.2 sec without adding hardware. The cost is bounded: the MV refresh is now slower (it gets at most 24 CPU cores instead of all 96), so the staleness window grows from 2 minutes to 3.5 minutes. For most workloads this is acceptable.

Fix 3: bound the burst (the soft-knee solution)

Instead of running MV refresh as a 30-second burst every 2 minutes, run it as a continuous ~5 percent CPU drain. The same total work is done; the burst is flattened. ClickHouse's insert-time materialised views, StarRocks' incremental MV refresh, and Materialize's differential dataflow all push in this direction — process every record once on arrival, with bounded per-record work, instead of batch-recomputing.
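
The difference is easiest to see in a toy: batch refresh re-scans the fact table every cycle, while incremental maintenance touches one aggregate cell per event. A minimal sketch with made-up rows:

```python
from collections import defaultdict

# Toy incremental MV: running GMV per (city, day), updated per event
# with O(1) work -- no periodic re-scan of the fact table.
events = [("Jaipur", "2026-02-01", 499.0),
          ("Surat",  "2026-02-01", 1250.0),
          ("Jaipur", "2026-02-01", 320.0)]

mv = defaultdict(float)
for city, day, amount in events:     # process each record on arrival
    mv[(city, day)] += amount        # bounded per-record work

print(mv[("Jaipur", "2026-02-01")])  # 819.0
```

The production version has to handle updates, deletes, and late events — which is exactly where the watermark machinery becomes load-bearing.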

The trade is implementation complexity: the engine has to maintain incremental state correctly under inserts, updates, deletes, and out-of-order events. Watermarks (Build 10) become load-bearing. The latency win is that dashboard p99 no longer sees a 30-second burst because the burst no longer exists.

[Figure: Burst refresh vs flattened refresh — CPU utilisation over 8 minutes for the two strategies. The bursty trace spikes to 95 percent for 30 seconds every 2 minutes and sits around 50 percent otherwise; its dashboard-p99 strip shows 9-second spikes aligned with the CPU spikes. The flattened trace (continuous incremental refresh) holds a steady 65 percent with no spikes; its p99 strip holds at 200 ms throughout. Same work, different shape, very different p99.]
The bursty refresh does the same total work, but during each burst the dashboard's p99 spikes by 50×. Flattening the refresh into a continuous incremental computation removes the spikes — at the cost of a more complex maintenance algorithm.

When the cluster is the wrong abstraction

There are three workload shapes where putting ingestion and serving on one cluster is fundamentally wrong, regardless of how clever the workload management is.

Mixed-tenant SaaS where one tenant's burst hurts everyone. If Razorpay's settlement-summary dashboard runs on the same cluster as a third-party analytics tenant that decides to run a SELECT * on a 4 TB table, isolation by tenant is more important than isolation by workload. The fix is multi-tenant routing (Build 16) — different tenants pinned to different node pools, hard limits on per-tenant CPU.

Cold-tier scans on a hot cluster. A user runs a 90-day-window query on a 7-day-cache cluster; the missing 83 days are pulled from S3 at 200 MB/sec. The single query saturates the network for 5 minutes, p99 across the whole cluster goes to 30 sec. Cold scans must run on a separate query path with its own bandwidth budget — Snowflake's compute warehouses, Databricks' SQL warehouses, and StarRocks' storage-compute split all enforce this.

Ingestion bursts that exceed the per-node write budget. During Flipkart Big Billion Days the order-events stream goes from 200k events/sec to 4M events/sec for 4 hours. No clever workload management on a fixed cluster will absorb this; the fix is autoscaling the ingestion path independently of the serving path. Cloud-native architectures (StarRocks on Iceberg, Pinot on cloud-tier) make this possible because the storage is shared and compute can scale per-workload.

Common confusions

Going deeper

Little's Law and the queue-depth diagnostic

Little's Law (L = λW) connects the three observable quantities of any queueing system: the average number of items in the system L, the arrival rate λ, and the average time-in-system W. For a dashboard cluster, L is the number of queries currently executing or waiting, λ is queries per second arriving, and W is the average end-to-end latency. The relationship holds independent of arrival distribution, service distribution, and number of servers — making it the most useful diagnostic in the engineer's toolkit.

The practical use: if the dashboard is serving 200 queries per second and the average query takes 0.5 sec, L = 100 — a hundred queries should always be in flight. If the metrics show 300 in flight, something is wedged: queries are arriving faster than they leave, the queue is growing, p99 is about to collapse. Conversely, if L < 100 and latency is bad, the cluster isn't queueing — the slow queries are all individually slow, which points at a different problem (cold reads, lock contention, GC).
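
That diagnostic fits in a few lines. A sketch of the check described above — the slack factor is an assumption, tune it to your own noise floor:

```python
# Little's-law sanity check: expected in-flight = arrival rate * latency.
# Observed in-flight far above this means a queue is building somewhere.
def queue_is_building(arrival_qps, avg_latency_s, observed_in_flight,
                      slack=1.5):
    expected = arrival_qps * avg_latency_s      # L = lambda * W
    return observed_in_flight > slack * expected

print(queue_is_building(200, 0.5, 300))   # True  -- 300 vs 100 expected
print(queue_is_building(200, 0.5, 110))   # False -- within slack
```

The converse reading matters as much as the alert: low in-flight with bad latency means the queries are individually slow, not queued.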

The Dynamo paper and tail-tolerance patterns

Amazon's Dynamo paper (DeCandia et al., 2007) popularised sending a read to multiple replicas and taking the first R responses. Google's "The Tail at Scale" paper (Dean & Barroso, 2013) generalised this into a family of tail-tolerance patterns, including hedged requests: send the same query to a second replica, return whichever responds first, cancel the loser — alongside tied requests, micro-partitioning, and selective replication. Hedging exploits the fact that p99 latency is far greater than p50 — at high load the p99/p50 ratio in our model approached an order of magnitude — so even one extra replica request cuts the worst case dramatically.

The Indian fintech operational pattern: every cross-zone read at Razorpay is hedged. A read that should take 8 ms p50 across the AZ boundary occasionally takes 200 ms when the network has microbursts; the hedge to a same-AZ replica completes in 12 ms and the cross-AZ read is cancelled. The cost is 1.05× extra reads (because most reads return before the hedge fires); the benefit is p99 drops from 200 ms to 18 ms.
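
The mechanics of a hedge are straightforward. A sketch with asyncio and simulated replica latencies — `read_replica` stands in for a real client call, and the 10 ms hedge delay is an assumed threshold:

```python
import asyncio

# Hedged read: fire the primary; if it hasn't answered within the hedge
# delay, fire a backup. First response wins, the loser is cancelled.
async def read_replica(name, latency_s):
    await asyncio.sleep(latency_s)        # simulated replica latency
    return name

async def hedged_read(primary_s, backup_s, hedge_after_s=0.01):
    primary = asyncio.create_task(read_replica("primary", primary_s))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:                              # fast path: hedge never fires
        return primary.result()
    backup = asyncio.create_task(read_replica("backup", backup_s))
    done, pending = await asyncio.wait({primary, backup},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:                  # cancel the slower replica
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_read(0.2, 0.02)))   # backup -- primary was slow
```

Because the hedge only fires past the delay, the extra-read cost stays near the 1.05× figure above while the tail is bounded by the faster replica.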

Concurrent merges, vacuum, and the LSM stall

LSM-tree storage engines (RocksDB, ClickHouse's MergeTree) periodically rewrite levels to compact small files. The merge job is an internal workload that doesn't show up in the engine's "ingest" or "serving" buckets — it consumes disk IO and CPU on its own schedule. During a heavy merge, p99 spikes for the same queueing-theory reasons as any other burst. The classical fix is rate-limiting the merge ("compact at most 200 MB/sec/node"); the modern fix is tiered storage where the merging happens on a separate compute pool.
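
The classical rate-limit is usually a token bucket. A sketch of "compact at most 200 MB/sec/node" — the class and figures are illustrative, not any engine's knob:

```python
# Token-bucket rate limiter for compaction: merges spend tokens per MB
# written; when the bucket is empty the merge waits instead of bursting.
class MergeRateLimiter:
    def __init__(self, mb_per_s, burst_mb):
        self.rate, self.capacity = mb_per_s, burst_mb
        self.tokens, self.last = burst_mb, 0.0

    def admit(self, now_s, merge_mb):
        """Return True if a merge of merge_mb may start at time now_s."""
        self.tokens = min(self.capacity,
                          self.tokens + (now_s - self.last) * self.rate)
        self.last = now_s
        if self.tokens >= merge_mb:
            self.tokens -= merge_mb
            return True
        return False

rl = MergeRateLimiter(mb_per_s=200, burst_mb=400)
print(rl.admit(0.0, 400))   # True  -- bucket starts full
print(rl.admit(0.5, 200))   # False -- only 100 MB of tokens accrued
print(rl.admit(2.0, 200))   # True  -- bucket has refilled
```

The trade is the usual one: capping merge bandwidth lengthens compaction, so read amplification grows until the backlog clears.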

Iceberg's metadata-only optimisation (the Build 14 cost-attribution chapter touches this) avoids LSM-style merges by treating compaction as a separate scheduled job that the team can pin to off-peak hours. Trino, Spark, and StarRocks read Iceberg without compaction interference. The cost is more S3 list-prefix calls per query (because there are more small files to enumerate); the benefit is no merge-induced p99 spikes.

What the chaos-engineering test actually measures

Meesho's reliability team runs a quarterly "ingest storm" drill: artificially push the order-events stream from 180k/sec to 600k/sec for 10 minutes while monitoring dashboard p99. The drill reveals which queues fill up first — usually the page cache, occasionally ZooKeeper. The team then tunes the most-stressed resource: more memory for cache, sharded Keeper for metadata. Without the drill, the failure mode shows up only during Big Billion Days when the cost of debugging is highest. With it, the team has practised the exact failure mode at a time of their choosing.

Where this leads next

The pattern in this chapter is older than OLAP: any system where multiple workloads share a finite resource will have p99 problems that average utilisation hides. Mainframe operators in the 1970s tuned IBM's MVS scheduler classes with the same trade-offs that ClickHouse engineers tune today; the queueing-theory math has not changed in fifty years. What has changed is the cost: a 9-second p99 on a leadership dashboard is, in 2026, a CEO-visible problem within an hour. The 1970s sysadmin had until Tuesday's batch report to fix it.

References