Multi-tenant warehouses: isolation and noisy neighbours

At 03:14 on a Tuesday, the on-call data engineer at a Bengaluru fintech got paged: the fraud team's nightly feature build was running fifteen minutes late and the marketing dashboard had started returning empty results. Both teams had separate dbt projects, separate schemas, separate warehouses on paper. Both were billed to different cost centres. Both shared one Snowflake account. The fraud team's MERGE was scanning 8 TB of card transactions against a Cartesian-shaped predicate and had grabbed every available virtual-warehouse credit in the account; the marketing team's small dashboard query had been queued for eleven minutes waiting for compute. Neither team had done anything wrong by the rules they had been given. The warehouse just had no rule that said "if fraud's query is misbehaving, marketing still gets to render".

A multi-tenant warehouse looks like one database to the billing team and like a shared OS to the engineering team. The hard problem is not access control — it is resource isolation: one tenant must not be able to consume so much CPU, memory, I/O, or queue slots that other tenants stop making progress. Every commercial warehouse solves this with a different mix of dedicated compute pools, query queues, resource governors, and admission control, and every solution leaks somewhere.

What "multi-tenant" actually means inside a warehouse

A warehouse is multi-tenant the moment two independent teams can submit queries that compete for the same physical resource. The physical resource is usually one of four things: CPU cores on a worker node, memory on that node, I/O bandwidth from object storage, and queue slots in a scheduler. If two tenants share any one of these, they can starve each other.

The naive expectation a platform team brings from microservices is wrong. In a microservice world, "multi-tenancy" mostly means namespaces — different teams have different Kubernetes namespaces, different DB schemas, different IAM roles. Resource competition happens, but the OS scheduler and cgroups put hard limits on what one container can grab. A warehouse has no cgroups around a query. A query is a directed acyclic plan that fans out across worker nodes, grabs whatever memory it claims to need, and runs to completion. If the planner is wrong about its memory estimate, the query spills, slows down, and now holds a worker for ten minutes that twenty other tenants wanted.

[Figure: Four shared resources where tenants collide. Two tenants — fraud (an 8 TB MERGE) and marketing (a 200 ms dashboard query) — contend on four paths: CPU cores on worker nodes (spill, scan, sort, hash join), worker memory (hash tables, sort buffers), object-storage I/O (S3 GET bandwidth quota), and queue/admission slots (concurrency cap, queue head). Solid arrows: the heavy tenant grabs all four; dashed: the small tenant queues behind it.]
Isolation only works if every one of the four contention paths has a separate cap or pool. Cap CPU but leave memory shared, and one tenant's spill still pushes the other into a queue. This is why Snowflake, BigQuery, and Redshift each look different — they cap different paths.

A warehouse vendor's job is to put guardrails on each of these four resources so that one tenant cannot fully consume the resource the way the fraud team's MERGE did at 03:14. The catch is that the four resources have to be capped together, not one at a time. Capping CPU but not memory just shifts the failure: the same query now spills to disk, holds memory for longer, and starves the marketing dashboard through I/O instead of CPU.
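To see why capping one path just moves the failure, here is a deliberately crude model — every number in it invented — of what a CPU cap does to a query that also overshoots its memory: the cap stretches the wall time, and the stretched wall time multiplies how long memory and I/O stay held.

# resource_coupling.py — toy model: a CPU cap stretches every other hold
def footprint(cpu_share, mem_gb, runtime_s_full_cpu, mem_cap_gb):
    """Crude model of one query's resource footprint.
    Assumptions (invented): runtime scales inversely with CPU share;
    memory above the cap spills and is written then re-read once."""
    wall_s = runtime_s_full_cpu / cpu_share
    spill_gb = max(0.0, mem_gb - mem_cap_gb)
    held_gb_seconds = min(mem_gb, mem_cap_gb) * wall_s  # RAM held for the whole run
    io_gb = 2 * spill_gb                                # spill write + re-read
    return wall_s, held_gb_seconds, io_gb

# Give the big MERGE half the CPU: it now holds 16 GB of RAM twice as long
# and still pushes 48 GB of spill I/O at the shared object store.
for share in (1.0, 0.5):
    wall, gbs, io = footprint(share, mem_gb=40, runtime_s_full_cpu=300, mem_cap_gb=16)
    print(f"cpu_share={share}: wall={wall:.0f}s  ram_held={gbs:,.0f} GB*s  spill_io={io:.0f} GB")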

The three isolation strategies — pools, governors, queues

Every commercial warehouse solves multi-tenancy with some mix of three techniques, and every choice has a leak.

Dedicated compute pools (Snowflake's Virtual Warehouse model). Each tenant gets their own cluster of worker nodes. Fraud's MERGE runs on WH_FRAUD_XL; marketing's dashboards run on WH_MARKETING_S. The two clusters share storage (the Snowflake "FoundationDB-backed" metadata layer plus S3 micro-partitions) but not compute. Why this works: three of the four contention paths from the previous diagram — CPU, memory, and queue slots — are now per-cluster, so fraud cannot pin marketing's worker nodes. The leak is storage I/O: both clusters hit the same S3 bucket, and a heavy scan on the fraud cluster can throttle the bucket-prefix-level S3 quota that the marketing cluster is also reading from. The other leak is cost: an idle WH_MARKETING_S still bills for every second it is up. Forty teams, forty warehouses, forty bills.
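Back-of-envelope on that cost leak. The credit ladder below is Snowflake's published size-doubling (XS = 1 credit/hour, S = 2, and so on); the dollar price per credit and the billed-hours figure are assumptions for illustration, so read the total as an order of magnitude, not a quote.

# warehouse_sprawl_cost.py — rough cost of forty per-team warehouses
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}
USD_PER_CREDIT = 3.0      # assumption — varies by edition and region
TEAMS = 40
BILLED_HOURS_PER_DAY = 7  # assumption — ~6 busy hours plus auto-suspend lag

annual_usd = (TEAMS * CREDITS_PER_HOUR["S"] * BILLED_HOURS_PER_DAY
              * 365 * USD_PER_CREDIT)
print(f"${annual_usd:,.0f}/year")  # $613,200 under these assumptions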

Resource governors (Redshift's WLM model). One physical cluster, multiple "queues" with declared CPU/memory/concurrency budgets. Fraud's queue gets 60% of the CPU, marketing's gets 20%, ad-hoc gets 20%. The scheduler enforces the split at admission and during execution. The leak is that "60% of the CPU" assumes the planner's memory estimate is accurate; if fraud's MERGE declares 4 GB of memory but actually needs 40 GB, it overshoots its queue's budget and starves the others until it spills its way to completion.
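A sketch of that leak with invented numbers: admission sees only declared estimates, so the split holds on paper while the node is already overcommitted.

# wlm_overshoot.py — toy governor: admission trusts the planner's estimates
TOTAL_MEM_GB = 50
declared = {"fraud_merge": 4, "marketing_dash": 10, "adhoc": 10}   # planner says
actual   = {"fraud_merge": 40, "marketing_dash": 10, "adhoc": 10}  # runtime truth

free, admitted = TOTAL_MEM_GB, []
for q, est in declared.items():   # the governor only ever sees `declared`
    if est <= free:
        admitted.append(q)
        free -= est

used = sum(actual[q] for q in admitted)
print(f"admitted {admitted}: declared {TOTAL_MEM_GB - free} GB, "
      f"actually using {used} GB of {TOTAL_MEM_GB} GB")
# admitted all three: declared 24 GB, actually using 60 GB of 50 GB —
# the node spills and swaps, and the declared split existed only at admission time.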

Query queues with admission control (BigQuery's slot reservations). A slot is one unit of compute capacity. Each tenant reserves a number of slots; queries waiting for slots queue. If fraud's MERGE needs 1,000 slots and only has 200 reserved, it queues until 800 more become available from idle reservations or the on-demand pool. The leak: a query that has been admitted continues to run until completion regardless of how its slot count balloons. A bad query plan that spawns an extra hash-join stage can monopolise the slots it was admitted with for fifteen minutes.

Why all three leak rather than one being strictly better: the four physical resources (CPU, memory, I/O, queue slots) demand four different control loops, but warehouses only expose two or three. Whichever resource has no explicit control becomes the leak path. Snowflake leaves storage I/O implicit; Redshift WLM leaves memory-overshoot implicit; BigQuery leaves admitted-query-runtime implicit. The choice of leak is the choice of which failure mode you are willing to debug at 3 a.m.

In practice every production warehouse blends all three. Snowflake gives you per-warehouse pools (strategy 1) plus per-warehouse STATEMENT_TIMEOUT and MAX_CONCURRENCY_LEVEL (strategy 3). Redshift's RA3 nodes pool storage but split compute via WLM queues (strategy 2). BigQuery has both reservations (3) and "fair scheduler" CPU allocations within a project (2). The right framing for your platform team is not "which strategy" but "for each of the four resources, which strategy applies, and where does it leak?".

A 60-line admission controller you can run

The cleanest way to internalise admission control is to build the simplest version that still demonstrates noisy-neighbour prevention. The script below simulates a single warehouse with two tenants submitting queries; the controller enforces per-tenant concurrency caps, a global slot pool, and a "fair-share" tiebreaker so that no tenant can monopolise the pool even if it submits more queries than its share.

# admission_controller.py — minimal multi-tenant slot scheduler
import heapq, itertools, random
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(order=True)
class Query:
    submitted_at: float
    tenant: str = field(compare=False)
    qid: int = field(compare=False)
    slots_needed: int = field(compare=False)
    runtime_s: float = field(compare=False)

def simulate(tenants, total_slots, sim_end):
    """tenants: dict tenant -> {arrival_rate_qps, slots_per_query, runtime_s, max_concurrent}"""
    pending = []                      # min-heap of waiting queries
    running = []                      # heap by finish_time
    per_tenant_running = defaultdict(int)
    used_slots = 0
    qid = itertools.count()
    waits = defaultdict(list)
    rng = random.Random(7)
    t = 0.0
    next_arrival = {n: rng.expovariate(c["arrival_rate_qps"])
                    for n, c in tenants.items()}
    while t < sim_end:
        t = min(min(next_arrival.values()),
                running[0][0] if running else float("inf"))
        if t >= sim_end:
            break
        # 1. completions free up slots
        while running and running[0][0] <= t:
            _, q = heapq.heappop(running)
            used_slots -= q.slots_needed
            per_tenant_running[q.tenant] -= 1
        # 2. arrivals enter the queue
        for n, c in tenants.items():
            if next_arrival[n] <= t:
                heapq.heappush(pending,
                    Query(t, n, next(qid), c["slots_per_query"], c["runtime_s"]))
                next_arrival[n] = t + rng.expovariate(c["arrival_rate_qps"])
        # 3. admit queries that fit and respect per-tenant caps
        deferred = []
        while pending:
            q = heapq.heappop(pending)
            cap = tenants[q.tenant]["max_concurrent"]
            if (used_slots + q.slots_needed <= total_slots
                and per_tenant_running[q.tenant] < cap):
                used_slots += q.slots_needed
                per_tenant_running[q.tenant] += 1
                heapq.heappush(running, (t + q.runtime_s, q))
                waits[q.tenant].append(t - q.submitted_at)
            else:
                deferred.append(q)
        for q in deferred:
            heapq.heappush(pending, q)
    return waits

cfg = {
    "fraud":     {"arrival_rate_qps": 0.5, "slots_per_query": 800,
                  "runtime_s": 30, "max_concurrent": 1},
    "marketing": {"arrival_rate_qps": 4.0, "slots_per_query": 50,
                  "runtime_s": 1.5, "max_concurrent": 8},
}
waits = simulate(cfg, total_slots=1000, sim_end=600)
for tenant, ws in waits.items():
    ws = sorted(ws)                      # sort once for both percentiles
    p50 = ws[len(ws) // 2]
    p99 = ws[int(len(ws) * 0.99)]
    print(f"{tenant:10s} N={len(ws):4d}  p50_wait={p50:6.2f}s  p99_wait={p99:6.2f}s")
# Sample run:
fraud      N= 280  p50_wait=  0.00s  p99_wait=  3.41s
marketing  N=2391  p50_wait=  0.04s  p99_wait=  1.12s

Walk through what makes this an admission controller rather than just a queue. per_tenant_running enforces the per-tenant concurrency cap — fraud cannot have more than one query in flight, marketing no more than eight. Without that cap, a burst from one tenant fills the pool. The admission check used_slots + q.slots_needed <= total_slots is the global capacity gate — even when a tenant's own cap would allow another query, the pool must have the slots free.

Why both caps and not just one: the per-tenant cap protects against a single tenant submitting 100 queries at once; the global cap protects against the sum of admissible queries across tenants exceeding what the cluster can run concurrently. Either alone is insufficient — per-tenant caps alone admit 8 marketing × 50 + 1 fraud × 800 = 1,200 slots against 1,000 available; the global cap alone lets fraud grab the whole pool.

The pending heap ordered by submitted_at gives FIFO within the queue, but the deferral pattern — pop each waiting query, admit it if it fits, otherwise set it aside and re-push the deferred ones afterwards — lets queries that do not fit step aside without blocking later queries that do. That is the scheduling trick that lets a small marketing query slip past a queued fraud query without violating fairness. Strict FIFO would stall whenever the head of the queue demands more slots than are free, piling smaller queries up behind it even though they could run. Deferral is the standard answer to head-of-line blocking in real warehouse schedulers (Snowflake's queue, BigQuery's slot scheduler); the same instinct — never let a waiting task idle the resource — drives Linux's CFS.

The output shows the consequence of getting both caps right: fraud's heavy queries take their time but never starve, and marketing's small queries see a p99 wait of 1.12 seconds even while fraud is hammering. Why p99 rather than p50: p50 is dominated by no-contention runs, p99 by contention windows. A multi-tenant warehouse is judged by its tail.

The toy is single-process, single-warehouse, FIFO. Production schedulers add: weighted fair-share (each tenant gets a slice of slots proportional to their reservation), priority bands (interactive vs. batch queries get different queues), back-pressure to upstream (when the queue is too long, return RESOURCE_EXHAUSTED to clients rather than queue indefinitely), and preemption (kill a long-running batch query when an interactive query needs slots). The shape — admit, cap-per-tenant, cap-global, defer not block — does not change.
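A minimal sketch of the first of those extensions, weighted fair-share, as a drop-in replacement for the FIFO pop above (the function and parameter names are invented): instead of admitting strictly by submitted_at, admit from whichever tenant is furthest below its reserved share of the slot pool.

# fair_share_pick.py — admit from the tenant furthest below its entitlement
def pick_next(pending_by_tenant, slots_used_by_tenant, shares, total_slots):
    """pending_by_tenant: tenant -> FIFO list of waiting queries
    shares: tenant -> reserved fraction of total_slots (sums to <= 1).
    Returns the tenant whose running usage is furthest below entitlement."""
    best, best_deficit = None, float("-inf")
    for tenant, queue in pending_by_tenant.items():
        if not queue:
            continue
        deficit = shares[tenant] * total_slots - slots_used_by_tenant.get(tenant, 0)
        if deficit > best_deficit:
            best, best_deficit = tenant, deficit
    return best  # then try to admit pending_by_tenant[best][0] as before

# fraud reserved 80% and is using all 800 of it; marketing reserved 20%, using 0:
print(pick_next({"fraud": ["q1"], "marketing": ["q2"]},
                {"fraud": 800, "marketing": 0},
                {"fraud": 0.8, "marketing": 0.2},
                total_slots=1000))  # -> marketing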

[Figure: Admission controller flow with two tenants — admit, cap-per-tenant, cap-global, defer not block. Arrivals (fraud + marketing) land in the pending heap (FIFO by submitted_at), pass the per-tenant cap check (running < max_concurrent) and the global slot check (used + need ≤ total), and enter the running pool (heap by finish_time). Queries that do not fit go to the deferred list and are re-pushed to pending, so a smaller next query slips past the head of the line. Solid arrows: admission path; dashed: deferral path.]
The deferral-not-block pattern is what stops a single 800-slot fraud query from blocking a 50-slot marketing dashboard. Real schedulers (Snowflake's queue, BigQuery's slot scheduler, Linux's CFS) all use a variant of this trick.

How real warehouses lay this out

The admission controller above is a textbook simplification of what every commercial warehouse runs as a service.

Snowflake runs one admission controller per virtual warehouse. The "tenant" is the warehouse, not the user — WH_FRAUD_XL and WH_MARKETING_S each have their own queue. A single warehouse can serve multiple users, and its MAX_CONCURRENCY_LEVEL defaults to 8 — eight queries running at once on a warehouse, the rest queued. Setting STATEMENT_QUEUED_TIMEOUT_IN_SECONDS is the back-pressure: a query that waits longer than the timeout fails rather than queues indefinitely, surfacing the contention to the operator instead of silently piling up.
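Concretely, those knobs are warehouse-level parameters set with ALTER WAREHOUSE. The sketch below runs them through snowflake-connector-python; the warehouse names come from the running example and the timeout values are illustrative, not recommendations.

# snowflake_knobs.py — per-warehouse admission knobs (illustrative values)
import os
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
cur = conn.cursor()
# Cap concurrently running queries on the marketing warehouse; the rest queue.
cur.execute("ALTER WAREHOUSE WH_MARKETING_S SET MAX_CONCURRENCY_LEVEL = 8")
# Back-pressure: fail a query that queues longer than 60 s instead of
# letting it wait indefinitely.
cur.execute("ALTER WAREHOUSE WH_MARKETING_S SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 60")
# Kill any statement that runs longer than 15 minutes on the fraud warehouse.
cur.execute("ALTER WAREHOUSE WH_FRAUD_XL SET STATEMENT_TIMEOUT_IN_SECONDS = 900")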

BigQuery runs one fair scheduler per project. The "tenant" is a reservation, and each reservation declares a baseline slot count. Within a reservation, BigQuery's fair scheduler does CPU-proportional sharing — two queries in the same reservation each get half the CPU. Across reservations, a reservation can borrow idle slots from others — the default behaviour, unless its ignore_idle_slots flag is set — but the borrowed slots shrink back as the owning reservation's load returns. The borrowed-slot model is unique among the big warehouses and is what lets PhonePe's analytics team buy 5,000 reserved slots and burst to 12,000 at month-end.
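A sketch of the borrow-then-shrink rule with two reservations (class and function names invented; the real scheduler reallocates at much finer granularity). Rerunning the allocation when an owner's demand returns is the "preemption" — the borrower shrinks back to baseline.

# slot_borrowing.py — toy borrow-idle / shrink-on-return allocation
from dataclasses import dataclass

@dataclass
class Reservation:
    name: str
    baseline: int  # slots this reservation owns
    demand: int    # slots its running queries currently want

def allocate(reservations):
    """Each reservation first gets min(demand, baseline); leftovers are
    lent to whoever still has unmet demand."""
    alloc = {r.name: min(r.demand, r.baseline) for r in reservations}
    idle = sum(r.baseline - alloc[r.name] for r in reservations)
    for r in reservations:
        lend = min(r.demand - alloc[r.name], idle)
        alloc[r.name] += lend
        idle -= lend
    return alloc

analytics = Reservation("analytics", baseline=5000, demand=12000)
fraud = Reservation("fraud", baseline=7000, demand=0)
print(allocate([analytics, fraud]))  # analytics borrows fraud's idle 7,000
fraud.demand = 7000
print(allocate([analytics, fraud]))  # owner's load is back: analytics -> 5,000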

Redshift WLM runs one admission controller per cluster, partitioning the cluster's CPU and memory into "queues" defined in a JSON config. A query is routed to a queue by SQL pattern, user group, or label. Each queue has a concurrency cap and an optional memory percentage. The system table STV_WLM_QUERY_STATE shows you which queries are admitted, queued, or "returning" — the basic primitive Redshift gives operators for debugging contention.
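The JSON config the paragraph refers to is a list of queue definitions; a minimal version looks like this, written as the Python structure you would serialise into the wlm_json_configuration parameter. The splits echo the 60/20/20 example, and the group names are invented.

# wlm_config.py — shape of a Redshift WLM queue config (illustrative)
import json

wlm_json_configuration = [
    {"user_group": ["fraud"],              # route by user group...
     "query_group": ["nightly_features"],  # ...or by a SET query_group label
     "query_concurrency": 3,               # admission cap for this queue
     "memory_percent_to_use": 60},
    {"user_group": ["marketing"],
     "query_concurrency": 8,
     "memory_percent_to_use": 20},
    {"query_concurrency": 5,               # last queue is the default catch-all
     "memory_percent_to_use": 20},
]
print(json.dumps(wlm_json_configuration, indent=2))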

Databricks SQL warehouses are closer to Snowflake's model — one warehouse per workload, autoscaling clusters under the hood, query queues per warehouse — but with a "Photon" execution engine that provides finer-grained per-query memory caps inside one cluster.

The lesson for a platform team picking among these is not "which is fastest" but "which model best fits your tenant boundary". If teams are well-isolated and willing to pay for dedicated compute, Snowflake's per-warehouse model is mechanically simpler. If teams' workloads are bursty and reservations can usefully borrow from each other, BigQuery's fair scheduler is more efficient. If you need fine-grained queue routing inside one big cluster — say, a single Redshift cluster shared across 25 small dbt projects — WLM's queue config gives you the most operator control at the cost of being the most operator-toil-heavy of the four.

The leaks — failure modes that no warehouse fully prevents

Even with the right configuration, four failure modes recur across every multi-tenant warehouse I have watched in production.

The first is queue head-of-line blocking — one tenant's badly planned query holds its slots for fifteen minutes and everyone else's small queries pile up behind it. Snowflake's STATEMENT_TIMEOUT mitigates this by killing the bad query, but only at the cost of the tenant having to retry. BigQuery's fair scheduler avoids it via preemption, but preemption itself has a cost — the preempted query has to redo work.

The second is storage I/O leakage — even when compute is isolated, both tenants read from the same S3 bucket, and a heavy scan on one virtual warehouse can throttle the prefix-level S3 quota that another warehouse is reading from. Snowflake hides this behind its cache, but a cold scan still hits the same bucket. BigQuery's Colossus is sharded enough that this is rare, but spikes have been observed during shared-table scans.

The third is metadata-service contention — every query's planning phase hits a shared metadata service (Snowflake's FoundationDB-backed metadata layer, BigQuery's metadata Spanner, Iceberg's catalog). A burst of queries from one tenant can saturate the metadata service even if their compute pools are isolated. This shows up as elevated plan time rather than execution time, and is the hardest to attribute because dashboards usually only show end-to-end latency.

The fourth is billing as accidental load shedding — when a tenant's compute bill spikes 4× in a day and the platform team kills their warehouse to save money, the tenant's queries stop running. This is "isolation" of a sort, but it is reactive and crude. The teams I have seen do this best treat per-tenant cost ceilings as a soft circuit breaker — a Slack alert at 80% of the daily budget, throttling at 100%, hard kill at 200% — rather than as a hard kill at 100%.
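The soft ladder fits in a dozen lines. The thresholds are the ones from the paragraph above; what "throttle" means in practice — shrinking the warehouse, cutting its slot share — depends on what your warehouse's API exposes, so the actions here are just labels.

# budget_breaker.py — soft cost circuit breaker (thresholds from the text)
def budget_action(spend_today: float, daily_budget: float) -> str:
    """Escalating response to overspend instead of a hard kill at 100%."""
    ratio = spend_today / daily_budget
    if ratio >= 2.0:
        return "kill"      # suspend the tenant's warehouse outright
    if ratio >= 1.0:
        return "throttle"  # e.g. shrink warehouse size or slot share
    if ratio >= 0.8:
        return "alert"     # Slack ping to the owning team
    return "ok"

for spend in (50, 85, 120, 230):
    print(spend, budget_action(spend, daily_budget=100))
# 50 ok, 85 alert, 120 throttle, 230 kill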

Common confusions

Going deeper

Workload management as the unifying abstraction

Every multi-tenant warehouse implements a variant of workload management (WLM): a routing rule decides which "class" a query belongs to, the class has a budget (concurrency, memory, slot share), and the scheduler enforces the budget at admission and during execution. Redshift names this most explicitly with its WLM config; Snowflake calls it "Resource Monitors plus Warehouses"; BigQuery calls it "Reservations plus Slot Assignments". The unifying primitive is the queue (or "class") and its declared budget. Reading the AWS Redshift WLM documentation end-to-end is the cheapest way to understand all three, because Redshift is the one that names every knob explicitly.
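Stripped of vendor vocabulary, the primitive is small enough to write down — a sketch, with invented routing rules; vendors differ in what the rule can match on and which fields the budget carries.

# wlm_primitive.py — the unifying shape: route -> class -> budget -> enforce
from dataclasses import dataclass

@dataclass
class QueryClass:
    name: str
    max_concurrency: int  # budget: admission cap
    memory_share: float   # budget: fraction of cluster memory
    running: int = 0

def route(sql: str, user_group: str, classes: dict) -> QueryClass:
    """Routing rule: user group first, SQL pattern second, default last."""
    if user_group in classes:
        return classes[user_group]
    if sql.lstrip().upper().startswith(("MERGE", "COPY")):
        return classes["batch"]
    return classes["default"]

def try_admit(qc: QueryClass) -> bool:
    """Enforcement at admission: the class budget, not per-user limits."""
    if qc.running < qc.max_concurrency:
        qc.running += 1
        return True
    return False

classes = {
    "fraud":   QueryClass("fraud", max_concurrency=3, memory_share=0.6),
    "batch":   QueryClass("batch", max_concurrency=2, memory_share=0.2),
    "default": QueryClass("default", max_concurrency=5, memory_share=0.2),
}
qc = route("MERGE INTO features ...", user_group="etl", classes=classes)
print(qc.name, try_admit(qc))  # batch True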

Why per-tenant memory caps are harder than CPU caps

Capping CPU is a scheduler problem — give each tenant a slice of clock time. Capping memory is harder because memory is allocated up-front and reclaimed only on query end. A query that grabs 40 GB of memory cannot be politely asked to give back 20 GB if a higher-priority query arrives — the only way to reclaim memory is to kill the query. This is why preemption is rare in warehouses (BigQuery is the exception), and why most warehouses use admission-time memory checks plus per-query memory caps rather than dynamic memory rebalancing. The Spark community went through the same realisation a decade ago: the unified memory manager introduced in Spark 1.6 wrestled with this for years before settling on hard per-task memory caps with spill-to-disk as the fallback.
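The asymmetry shows up in the API you could even offer a tenant. A CPU share is a number you can change while the query runs; a memory grant — sketched below with invented names — has exactly two honest transitions: grant at admission, free at death.

# memory_grant.py — why memory reclaim means kill (sketch)
class Pool:
    def __init__(self, free_gb: float):
        self.free_gb = free_gb

class MemoryGrant:
    def __init__(self, pool: Pool, gb: float):
        if gb > pool.free_gb:
            raise RuntimeError("admission-time check failed: queue the query")
        pool.free_gb -= gb  # granted up-front, held to completion
        self.pool, self.gb = pool, gb

    def shrink(self, gb: float):
        # No safe partial reclaim: the hash tables and sort buffers already
        # live in those pages. The only honest reclaim path is kill().
        raise NotImplementedError("cannot reclaim memory from a live query")

    def kill(self):
        self.pool.free_gb += self.gb  # reclaim by killing the query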

Cross-region and cross-cloud tenancy

Indian fintechs serving RBI-regulated payment data hit an extra constraint: data must remain in a specific region (often ap-south-1, the AWS Mumbai region). A "multi-tenant warehouse" in this context cannot share compute across regions, even logically. Snowflake's region-locked accounts and BigQuery's regional locations both prevent cross-region query routing. This is normally seen as a compliance constraint, but it is also a useful isolation primitive — a tenant whose data is locked to ap-south-1 cannot accidentally noise-up a tenant in us-east-1. Cred and Razorpay both report that region-pinning their fraud-team and merchant-team warehouses separately, by accident of compliance, ended up giving them better isolation than any explicit WLM config.

The economics: why per-team warehouses dominate at Indian scale

For an Indian platform team serving 50 product teams, the operational lesson has consistently been: split into per-team or per-domain warehouses (or BigQuery reservations) sooner rather than later. A single shared warehouse with a 25-queue WLM config saves money on paper — better utilisation, fewer idle nodes — but costs an engineer a week per quarter to tune, and any incident on the shared cluster pages the platform team rather than the offending tenant. Per-team warehouses cost 20–40% more in raw compute but eliminate roughly 80% of cross-team incidents and shift cost responsibility to the team owning the warehouse. Razorpay's data platform team published in 2023 that they consolidated to a single Snowflake account with 35 dedicated warehouses (₹3.2 crore/year) rather than one shared XL warehouse (₹2.1 crore/year), explicitly trading 50% more compute spend for the elimination of cross-team contention pages. The math works out once you price the incidents honestly: the extra ₹1.1 crore a year is about ₹27.5 lakh a quarter, and with an engineer-month of incident response at roughly ₹1.5–2 lakh, the trade breaks even at around 16 avoided engineer-months per quarter — a bill that a 50-team shared cluster with multi-day, multi-team contention incidents can plausibly run up (the arithmetic is spelled out below).
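The break-even arithmetic, written out (figures from the paragraph above; 1 crore = 100 lakh):

# breakeven.py — the per-team-warehouse trade-off, in lakh
extra_compute = (3.2 - 2.1) * 100  # ₹1.1 crore/year = 110 lakh/year
per_quarter = extra_compute / 4    # 27.5 lakh/quarter
engineer_month = 1.75              # midpoint of ₹1.5–2 lakh
print(f"{per_quarter:.1f} lakh/quarter ≈ "
      f"{per_quarter / engineer_month:.0f} engineer-months of incident response")
# 27.5 lakh/quarter ≈ 16 engineer-months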

Where this leads next

References