Query cost attribution
On the third of every month, the platform team at a Bengaluru fintech receives the previous month's Snowflake invoice — ₹47 lakh in April, up from ₹31 lakh in March. The CFO asks the obvious question: which team spent the extra ₹16 lakh, and why? The platform team has the raw QUERY_HISTORY view — every query, its warehouse, its credits used, the user who submitted it — and a tagging convention that the marketing team follows but the fraud team mostly ignores. The data analyst spends three days writing SQL that joins query history to a tag map, fills in the gaps with heuristics, divides by warehouse uptime, and produces a spreadsheet that the fraud lead disputes line by line. The CFO gets the report two weeks late and stops trusting it. Next month it will happen again unless the attribution is built, not improvised.
Query cost attribution is the inverse of multi-tenant scheduling: the scheduler asks "who gets the slot", attribution asks "who paid for the slot they got". The hard part isn't the SQL — it's tagging discipline, idle-warehouse cost spreading, and the fact that a query touches storage, compute, and metadata in different proportions. Done right, you get an auditable per-team invoice and a feedback loop that stops the next ₹16 lakh surprise.
What "cost" actually means inside a warehouse
A warehouse vendor presents one number on the invoice — credits used, slot-seconds consumed, ₹ billed. Underneath, that number is the sum of three very different resources, and any honest attribution model has to allocate each of them separately.
The first is compute time — credits or slot-seconds spent running queries. This is the dollar majority on Snowflake (typically 70–85% of the bill) and almost the entire bill on BigQuery on-demand pricing. It is also the easiest to attribute, because every query has a warehouse, a duration, and either a known credit rate or a known slot count.
The second is storage — the bytes sitting in S3/GCS, the snapshots Iceberg keeps for time travel, the backups Snowflake retains for Fail-safe. Storage is usually 5–15% of the bill but it is the trickiest to attribute because tables live in databases shared by multiple teams; one team writes, ten teams read, and storage doesn't care who is reading.
The third is metadata and serverless services — Snowflake's Cloud Services layer, BigQuery's metadata operations, Iceberg catalog calls. This is normally 5–10% of the bill, billed as "background" cost that no single query "caused", and it is where most attribution models give up and just pro-rate.
The trap most attribution v1s fall into is treating the warehouse bill as one number and dividing it by some proxy — share of queries, share of users, share of departments — and calling it done. The proxy will always be wrong on at least one of the three surfaces, and the team that gets over-billed will rightly object. Why three different rules: compute scales with what you ran, storage scales with what you keep, metadata scales with how often the planner is invoked. They are governed by different numbers in the warehouse's billing engine, and the only attribution that survives a CFO review is one that follows the same split.
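To make the three-rule split concrete, here is one bill divided three ways in a dozen lines (all numbers invented for illustration):

```python
# One bill, three surfaces, three allocation keys. All numbers invented.
bill = {"compute": 38_00_000.0, "storage": 5_00_000.0, "metadata": 4_00_000.0}

# Allocation keys per team: what you ran, what you keep, share of compute spend.
compute_spend = {"fraud": 20_00_000.0, "marketing": 18_00_000.0}  # ₹ of queries run
bytes_kept = {"fraud": 12.4e12, "marketing": 3.1e12}              # bytes stored

def pro_rate(total, key):
    """Split `total` across teams in proportion to their value in `key`."""
    s = sum(key.values())
    return {team: total * v / s for team, v in key.items()}

invoice = {team: 0.0 for team in compute_spend}
for surface, key in [("compute", compute_spend),
                     ("storage", bytes_kept),
                     ("metadata", compute_spend)]:
    for team, share in pro_rate(bill[surface], key).items():
        invoice[team] += share
# Each surface followed its own rule, and the invoice still sums to the bill.
```

Each surface gets its own key: compute by spend, storage by bytes kept, metadata pro-rated by compute share, and the per-team totals reconcile exactly to the vendor's one number.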
Tags, tags, tags — the boring problem that decides everything
Every attribution model collapses without tags. The single highest-leverage technical decision your platform team will make is enforcing a tag at submission time. Snowflake exposes this as session parameters and QUERY_TAG; BigQuery uses labels; Databricks uses usage_metadata.custom_tags; Redshift uses query_group. The mechanism varies; the discipline is the same: every query enters the warehouse with a team, a pipeline, and an env, or it does not enter the warehouse.
The right way to enforce this is at the connection layer, not at the query layer. If your dbt project sets query_tag in a model, a developer can override it; if your Airflow operator stamps the tag from the DAG's team field, every task in the DAG inherits it without thinking. The same applies to your BI tool (Looker can pass user attributes to BigQuery as labels), your notebook templates (a corporate Jupyter image that wraps snowflake.connector.connect and injects QUERY_TAG = json.dumps({"team": os.getenv("OWNER_TEAM"), "tool": "jupyter"})), and your ad-hoc SQL UI (Hex, Mode, Querybook all support per-workspace defaults).
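A minimal sketch of that connection-layer enforcement, assuming snowflake-connector-python (whose connect() accepts a session_parameters dict) and two hypothetical environment variables, OWNER_TEAM and PIPELINE_NAME, that a corporate image would set:

```python
import json
import os

def build_query_tag(tool):
    """Compose the mandatory tag. OWNER_TEAM / PIPELINE_NAME are hypothetical
    env vars that a corporate image or CI runner would set; adjust to your setup."""
    return json.dumps({"team": os.getenv("OWNER_TEAM", "untagged"),
                       "pipeline": os.getenv("PIPELINE_NAME", "ad-hoc"),
                       "tool": tool})

def tagged_connect(tool="jupyter", **kwargs):
    """Drop-in replacement for snowflake.connector.connect that stamps the tag
    at the session level, so no individual query can forget it."""
    import snowflake.connector  # lazy import; not needed to build the tag
    params = kwargs.setdefault("session_parameters", {})
    params["QUERY_TAG"] = build_query_tag(tool)
    return snowflake.connector.connect(**kwargs)
```

Because the tag rides in session_parameters, every query on the connection inherits it; a developer would have to deliberately issue ALTER SESSION to escape, which is auditable.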
The teams I have watched do this well treat untagged queries as a bug and have a dashboard that shows % of compute spend on untagged queries over time. A healthy platform sits at under 2% untagged within six months; a struggling one stays above 15% indefinitely because the platform team never made tagging the path of least resistance.
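The dashboard metric itself is one small function. A sketch, assuming each query row already carries the waterfall's team output and a ₹ cost:

```python
def untagged_spend_pct(attributed_queries):
    """% of spend carrying no usable team tag: the number to chart weekly.
    Each row needs 'cost' (₹) and 'team' (the output of the tag waterfall)."""
    total = sum(q["cost"] for q in attributed_queries) or 1.0
    untagged = sum(q["cost"] for q in attributed_queries
                   if q["team"] == "untagged")
    return 100.0 * untagged / total
```

Charting this by week is what turns "tag your queries" from a policy document into a trend line a platform lead can be held to.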
Building the attribution: a runnable Python prototype
The attribution itself is straightforward arithmetic over the warehouse's query log; what matters is doing the arithmetic without losing money in the gaps. The script below takes a Snowflake-shaped query history, a tag map (filling in known untagged queries via heuristics), and a list of warehouse uptime windows, and produces the per-team invoice.
```python
# attribute.py — minimal multi-tenant warehouse cost attribution
import csv
import json
from collections import defaultdict
from datetime import datetime

# 1. Load query history (one row per query the warehouse executed last month).
#    Columns: query_id, start_ts, end_ts, warehouse, user, role, query_tag,
#    credits_used, bytes_scanned.
queries = list(csv.DictReader(open("query_history_april.csv")))

# 2. Load the team-membership maps: user -> team, with a role -> team fallback.
team_of_user = json.load(open("user_team_map.json"))
team_of_role = json.load(open("role_team_map.json"))

# 3. Warehouse credit prices (₹/credit).
PRICE = {"WH_FRAUD_XL": 290.0, "WH_MARKETING_S": 230.0, "WH_GROWTH_M": 260.0,
         "WH_ML_L": 280.0, "WH_FINANCE_S": 230.0, "WH_SHARED_M": 260.0}

def parse_tag(t):
    try:
        return json.loads(t) if t else {}
    except (ValueError, TypeError):
        return {}

def team_for(q):
    tag = parse_tag(q.get("query_tag"))
    if tag.get("team"):
        return tag["team"]
    if q["user"] in team_of_user:
        return team_of_user[q["user"]]
    if q["role"] in team_of_role:
        return team_of_role[q["role"]]
    return "untagged"

def runtime_s(q):
    return (datetime.fromisoformat(q["end_ts"])
            - datetime.fromisoformat(q["start_ts"])).total_seconds()

# 4. Sum compute spend per team: credits_used × ₹/credit.
team_cost = defaultdict(float)
for q in queries:
    team_cost[team_for(q)] += (float(q["credits_used"])
                               * PRICE.get(q["warehouse"], 250.0))

# 5. Attribute idle warehouse time. WH_SHARED_M was up 720 hours but only ran
#    queries for 410 hours; the 310-hour idle gap costs ₹2.6 lakh and must
#    land on the teams that *kept it warm*. Spread by their share of runtime.
IDLE_COSTS = {"WH_SHARED_M": 260000.0}
for wh, idle_cost in IDLE_COSTS.items():
    rt_by_team = defaultdict(float)
    for q in queries:
        if q["warehouse"] == wh:
            rt_by_team[team_for(q)] += runtime_s(q)
    total_rt = sum(rt_by_team.values()) or 1.0
    for team, rt in rt_by_team.items():
        team_cost[team] += idle_cost * (rt / total_rt)

# 6. Storage: read TABLE_STORAGE_METRICS, look up the table's owner team,
#    charge bytes × ₹/TB/month. Pseudo-data for brevity.
table_storage = {"fraud.card_tx": (12.4e12, "fraud"),
                 "marketing.events": (3.1e12, "marketing"),
                 "growth.cohorts": (1.8e12, "growth"),
                 "ml.features": (2.6e12, "ml-platform")}
RATE_PER_TB = 1820.0  # ₹/TB/month
for tbl, (b, owner) in table_storage.items():
    team_cost[owner] += b / 1e12 * RATE_PER_TB

# 7. Metadata + Cloud Services: pro-rate by each team's share of spend so far.
#    Snapshot the totals first; adding while summing would skew the shares.
META_TOTAL = 290000.0
base = dict(team_cost)
base_total = sum(base.values())
for team, c in base.items():
    team_cost[team] += META_TOTAL * (c / base_total)

# 8. Print the invoice, with Indian digit grouping (12,34,567).
def inr(n):
    s = f"{n:.0f}"
    head, tail = s[:-3], s[-3:]
    parts = []
    while len(head) > 2:
        parts.insert(0, head[-2:])
        head = head[:-2]
    if head:
        parts.insert(0, head)
    return ",".join(parts + [tail])

total = sum(team_cost.values())
for t, c in sorted(team_cost.items(), key=lambda x: -x[1]):
    print(f"{t:14s} ₹{inr(c):>14s} ({100 * c / total:5.1f}%)")
print(f"{'TOTAL':14s} ₹{inr(total):>14s} (100.0%)")
```

Sample run on April query history:

```
fraud          ₹     19,21,440 ( 40.9%)
marketing      ₹     11,76,200 ( 25.0%)
growth         ₹      7,38,510 ( 15.7%)
ml-platform    ₹      4,52,800 (  9.6%)
finance        ₹      2,11,330 (  4.5%)
untagged       ₹      2,00,720 (  4.3%)
TOTAL          ₹     47,01,000 (100.0%)
```
Walk through the four parts that decide whether this attribution holds up in a CFO review.

team_for(q) is the tag waterfall: try the explicit tag first, then map by user, then by role, then admit defeat with "untagged". Why a waterfall and not strict tag-only: in any real warehouse, 5–20% of queries are untagged because of dashboards built before the tagging policy, ad-hoc analyst sessions, or external tools (Tableau extracts, Looker scheduled reports) that don't propagate tags. A user/role fallback recovers most of those without lying — you map the actual identity, not a heuristic guess.

The idle-warehouse spread (step 5) is the part most attribution v1s skip. A warehouse that auto-suspends after 5 minutes still costs you for the 5 minutes it waited; a shared warehouse that stayed up 720 hours but only ran queries for 410 hours has 310 hours of idle that nobody "ran" but somebody must pay for. The right rule is to charge the idle to the teams that kept the warehouse warm, i.e. spread it by their share of active runtime. Why share of runtime and not share of queries: a team running 1 query that took 30 minutes contributed more "warming pressure" than a team running 100 queries that finished in 100 ms each. Runtime captures the shape of demand the warehouse was sized for; query count does not.

The storage step charges the owner team, not the reader team. A table fraud writes is fraud's storage cost even if marketing reads it 50 times a day; the reader's compute cost is already captured in step 4. Why owner-pays and not pro-rated by reads: the storage bill exists because fraud chose to keep the bytes, whether anyone reads them or not — keeping a 12 TB table for a year costs the same if 0 queries scan it or 10 million do. Charging readers would make storage costs disappear when nobody queries, which is exactly backwards.
The metadata pro-rate (step 7) is the honest give-up: nobody can attribute an INFORMATION_SCHEMA.TABLES query the planner runs internally, so spread the metadata cost by share of compute spend and document the rule. The output respects the three-surface split — and the untagged row at 4.3% is the platform team's accountability target.
The toy is single-warehouse and single-month. Production attribution adds: per-day rollup so you can chart cost trend, per-pipeline attribution (the pipeline tag inside query_tag lets you say "fraud's daily_features pipeline cost ₹4.2 lakh in April"), shared-table attribution (option to split storage by owner+reader weights for analytics platform tables), and reservations (BigQuery reservations are billed against the project, not the query, so the attribution unit there is reservation-share rather than per-query credits).
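The per-pipeline rollup is a small extension of the same arithmetic. A sketch, assuming the query_tag JSON carries team and pipeline keys as described above:

```python
import json
from collections import defaultdict

def pipeline_rollup(queries, price):
    """Roll per-query spend up to (team, pipeline) via the JSON query_tag.
    `queries` has the same row shape as the script above; `price` is ₹/credit
    per warehouse, with 250.0 assumed as the fallback rate."""
    spend = defaultdict(float)
    for q in queries:
        try:
            tag = json.loads(q.get("query_tag") or "{}")
        except ValueError:
            tag = {}
        key = (tag.get("team", "untagged"), tag.get("pipeline", "ad-hoc"))
        spend[key] += float(q["credits_used"]) * price.get(q["warehouse"], 250.0)
    return dict(spend)
```

Grouping by a (team, pipeline) key instead of team alone is the entire change; adding a per-day dimension works the same way, with the query's start date joined into the key.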
How real warehouses expose the data you need
Each warehouse vendor exposes the same three numbers — query history, storage, metadata — but the column names and update lag differ enough that the SQL you write for one rarely ports to another.
Snowflake gives you SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY (every query, every credit used; updated within 45 minutes) and SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY (per-warehouse credit consumption including idle; updated hourly). Storage is TABLE_STORAGE_METRICS. Metadata is METERING_DAILY_HISTORY with service_type = 'CLOUD_SERVICES'. The catch: the ACCOUNT_USAGE views are eventually consistent and lag the live warehouse by up to 3 hours, so anything claiming "real-time cost dashboards on Snowflake" is actually reading the fresher INFORMATION_SCHEMA table functions, which cost your own compute to query.
BigQuery gives you INFORMATION_SCHEMA.JOBS_BY_PROJECT (every job, every byte scanned, every slot-millisecond), labels via labels column (a JSON array of {key, value} pairs), and storage via INFORMATION_SCHEMA.TABLE_STORAGE. The on-demand pricing model — bytes scanned × ₹/TB — makes attribution for that pricing tier dead simple: each query has a definite cost the moment it finishes. The reservation pricing model — slot-seconds × ₹/slot-second — makes it harder, because slot consumption is a property of the project's reservation, not the individual query. Most BQ shops attribute reservations the way the script above attributes idle warehouses: by share of slot-seconds within a day.
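For the on-demand tier, the per-query cost really is one multiplication. A sketch; the default ₹/TiB rate below is an invented placeholder for your contract price:

```python
def bq_on_demand_cost(total_bytes_billed, rupees_per_tib=500.0):
    """Cost of one finished on-demand job: bytes billed × rate per TiB.
    total_bytes_billed comes from the job's row in JOBS_BY_PROJECT and
    already includes BigQuery's per-query rounding, so no adjustment here.
    The default rate is an invented placeholder, not a Google price."""
    return total_bytes_billed / 2**40 * rupees_per_tib
```

Note the input is bytes *billed*, not bytes scanned; the two differ once BigQuery's minimum-billing rounding applies, and billed is the one the invoice follows.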
Databricks exposes system.billing.usage (one row per usage record, with usage_metadata.cluster_id, usage_metadata.job_id, usage_metadata.custom_tags, etc.) and system.access.audit for query-level events. The custom_tags propagate from the cluster's tag config or from the SQL warehouse's tags field; the attribution model is the cluster, not the query, so a single SQL warehouse with five teams sharing it requires per-query tagging discipline.
Redshift exposes STL_QUERY (every query) and SVL_QUERY_METRICS_SUMMARY (per-query CPU, memory, IO). Cost attribution on Redshift is harder because the cluster is a fixed-price node-hour bill; "share of bill" is genuinely just "share of compute time", with no per-query price. Most Redshift shops report cost per workload manager queue rather than per query, because the queue is the unit the WLM scheduler actually budgets.
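A queue-level split for a fixed-price cluster can be sketched as follows, assuming you have already derived per-query execution seconds and queue names (for example, from STL_QUERY timestamps):

```python
from collections import defaultdict

def queue_share_invoice(query_rows, monthly_bill):
    """Fixed-price cluster: split the bill by each WLM queue's share of total
    execution time. Rows need 'queue' and 'exec_seconds' (derivable from
    STL_QUERY's start/end timestamps)."""
    secs = defaultdict(float)
    for r in query_rows:
        secs[r["queue"]] += r["exec_seconds"]
    total = sum(secs.values()) or 1.0
    return {queue: monthly_bill * s / total for queue, s in secs.items()}
```

This is the same share-of-runtime rule the idle-warehouse step used, applied to the whole bill because on Redshift there is no per-query price to anchor on.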
The lesson across vendors is that the underlying log (one row per query, with team-identifying fields and a usage measure) is universal; the SQL on top is per-vendor; and the tagging discipline that produced the team-identifying fields is universal again. Get tagging right, and porting attribution from Snowflake to BigQuery is a week's work. Get tagging wrong, and no SQL will save you.
The four leaks attribution always has
Even with discipline, four leaks recur across every cost-attribution implementation I have watched.
The first is untagged queries. The 2–5% of queries that escape the tagging waterfall add up to ₹1–2 lakh on a ₹50 lakh bill, which is enough to start a turf war. The right fix is structural: a Slack notification that pings the user every time they submit an untagged query, plus a weekly leaderboard of "top 5 users with highest untagged spend" sent to engineering managers. Public accountability moves the number faster than dashboards do.
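The leaderboard is a few lines over the attributed history. A sketch, assuming the same row shape as the script above plus a per-query ₹ cost:

```python
from collections import defaultdict

def untagged_leaderboard(attributed_queries, top_n=5):
    """Top users by untagged spend: the weekly list for engineering managers.
    Rows need 'user', 'team' (the waterfall's output) and 'cost' (₹)."""
    spend = defaultdict(float)
    for q in attributed_queries:
        if q["team"] == "untagged":
            spend[q["user"]] += q["cost"]
    return sorted(spend.items(), key=lambda kv: -kv[1])[:top_n]
```

Keying on the user rather than the (unknown) team is the point: the user is the one identity an untagged query always carries, and the one a manager can act on.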
The second is shared tables. The growth team's dim_user table is read by 300 queries a day across 9 teams; the storage cost lives on growth, but the value lives on the readers. Some platform teams charge shared-table storage proportionally to read-count (a "transfer-pricing" approach that approximates the value flow); others leave it on the owner and rely on chargeback at the budgeting layer. Both work; the wrong answer is to surprise the growth team with a 12 TB storage bill they didn't budget for because they were not the ones who needed the table kept.
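The read-proportional (transfer-pricing) variant can be sketched like this; the owner_floor knob is an invented policy parameter, not a standard:

```python
def split_storage_by_reads(storage_cost, reads_by_team, owner, owner_floor=0.5):
    """Transfer-pricing variant: the owner keeps a floor share (they chose to
    keep the bytes); the rest is split by each reader's share of reads.
    `owner_floor` is a policy knob invented here for illustration."""
    bill = {owner: storage_cost * owner_floor}
    total_reads = sum(reads_by_team.values()) or 1
    for team, reads in reads_by_team.items():
        share = storage_cost * (1.0 - owner_floor) * reads / total_reads
        bill[team] = bill.get(team, 0.0) + share
    return bill
```

The floor keeps the owner-pays logic intact (storage exists because the owner kept the bytes) while letting heavy readers carry part of the cost of the table's existence.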
The third is idle warehouse cost on auto-suspend tails. Snowflake bills a minimum of 60 seconds each time a warehouse resumes, so a warehouse that cycles 200 times a day accrues at least 200 minutes of "warm but idle" billed time even with an aggressive auto-suspend. Attributing this to the team that triggered the resume is the honest rule, but most attribution scripts attribute by share of active runtime, which under-charges teams that submit one query at a time and over-charges teams that batch their queries. The fix is to tighten auto-suspend on interactive warehouses toward the 60-second floor so the leak is small enough that the attribution rule doesn't matter.
The fourth is cloud-services credits on Snowflake. Snowflake bills for the compute its planner uses to plan queries; complex queries use more cloud-services credits than simple ones, and a team running thousands of INFORMATION_SCHEMA queries (often a BI tool's metadata refresh) can rack up cloud-services credits without any visible "compute" usage. The safe attribution is to charge cloud-services to the same team that owns the queries' regular compute (which the script does), but the deeply right attribution is to track the cloud-services credits per-query — and Snowflake exposes this only at hourly granularity, so the scripts have to bucket queries to hours and re-divide.
A practical heuristic: when the four leaks together account for less than 5% of the bill, ship the attribution and move on; when they account for more than 10%, the leaks themselves are the bug worth fixing before the next attribution run. The Bengaluru fintech from the lead opened April at 12% leak (untagged + idle dominant) and closed June at 3% — the difference was three weeks of tag-discipline work, not three months of attribution-engineering work.
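The heuristic as code; the thresholds come from the text, and the component numbers are whatever your four leak measurements produce:

```python
def leak_pct(bill_total, untagged, idle_unattributed, shared_disputed,
             cloud_services_unassigned):
    """The four leaks as a % of the bill: under 5% means ship the attribution,
    over 10% means fix the leaks first (thresholds from the text above)."""
    leak = (untagged + idle_unattributed + shared_disputed
            + cloud_services_unassigned)
    return 100.0 * leak / bill_total
```

On the lead example's April numbers (₹2 lakh untagged and ₹2.6 lakh idle against a ₹47 lakh bill, plus smaller invented shared-table and cloud-services amounts), this lands around 12%, on the fix-first side of the threshold.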
Common confusions
- "Per-team budgets fix the attribution problem." They give you a cap, not an audit. A team that hits 95% of budget on day 28 still owes you an attribution that explains where the 95% went. Without query-level attribution, the team's only response to a budget alert is "I don't know which queries cost that; just give me more budget".
- "Bytes scanned and credits used measure the same thing." They don't. A query that scans 1 TB but does heavy hashing on it can cost more credits than a query that scans 5 TB but does a simple aggregation. Bytes-scanned is a proxy for cost on BigQuery on-demand only; on Snowflake, BigQuery slots, and Databricks, it can be off by a factor of 10.
- "Attribution is a SQL problem." The SQL is 200 lines; the discipline that makes the SQL work is 6 months of organisational change. Tag adoption, role mapping, table ownership conventions — these are the actual deliverables. The SQL is the easy part.
- "Untagged queries should be charged to the platform team." Then the platform team owns a cost it cannot control; teams will silently drop tagging because the platform absorbs the cost. The right rule is to charge untagged queries back to the user's default team (from HR data) and only fall back to "platform" if the user is a service account no team owns.
- "Cost attribution is just FinOps." FinOps is the discipline of running cost optimisation as an organisation; attribution is one (foundational) tool in that toolkit. You can do attribution without FinOps governance, but FinOps without attribution is theatre — the team budgets land on whoever shouts loudest, not on whoever spent the money.
- "Real-time attribution is better than batch." It is more expensive and rarely useful. Batch attribution refreshed daily catches a runaway query within 24 hours, which is fast enough for budget alarms; real-time attribution requires querying live INFORMATION_SCHEMA views on Snowflake or streaming JOBS_BY_PROJECT on BigQuery, both of which are expensive enough to show up on the very attribution report you're trying to produce.
Going deeper
From attribution to showback to chargeback
Three escalating levels: showback (every team can see their cost on a dashboard, no money changes hands), chargeback (the warehouse bill is split and each team's cost-centre is debited internally), full chargeback with budget enforcement (a team that exceeds budget has their warehouse throttled or paused). The progression matters because the SQL is the same at every level — what differs is the organisational machinery on top. Teams new to FinOps usually start at showback, move to chargeback after the first budget-blowout incident, and only adopt enforcement after the second. Razorpay published in 2023 that their data platform sits at chargeback-without-enforcement: every team sees their bill, the cost-centre debit is automatic, but the platform team will not pause a fraud team's query at month-end because the cost of a fraud miss is bigger than the cost of overage.
Pipeline-level attribution and the dbt model graph
Once the team tag works, the next question is "which pipeline cost the team ₹4 lakh?" The attribution unit is the dbt model (or Airflow task, or Dagster op). Snowflake's query_tag can carry {"team":"fraud","pipeline":"daily_features","model":"fct_card_tx"}, and the attribution roll-up groups by pipeline and model. This is what lets a fraud lead say "the new feature velocity_window_24h doubled our compute cost and is responsible for ₹2 lakh of the increase" — which is the granularity that drives optimisation decisions. Without per-model attribution, the fraud lead can only say "fraud cost more this month" and has no lever to pull.
Marketplace shared data and the "who pays for the join" problem
When the growth team builds dim_user and the marketing team joins to it 50 times a day, the marketing query scans the dim table — so the storage I/O cost is on the growth bucket, but the compute cost is on the marketing warehouse. The simple rule above charges marketing for the compute and growth for the storage, which is right. The complication arises with cross-account data shares (Snowflake's secure data sharing, BigQuery's authorized views, Iceberg's REST catalog access): the readers' queries don't even hit the writers' warehouse, but they do hit the writers' storage I/O quota. Splitting that bill requires either bilateral agreements or a centralised data marketplace billing layer; most organisations don't bother for internal sharing and only formalise it for cross-org sharing.
What attribution looks like at Indian payments scale
At PhonePe (10 billion+ UPI transactions/year, ₹50+ lakh/month BigQuery spend reportedly), attribution is built into the platform: every dbt model has a meta.team field, every Airflow DAG has a team tag, and the cost dashboard updates daily from INFORMATION_SCHEMA.JOBS_BY_PROJECT. The platform team publishes the previous day's per-team cost by 10 a.m., and any team that sees an unexpected ₹50,000 spike has 24 hours to investigate before the next CFO review cycle starts. Razorpay runs a similar discipline on Snowflake. The pattern is the same across both: tagging at the connection layer, attribution as a daily batch, escalation in 24-hour cycles, and a "cost owner" rotation on each team that handles the investigation. The discipline is unsexy and works.
Where this leads next
- /wiki/multi-tenant-warehouses-isolation-and-noisy-neighbours — the previous chapter. Attribution is the inverse problem of isolation.
- /wiki/compute-storage-separation-for-cost-control — the architectural decoupling that makes per-team attribution affordable.
- /wiki/query-acceleration-and-result-caching — the next optimisation lever once you know who is spending what.
- /wiki/data-mesh-decentralization-as-a-governance-pattern — pushing attribution down to data products as the unit of ownership.
References
- Snowflake, "Account Usage views" — QUERY_HISTORY, WAREHOUSE_METERING_HISTORY, TABLE_STORAGE_METRICS.
- Google Cloud, "INFORMATION_SCHEMA.JOBS_BY_PROJECT" — the canonical BigQuery cost source.
- Databricks, "System tables: billing usage" — the system.billing.usage table and tag propagation rules.
- FinOps Foundation, "FinOps Framework" — showback/chargeback definitions and maturity model.
- Densmore, "Snowflake cost attribution" — practitioner walkthrough of the Snowflake tag waterfall in production.
- Razorpay Engineering, "Scaling our data platform" — public talks describing the per-team chargeback rollout.
- /wiki/multi-tenant-warehouses-isolation-and-noisy-neighbours — internal previous chapter, the scheduling counterpart of attribution.
- /wiki/data-contracts-the-producer-consumer-boundary — internal chapter; tagging is itself a data contract between producers (query submitters) and consumers (the attribution pipeline).