Semantic layer + LLMs: the new interface
In November 2024 a product manager at Swiggy named Karan typed "what was Bengaluru late-night order volume yesterday?" into the company's internal Slack bot and got back "₹3.42 crore across 71,284 orders, mostly biryani and momos, peak at 1:47 a.m." Six months earlier the same question would have meant pinging an analyst, waiting forty minutes, and getting a screenshot from Looker. The bot in 2024 worked because it never wrote a line of SQL. It called a function. The function was a registered metric in the company's semantic layer, and the LLM's job was only to map the English question to the right function call. This chapter is about why that boundary — LLM picks the metric, semantic layer writes the SQL — is the only architecture that produces answers an analyst would sign off on.
Wire an LLM directly to a SQL warehouse and it will hallucinate joins, miss filters, and confuse two metrics that share a name. Wire it to a semantic layer and the LLM is constrained to a finite menu of registered metrics and dimensions; the SQL is generated deterministically by the compiler, not by the model. The error rate goes from ~22% to ~3% because the model is no longer asked to do the thing it is bad at.
The two architectures, side by side
Every "ChatGPT for your data" demo collapses into one of two patterns. The difference is where the SQL is written.
Why the right side is robust and the left side is not: an LLM has seen millions of SQL queries, but it has never seen your warehouse. It does not know that fct_orders includes test rows from staging environments, that gmv excludes refunds at Swiggy but includes them at Zomato, or that city should be normalised through dim_geo before grouping. The semantic layer encodes those facts once, in dbt YAML, where they are version-controlled and PR-reviewed. Asking the LLM to re-derive them from schema names is asking it to guess.
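The registry entry for one of those facts can be sketched in dbt MetricFlow YAML. This is an illustrative fragment, not Razorpay's or Swiggy's actual definitions — the model name, column names, and the is_test flag are assumptions:

```yaml
# Hypothetical sketch of a registered metric in dbt YAML.
# "gmv excludes refunds and test orders" is encoded once, here,
# and never re-derived by the LLM.
semantic_models:
  - name: payments
    model: ref('fct_payments')        # assumed fact table name
    defaults:
      agg_time_dimension: metric_time
    entities:
      - name: payment
        type: primary
        expr: payment_id
    dimensions:
      - name: metric_time
        type: time
        type_params:
          time_granularity: day
      - name: city
        type: categorical
    measures:
      - name: gmv_amount
        agg: sum
        expr: amount_inr               # assumed column name

metrics:
  - name: gmv
    description: "Gross merchandise value in INR; excludes refunds and test orders."
    type: simple
    type_params:
      measure: gmv_amount
    filter: |
      {{ Dimension('payment__is_test') }} = false
```

A PR that changes this filter is reviewed like any other code change, which is the version-control property the chapter leans on.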
What the LLM actually sees
The interface contract is small. The LLM is given a list of metrics — each with a name, a description, the dimensions it can be sliced by, and the filters it accepts. Nothing else. Specifically: no table names, no column names, no SQL. Here is the schema Razorpay's "ask Riya" bot ships to GPT-4 / Claude every turn (abridged from the production version, with noise removed).
# What the LLM sees as its tool schema, every turn.
# Generated from dbt YAML by `dbt sl list-metrics --format=tool-schema`.
METRIC_TOOLS = [
    {
        "name": "query_metric",
        "description": "Query a registered metric by dimensions and filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {
                    "type": "string",
                    "enum": ["gmv", "transactions", "active_merchants",
                             "p99_settlement_lag_minutes", "chargeback_rate"],
                    "description": (
                        "gmv: gross merchandise value in INR, excludes refunds and test orders. "
                        "transactions: count of successful payment intents. "
                        "active_merchants: distinct merchants with >=1 successful payment in window. "
                        "p99_settlement_lag_minutes: 99th-percentile T+0 settlement delay. "
                        "chargeback_rate: chargebacks / total transactions."
                    ),
                },
                "group_by": {
                    "type": "array",
                    "items": {"type": "string",
                              "enum": ["city", "state", "merchant_category",
                                       "payment_method", "metric_time__day",
                                       "metric_time__week", "metric_time__month"]},
                    "description": "Dimensions to slice by; metric_time__* is required for time series.",
                },
                "where": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "merchant_category": {"type": "string"},
                        "payment_method": {"type": "string",
                                           "enum": ["upi", "card", "netbanking", "wallet"]},
                        "time_range": {"type": "string",
                                       "description": "ISO 8601 like 2026-04-01/P7D"},
                    },
                },
            },
            "required": ["metric"],
        },
    },
]

# A real LLM call from Karan's question:
user_msg = "what was Bengaluru late-night order volume yesterday?"
llm_response = openai_chat(model="gpt-4o", tools=METRIC_TOOLS, messages=[
    {"role": "system", "content": "You map data questions to metric queries. Never write SQL."},
    {"role": "user", "content": user_msg},
])

# The LLM emitted (no SQL, just a function call):
# {
#   "tool_calls": [{"name": "query_metric", "arguments": {
#       "metric": "transactions",
#       "group_by": ["payment_method"],
#       "where": {"city": "Bengaluru",
#                 "time_range": "2026-04-23T22:00:00+05:30/PT8H"}
#   }}]
# }

# After the semantic layer runs the compiled SQL:
{
    "metric": "transactions",
    "city": "Bengaluru",
    "window": "2026-04-23 22:00 — 2026-04-24 06:00",
    "rows": [
        {"payment_method": "upi", "transactions": 58_412},
        {"payment_method": "card", "transactions": 9_117},
        {"payment_method": "wallet", "transactions": 2_894},
        {"payment_method": "netbanking", "transactions": 861},
    ],
    "total": 71_284,
}
Walk the call carefully — five mechanisms are doing the work that, in a text-to-SQL world, the LLM would have to do alone.
- The enum on metric is the safety rail. The LLM cannot ask for a metric that does not exist; the OpenAI / Anthropic tool-use APIs will reject the call before it leaves the model. Why this is decisive: hallucinated metric names ("daily_revenue" when the registry only has "gmv") were the single biggest error class in early text-to-SQL bots. An enum collapses that to zero.
- The metric description is the prompt. "gmv excludes refunds and test orders" is a sentence the LLM reads on every turn. When the user asks "what was revenue", the LLM uses that sentence to pick gmv rather than fabricate a revenue column.
- time_range is ISO 8601, not natural language. The LLM converts "yesterday" into 2026-04-23T22:00:00+05:30/PT8H (because "late-night" is interpreted as 10 p.m. to 6 a.m. — a system-prompt convention). The semantic layer never sees the word "yesterday"; it sees a parseable interval.
- The compiled SQL is invisible to the LLM. The actual SQL — joining fct_payments to dim_merchant to dim_geo, applying is_test = false, grouping by payment_method — is generated by MetricFlow from the YAML definition, never by the LLM. Why hiding the SQL matters: if the LLM saw the SQL, it would be tempted to "fix" it (and break it). The Razorpay team learned this the hard way in March 2024 — early versions surfaced the compiled SQL in the chat for debugging, and users started asking the LLM to "tweak the join", which it did, badly.
- The error path is structured. If the LLM emits {"metric": "gmv", "where": {"country": "India"}} and country is not a registered dimension, the API returns ERROR: dimension 'country' is not registered for metric 'gmv'. Available dimensions: city, state, merchant_category, payment_method. The LLM reads that error and retries with state — a self-correction loop that does not require new training.
Why the schema-as-prompt approach scales
The semantic-layer-as-LLM-interface story is not a clever hack — it is what falls out of two structural properties of LLM tool use.
The first structural property is search-space size. A text-to-SQL LLM picks among ~10⁶ syntactically valid queries that touch the warehouse. A text-to-metric LLM picks among (number of metrics) × (number of dimension subsets) × (number of filter shapes) — for Razorpay, roughly 5 × 64 × 32 ≈ 10⁴. Two orders of magnitude smaller search space, and the search space is bounded by what the registry says exists. Why this matters for the model: a smaller, typed search space is what tool-use APIs were designed for. The model is doing classification ("which metric matches this question?") rather than generation ("write me valid SQL"), and classification on a 5-item enum is a problem language models solve at >99% accuracy.
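The arithmetic is worth checking explicitly. The subset and filter counts below are the chapter's rough figures, not exact enumerations of Razorpay's schema:

```python
# Back-of-envelope check of the search-space comparison.
n_metrics = 5
n_dim_subsets = 2 ** 6      # ~64 ways to pick a group_by subset
n_filter_shapes = 32        # rough count of filter combinations

text_to_metric_space = n_metrics * n_dim_subsets * n_filter_shapes
print(text_to_metric_space)                  # 10240 — about 10^4
print(10**6 // text_to_metric_space)         # ~100x smaller than text-to-SQL
```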
The second structural property is deterministic compilation. The same {metric: 'gmv', group_by: ['city']} call always produces the same SQL — the compiler is a pure function of the registry plus the call. So if the LLM emits the right call, the answer is right; if the LLM emits the wrong call, the answer is wrong but consistently so, and the failure mode shows up as "wrong metric chosen" (which a human can audit in the Slack transcript) rather than "subtly wrong SQL" (which nobody will notice until a board meeting).
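"Pure function of the registry plus the call" can be made concrete with a toy compiler. Nothing below resembles MetricFlow's real internals — table and column names are illustrative — but the determinism property is the point:

```python
# Toy compiler: the SQL depends only on (registry, call), never on a model.
REGISTRY = {
    "gmv": {"table": "fct_payments", "expr": "sum(amount_inr)",
            "base_filter": "is_test = false"},
}

def compile_sql(call: dict) -> str:
    spec = REGISTRY[call["metric"]]
    dims = ", ".join(call.get("group_by", []))
    select = (dims + ", " if dims else "") + f"{spec['expr']} as {call['metric']}"
    sql = f"select {select} from {spec['table']} where {spec['base_filter']}"
    return sql + (f" group by {dims}" if dims else "")

call = {"metric": "gmv", "group_by": ["city"]}
assert compile_sql(call) == compile_sql(call)  # same call, same SQL, every run
print(compile_sql(call))
```

Because the mapping is deterministic, a wrong answer is always traceable to a wrong call — which is visible in the Slack transcript — never to nondeterministic SQL generation.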
How Indian companies are wiring this up
Three deployment patterns now dominate the Indian data stack — call them the small, medium, and large recipes.
Small (Series A–B fintechs and D2C brands). dbt MetricFlow + the dbt sl CLI + a Slack bot that proxies to OpenAI. The metric registry lives in the same dbt repo as the transformation layer; PRs that change a metric definition show up in Slack for review. Total operational cost: about ₹15,000/month in inference, plus dbt Cloud Team-tier pricing. This is the stack at companies like Slice, Jar, and Niyo as of mid-2025.
Medium (Series C–D, 50–200 engineers). Cube as the semantic layer (because Cube's pre-aggregations make customer-facing dashboards viable at sub-100ms p99) plus a self-hosted LLM router that fans out to GPT-4o, Claude, or Llama-3.1-70B depending on the question class. The router is necessary because not every question deserves a frontier model — "what was GMV yesterday?" is fine on Llama; "diagnose why GMV dipped 8% on Tuesday" wants Claude. Cred and Cars24 run variations of this stack.
Large (public-listed and unicorns). Custom semantic layer on top of dbt, with a metric API that is itself just a tier in the data stack — same way the warehouse is a tier. Multiple LLM consumers: Slack bot for analysts, embedded chat in the founder dashboard, voice interface for ops folks who run warehouse floors. Razorpay, Zerodha, and PhonePe are in this band; the build-vs-buy economics flip once you have ~200+ metrics and >50 internal LLM users.
What is common across all three: the metric definition lives in dbt YAML, never in the LLM's prompt and never in the consumer code. That single rule is the entire architectural insight of this chapter.
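The router in the medium recipe can be sketched as a question classifier. The keyword heuristic and model labels below are illustrative — this is not Cred's or Cars24's actual routing logic:

```python
# Toy router: cheap model for lookup questions, frontier model for
# diagnostic "why" questions. Markers and labels are invented.
DIAGNOSTIC_MARKERS = ("why", "diagnose", "root cause", "explain")

def pick_model(question: str) -> str:
    q = question.lower()
    if any(marker in q for marker in DIAGNOSTIC_MARKERS):
        return "frontier"   # e.g. GPT-4o or Claude
    return "cheap"          # e.g. Llama-3.1-70B

print(pick_model("what was GMV yesterday?"))                # cheap
print(pick_model("diagnose why GMV dipped 8% on Tuesday"))  # frontier
```

Production routers typically use a small classifier model rather than keywords, but the economics are the same: most traffic is lookups, and lookups do not need a frontier model.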
Common confusions
- "You can replace the semantic layer with prompt engineering — just put all the metric definitions in the system prompt." This works for ~5 metrics and dies at 50. The system prompt grows past the LLM's context window, the model starts confusing metric A's filter with metric B's filter, and there is no version control on what counts as the "current" definition. The semantic layer is the system prompt that scales.
- "Function-calling is a vendor feature; semantic layers existed before." Right that semantic layers predate LLM tool-use APIs (Cube launched 2019, MetricFlow 2022). Wrong that the architecture is unchanged — the LLM-consumer use case is what made wire-protocol-first design (Arrow Flight SQL, GraphQL with strict typing) a hard requirement. A semantic layer that only exposed a REST endpoint with weakly-typed JSON would not fit cleanly into OpenAI's tools schema.
- "Vector search / RAG / embeddings replace this." RAG is for unstructured text (docs, support tickets, code). Metric questions are structured numerical queries against a warehouse. Embedding
fct_paymentsrow-by-row and asking the LLM to summarise the matches is a comically wrong tool for the job — it would return a textual summary, not a SUM. RAG and the semantic layer solve different problems and live in different parts of the stack. - "Text-to-SQL got better with GPT-4 / Claude — the gap will close." The raw text-to-SQL benchmark (Spider, BIRD) numbers did improve from ~50% to ~85% from 2022 to 2025. But Spider is a clean benchmark with curated schemas; production warehouses have hundreds of tables, undocumented columns, soft-deleted rows, and metric definitions that differ across teams. The 22% → 3% gap is not closing — it is widening, because production warehouses are getting messier faster than models are improving.
- "The LLM still writes SQL — it just calls a function that runs SQL." The LLM emits a typed call. The compiler emits SQL. The provenance of the SQL is a deterministic function of the registry and the call, not the model. That distinction is everything: it is auditable, version-controlled, and identical across runs. A human can review the registry; a human cannot reliably review every SQL string an LLM ever emits.
Going deeper
The "tool-use schema as the universe" pattern
The single most underestimated property of OpenAI / Anthropic tool-use APIs is that the model is constrained to the schema, not just guided by it. When metric is declared as enum: ["gmv", "transactions", ...], the inference layer rejects any model output that does not match — the model literally cannot emit metric: "revenue". This is a stronger constraint than fine-tuning or prompt engineering can give you. Why this matters operationally: it means the safety property "the LLM never asks for a metric that does not exist" holds by construction, not by hope. You do not need an evaluation suite to verify it; you need to look at the schema. The hard problem reduces to "is the right metric in the enum?" — a question for the data team, not the ML team.
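The by-construction guarantee reduces to a membership check at the boundary, which can be written down directly from the schema:

```python
# The inference layer's enum check, approximated. The set is copied from
# the tool schema shown earlier in the chapter.
METRIC_ENUM = {"gmv", "transactions", "active_merchants",
               "p99_settlement_lag_minutes", "chargeback_rate"}

def accept_tool_call(args: dict) -> bool:
    # Anything outside the enum is rejected before the call ever
    # reaches the semantic layer — no evaluation suite required.
    return args.get("metric") in METRIC_ENUM

assert not accept_tool_call({"metric": "revenue"})   # hallucinated name
assert accept_tool_call({"metric": "gmv"})
```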
Multi-turn conversations and the state problem
Single-turn questions ("GMV yesterday") are the easy case. Multi-turn ("now break that down by city") is where most deployments break, because the LLM has to remember which metric was queried in the previous turn. The fix is to keep the previous tool call in the conversation context, not just the textual response. Razorpay's bot stores the last query_metric call as a structured field in the conversation history; when the user says "now break that down by city", the LLM sees the previous call as part of its prompt and emits a near-identical call with group_by: ["city"] appended. This is the same trick that Cursor uses for code edits — keep the structured operation in the loop, not just the natural-language summary.
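The trick can be sketched as follows. The shape of the stored field is hypothetical — the point is that the follow-up is a small edit to the last structured call, not a fresh guess from prose:

```python
# Keep the previous tool call as structured data in the conversation,
# so "now break that down by city" becomes a delta on the last call.
conversation = [
    {"role": "user", "content": "GMV yesterday?"},
    {"role": "assistant",
     "last_call": {"metric": "gmv",
                   "where": {"time_range": "2026-04-23/P1D"}}},
    {"role": "user", "content": "now break that down by city"},
]

def break_down(prev_call: dict, dimension: str) -> dict:
    call = dict(prev_call)  # shallow copy: metric and filters carry over
    call["group_by"] = prev_call.get("group_by", []) + [dimension]
    return call

follow_up = break_down(conversation[1]["last_call"], "city")
print(follow_up["group_by"])      # ['city']
assert follow_up["metric"] == "gmv"  # carried over from the previous turn
```

If only the textual response were kept in context, the model would have to re-infer the metric and time range from its own prose summary — which is where most multi-turn drift comes from.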
Why "ask the warehouse anything" is the wrong product
The early product instinct in 2023 was to position the LLM as a universal warehouse interface — "ask anything, get an answer". This consistently produced a worse product than positioning it as a metric-aware assistant. Why? Because "ask anything" implies the LLM can answer questions outside the registered metrics ("why did Tuesday's GMV dip?"), which the metric API cannot answer — that is a diagnostic question requiring root-cause analysis, not a metric query. Mixing the two products in one chat interface confuses users. The 2024–2025 shift was to split them: a metric-query bot that answers numerical questions against the registry, and a separate diagnostic agent (often built on top of the same metric API plus a lineage graph) that handles "why" questions. Cred separated these in mid-2024; user satisfaction (measured weekly) jumped 18 points.
The cache layer is what makes this affordable
Inference at GPT-4o pricing is ~₹2.5 per call (for a typical 2k-token system prompt + 500-token response). At 600 internal users and 5 calls/user/day, that is ₹7,500/day or ~₹2.3 lakh/month — affordable for a Razorpay, painful for a Series A. The cost halves if you cache the metric API response (not the LLM response — different users will phrase the same question differently, but they will resolve to the same query_metric call). Cube and dbt Semantic Layer both cache compiled metric responses by call signature, so the second user to ask "GMV yesterday" pays only for the LLM inference, not for the warehouse query. Why caching the metric call (not the user prompt) is the right cache key: a thousand users will phrase "GMV yesterday" a thousand ways, but they all collapse to the same {metric: 'gmv', time_range: '...yesterday'} call. The cache hit rate on the metric API is consequently 80–90% in production; on raw user prompts it would be <5%.
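Caching by call signature is a one-liner once the call is canonicalised. A minimal sketch of the key derivation (the hashing choice is illustrative):

```python
import hashlib
import json

# Cache by the resolved metric call, not the user's wording: a thousand
# phrasings of "GMV yesterday" collapse to one canonical key.
def cache_key(call: dict) -> str:
    canonical = json.dumps(call, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = cache_key({"metric": "gmv", "where": {"time_range": "2026-04-23/P1D"}})
b = cache_key({"where": {"time_range": "2026-04-23/P1D"}, "metric": "gmv"})
assert a == b  # dict key order does not matter: one cache entry
```

Sorting keys before hashing is what makes semantically identical calls hit the same entry regardless of the order in which the LLM happened to emit the arguments.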
What this looks like at GST scale
GSTN processes filings for ~1.4 crore businesses; an internal LLM agent that answers "how many GSTR-3B filings were submitted yesterday in Maharashtra?" is the same architecture as Razorpay's bot, just on a public-sector data scale. The semantic layer at that scale has to solve two extra problems: (1) PII / regulatory access — the LLM must never see merchant-level PII, only aggregates above k-anonymity thresholds; (2) audit trails — every LLM call must be logged with the resolved metric call and the SQL that ran, because public-sector queries are subject to RTI requests. The standard pattern is to add a row-level security layer on top of the metric API that checks the caller's role before compiling the SQL. The LLM never knows about the RLS rules; it only sees the metrics it is authorised to query (the registry is filtered per-role before being passed to the model).
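Per-role filtering of the registry can be sketched directly. Role names and grants below are invented for illustration — the mechanism is that the filtered list becomes the tool-schema enum before the schema ever reaches the model:

```python
# Filter the registry per caller role; the LLM only ever sees metrics
# the caller is authorised to query.
ALL_METRICS = ["gmv", "transactions", "active_merchants",
               "p99_settlement_lag_minutes", "chargeback_rate"]
ROLE_GRANTS = {
    "ops": {"transactions", "p99_settlement_lag_minutes"},
    "finance": set(ALL_METRICS),
}

def metric_enum_for(role: str) -> list:
    granted = ROLE_GRANTS.get(role, set())
    # Preserve registry order; this list replaces the enum in the schema.
    return [m for m in ALL_METRICS if m in granted]

print(metric_enum_for("ops"))    # ['transactions', 'p99_settlement_lag_minutes']
print(metric_enum_for("audit"))  # [] — unknown role sees nothing
```

Because the restriction is applied to the schema rather than to the model's behaviour, an unauthorised metric is unreachable by construction, which is the property the audit trail depends on.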
Where this leads next
This chapter closes Build 13 — the semantic layer arc. The metric is the contract; the renderer (Looker, Hex, React, LLM agent) is interchangeable. The next build — Build 14, real-time analytics (/wiki/wall-batch-metrics-arent-fresh-enough) — confronts the limit of every batch-warehouse-backed semantic layer: the analyst wants the GMV from five minutes ago, but the warehouse-backed metric is from last night. The architectural answer is to add a low-latency OLAP tier (ClickHouse, Pinot, Druid) underneath the semantic layer, so the same metric definition can resolve to either Snowflake (yesterday's number, batch-fresh) or ClickHouse (this minute's number, stream-fresh) depending on the freshness the consumer asks for.
What carries forward from Build 13 into Build 14, and into the real-time and feature-store builds beyond it, is the wire-protocol-first instinct: every tier in the modern data stack now ships a typed API the next tier consumes, and every typed API can be the substrate for an LLM agent if you want it to be. The semantic layer was the first tier where this clicked. It will not be the last.
References
- dbt Semantic Layer — LLM integration — the MCP (Model Context Protocol) integration that ships dbt's metric registry as a tool-use schema.
- Cube — AI APIs and the semantic layer — Cube's argument for why LLM-as-consumer was the design constraint that shaped Cube's REST API.
- OpenAI — Function calling guide — the canonical tool-use API; the enum constraint mechanism this chapter relies on.
- Anthropic — Tool use with Claude — the parallel tool-use API; equivalent semantics for the metric-call pattern.
- Spider 2.0 / BIRD-SQL benchmark — the benchmark that drives the "text-to-SQL is improving" narrative; useful for understanding why benchmark gains do not translate to production.
- Benn Stancil — The metric tree and the LLM — the 2024 essay that named the "metric questions are not text-to-SQL questions" insight.
- /wiki/the-headless-bi-movement — the previous chapter on why the metric API tier exists at all; this chapter is the LLM-consumer corollary.
- /wiki/lookml-cube-metricflow-the-landscape — vendor landscape for the underlying semantic layer engines.