Semantic layer + LLMs: the new interface

In November 2024 a product manager at Swiggy named Karan typed "what was Bengaluru late-night order volume yesterday?" into the company's internal Slack bot and got back "₹3.42 crore across 71,284 orders, mostly biryani and momos, peak at 1:47 a.m." Six months earlier the same question would have meant pinging an analyst, waiting forty minutes, and getting a screenshot from Looker. The bot in 2024 worked because it never wrote a line of SQL. It called a function. The function was a registered metric in the company's semantic layer, and the LLM's job was only to map the English question to the right function call. This chapter is about why that boundary — LLM picks the metric, semantic layer writes the SQL — is the only architecture that produces answers an analyst would sign off on.

Wire an LLM directly to a SQL warehouse and it will hallucinate joins, miss filters, and confuse two metrics that share a name. Wire it to a semantic layer and the LLM is constrained to a finite menu of registered metrics and dimensions; the SQL is generated deterministically by the compiler, not by the model. In Razorpay's internal evaluation the error rate fell from ~22% to ~3%, because the model is no longer asked to do the thing it is bad at.

The two architectures, side by side

Every "ChatGPT for your data" demo collapses into one of two patterns. The difference is where the SQL is written.

[Figure: Text-to-SQL versus text-to-metric-call. Two side-by-side architectures. Left, text-to-SQL (the trap): the LLM reads the full warehouse schema and emits a raw SQL string, e.g. for "GMV in Bengaluru last week?" it writes SELECT SUM(amount) FROM orders WHERE city='Bangalore' AND ...; failure modes include joining fct_orders to dim_user the wrong way, forgetting is_test_order = false, and counting refunds as positive GMV. Snowflake runs whatever string ships, and nobody knows the answer is wrong. Right, text-to-metric-call: the LLM reads only the semantic-layer schema (a list of metrics and dimensions with descriptions) and emits a typed JSON call such as {metric: 'gmv', group_by: ['city'], where: {city: 'Bengaluru', week: 'last'}}; the semantic-layer compiler reads the gmv definition from dbt YAML and emits deterministic SQL, with joins, filters, and aggregations all from the registry. Snowflake runs the compiled SQL and the answer matches Looker's. The metric definition lives in dbt YAML, not in the LLM's prompt.]
Left: the LLM owns the SQL string, which means the LLM owns every join, filter, and metric definition — it will get one of those wrong on a hard question. Right: the LLM owns the intent (metric + dimensions + filters), the semantic layer owns the SQL. The model is asked to do mapping, not authoring; mapping is what models are good at.

Why the right side is robust and the left side is not: an LLM has seen millions of SQL queries, but it has never seen your warehouse. It does not know that fct_orders includes test rows from staging environments, that gmv excludes refunds at Swiggy but includes them at Zomato, or that city should be normalised through dim_geo before grouping. The semantic layer encodes those facts once, in dbt YAML, where they are version-controlled and PR-reviewed. Asking the LLM to re-derive them from schema names is asking it to guess.
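That "encode once, compile everywhere" claim can be made concrete with a small sketch. The registry dict below is a hypothetical Python stand-in for the dbt YAML, and compile_metric_sql is an illustrative compiler, not dbt's: the point is that the business rules (exclude test rows, exclude refunds) live in the registry, so every consumer gets the same SQL.

```python
# Illustrative only: a Python stand-in for a dbt-YAML metric registry.
# Real semantic layers (MetricFlow, Cube) have far richer specs; the point
# is that joins, filters, and aggregations come from the registry, never
# from the LLM.
REGISTRY = {
    "gmv": {
        "table": "fct_orders",
        "expr": "SUM(amount)",
        # Business rules encoded once, version-controlled, PR-reviewed:
        "base_filters": ["is_test_order = false", "is_refund = false"],
    },
}

def compile_metric_sql(metric: str, group_by=(), where=()):
    """Pure function of (registry, call) -> SQL string."""
    spec = REGISTRY[metric]  # unknown metric -> KeyError, never guessed SQL
    filters = list(spec["base_filters"]) + list(where)
    cols = ", ".join(group_by)
    sql = f"SELECT {cols + ', ' if cols else ''}{spec['expr']} AS {metric}"
    sql += f" FROM {spec['table']}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    if group_by:
        sql += f" GROUP BY {cols}"
    return sql

print(compile_metric_sql("gmv", group_by=["city"],
                         where=["city = 'Bengaluru'"]))
```

An LLM asked to re-derive is_test_order = false from schema names will sometimes forget it; a compiler reading the registry cannot.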

What the LLM actually sees

The interface contract is small. The LLM is given a list of metrics — each with a name, a description, the dimensions it can be sliced by, and the filters it accepts. Nothing else. Specifically: no table names, no column names, no SQL. Here is the schema Razorpay's "ask Riya" bot ships to GPT-4 / Claude every turn (abridged from the production version, with noise removed).

# What the LLM sees as its tool schema, every turn.
# Generated from dbt YAML by `dbt sl list-metrics --format=tool-schema`.

METRIC_TOOLS = [
    {
        "name": "query_metric",
        "description": "Query a registered metric by dimensions and filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "metric": {
                    "type": "string",
                    "enum": ["gmv", "transactions", "active_merchants",
                             "p99_settlement_lag_minutes", "chargeback_rate"],
                    "description": (
                        "gmv: gross merchandise value in INR, excludes refunds and test orders. "
                        "transactions: count of successful payment intents. "
                        "active_merchants: distinct merchants with >=1 successful payment in window. "
                        "p99_settlement_lag_minutes: 99th-percentile T+0 settlement delay. "
                        "chargeback_rate: chargebacks / total transactions."
                    ),
                },
                "group_by": {
                    "type": "array",
                    "items": {"type": "string",
                              "enum": ["city", "state", "merchant_category",
                                       "payment_method", "metric_time__day",
                                       "metric_time__week", "metric_time__month"]},
                    "description": "Dimensions to slice by; metric_time__* is required for time series.",
                },
                "where": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "merchant_category": {"type": "string"},
                        "payment_method": {"type": "string",
                                           "enum": ["upi", "card", "netbanking", "wallet"]},
                        "time_range": {"type": "string",
                                       "description": "ISO 8601 like 2026-04-01/P7D"},
                    },
                },
            },
            "required": ["metric"],
        },
    },
]

# An example call, reusing Karan's question from the opening:
user_msg = "what was Bengaluru late-night order volume yesterday?"
llm_response = openai_chat(model="gpt-4o", tools=METRIC_TOOLS, messages=[
    {"role": "system", "content": "You map data questions to metric queries. Never write SQL."},
    {"role": "user",   "content": user_msg},
])

# The LLM emitted (no SQL, just a function call):
# {
#   "tool_calls": [{"name": "query_metric", "arguments": {
#       "metric": "transactions",
#       "group_by": ["payment_method"],
#       "where": {"city": "Bengaluru",
#                 "time_range": "2026-04-23T22:00:00+05:30/PT8H"}
#   }}]
# }
# After the semantic layer runs the compiled SQL, the bot receives:
result = {
    "metric": "transactions",
    "city": "Bengaluru",
    "window": "2026-04-23 22:00 — 2026-04-24 06:00",
    "rows": [
        {"payment_method": "upi",        "transactions": 58_412},
        {"payment_method": "card",       "transactions": 9_117},
        {"payment_method": "wallet",     "transactions": 2_894},
        {"payment_method": "netbanking", "transactions": 861},
    ],
    "total": 71_284,
}

Walk the call carefully. Five mechanisms in that schema are doing the work that, in a text-to-SQL world, the LLM would have to do alone: the metric enum restricts the model to the five registered metrics; the description field carries each metric's business definition (gmv excludes refunds and test orders), so the model can disambiguate "order volume" into transactions; the group_by enum bounds slicing to the dimensions the registry exposes; the where schema types every filter, with payment_method itself an enum; and time_range is a plain ISO 8601 string that the compiler, not the model, resolves into warehouse timestamps.

Why the schema-as-prompt approach scales

The semantic-layer-as-LLM-interface story is not a clever hack — it is what falls out of two structural properties of LLM tool use.

[Figure: Error rate falls as the LLM's SQL responsibility shrinks. Horizontal stacked bars of LLM error rate by architecture (Razorpay internal eval, 2024–2025). Text-to-SQL on the raw warehouse schema: 22% total, split into hallucinated joins (8%), wrong metric definition (6%), missing filter (5%), and SQL syntax (3%). Text-to-SQL on curated views: 11%, mostly wrong metric (6%) and missing filter (5%). Text-to-metric-call via the semantic layer: 3%, almost entirely ambiguous user questions rather than generation bugs.]
The 22% baseline is what Razorpay measured in early 2024 with GPT-4 + the raw Snowflake schema. Adding curated views (a single denormalised "fact view" per metric area) cut errors in half — but most of the remaining errors were the metric-definition class (6%) and missing-filter class (5%), neither of which curated views fix. Routing through the semantic layer collapsed both classes to zero, leaving only the irreducible 3% of genuinely ambiguous questions.

The first structural property is search-space size. A text-to-SQL LLM picks among ~10⁶ syntactically valid queries that touch the warehouse. A text-to-metric LLM picks among (number of metrics) × (number of dimension subsets) × (number of filter shapes) — for Razorpay, roughly 5 × 64 × 32 ≈ 10⁴. Two orders of magnitude smaller search space, and the search space is bounded by what the registry says exists. Why this matters for the model: a smaller, typed search space is what tool-use APIs were designed for. The model is doing classification ("which metric matches this question?") rather than generation ("write me valid SQL"), and classification on a 5-item enum is a problem language models solve at >99% accuracy.

The second structural property is deterministic compilation. The same {metric: 'gmv', group_by: ['city']} call always produces the same SQL — the compiler is a pure function of the registry plus the call. So if the LLM emits the right call, the answer is right; if the LLM emits the wrong call, the answer is wrong but consistently so, and the failure mode shows up as "wrong metric chosen" (which a human can audit in the Slack transcript) rather than "subtly wrong SQL" (which nobody will notice until a board meeting).
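The auditability half of that argument can be sketched too. audit_line below is a hypothetical transcript logger, not Razorpay's code: because the LLM's output is a small typed call, the reviewable artifact is the call itself, with no SQL diff to read.

```python
import json

def audit_line(user_msg: str, tool_call: dict) -> str:
    """One transcript line per query: the question plus the resolved call.
    A reviewer can spot 'wrong metric chosen' at a glance."""
    # sort_keys canonicalises the call, mirroring the compiler's
    # same-call-same-SQL determinism: identical calls log identically.
    return f"{user_msg!r} -> {json.dumps(tool_call, sort_keys=True)}"

line = audit_line(
    "GMV in Bengaluru last week?",
    {"metric": "gmv", "group_by": ["city"], "where": {"city": "Bengaluru"}},
)
print(line)
```

The logged call is what a human signs off on in the Slack thread; the compiled SQL can be attached for audit but never needs to be reviewed line by line.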

How Indian companies are wiring this up

Three deployment patterns now dominate the Indian data stack — call them the small, medium, and large recipes.

Small (Series A–B fintechs and D2C brands). dbt MetricFlow + the dbt sl CLI + a Slack bot that proxies to OpenAI. The metric registry lives in the same dbt repo as the transformation layer; PRs that change a metric definition show up in Slack for review. Total operational cost: about ₹15,000/month in inference, plus dbt Cloud Team-tier pricing. This is the stack at companies like Slice, Jar, and Niyo as of mid-2025.

Medium (Series C–D, 50–200 engineers). Cube as the semantic layer (because Cube's pre-aggregations make customer-facing dashboards viable at sub-100ms p99) plus a self-hosted LLM router that fans out to GPT-4o, Claude, or Llama-3.1-70B depending on the question class. The router is necessary because not every question deserves a frontier model — "what was GMV yesterday?" is fine on Llama; "diagnose why GMV dipped 8% on Tuesday" wants Claude. Cred and Cars24 run variations of this stack.
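A hedged sketch of such a router, with keyword heuristics and model names that are purely illustrative (production routers at these companies classify with trained models, not substring checks):

```python
# Illustrative cues only: a real router would classify the question with a
# small trained model, not keyword matching.
DIAGNOSTIC_CUES = ("why", "diagnose", "explain", "root cause", "dipped")

def route_model(question: str) -> str:
    """Cheap metric lookups go to a small model; diagnostic questions
    go to a frontier model. Model ids are hypothetical placeholders."""
    q = question.lower()
    if any(cue in q for cue in DIAGNOSTIC_CUES):
        return "claude-sonnet"    # illustrative frontier-model id
    return "llama-3.1-70b"        # illustrative cheap-model id

print(route_model("what was GMV yesterday?"))
print(route_model("diagnose why GMV dipped 8% on Tuesday"))
```

The design point is that routing happens before the tool schema is even assembled, so the cheap model and the frontier model see the same registry and emit the same typed calls.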

Large (public-listed and unicorns). Custom semantic layer on top of dbt, with a metric API that is itself just a tier in the data stack — same way the warehouse is a tier. Multiple LLM consumers: Slack bot for analysts, embedded chat in the founder dashboard, voice interface for ops folks who run warehouse floors. Razorpay, Zerodha, and PhonePe are in this band; the build-vs-buy economics flip once you have ~200+ metrics and >50 internal LLM users.

What is common across all three: the metric definition lives in dbt YAML, never in the LLM's prompt and never in the consumer code. That single rule is the entire architectural insight of this chapter.

Common confusions

Going deeper

The "tool-use schema as the universe" pattern

The single most underestimated property of OpenAI / Anthropic tool-use APIs is that the model is constrained to the schema, not just guided by it. When metric is declared as enum: ["gmv", "transactions", ...], the inference layer rejects any model output that does not match — the model literally cannot emit metric: "revenue". This is a stronger constraint than fine-tuning or prompt engineering can give you. Why this matters operationally: it means the safety property "the LLM never asks for a metric that does not exist" holds by construction, not by hope. You do not need an evaluation suite to verify it; you need to look at the schema. The hard problem reduces to "is the right metric in the enum?" — a question for the data team, not the ML team.

Multi-turn conversations and the state problem

Single-turn questions ("GMV yesterday") are the easy case. Multi-turn ("now break that down by city") is where most deployments break, because the LLM has to remember which metric was queried in the previous turn. The fix is to keep the previous tool call in the conversation context, not just the textual response. Razorpay's bot stores the last query_metric call as a structured field in the conversation history; when the user says "now break that down by city", the LLM sees the previous call as part of its prompt and emits a near-identical call with group_by: ["city"] appended. This is the same trick that Cursor uses for code edits — keep the structured operation in the loop, not just the natural-language summary.
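A minimal sketch of that refinement step, where merge_followup is a hypothetical helper rather than Razorpay's code: the previous structured call, kept in context, is the thing the follow-up mutates.

```python
import copy

def merge_followup(prev_call: dict, add_group_by=(), add_where=None) -> dict:
    """Build the follow-up call by extending the previous structured call,
    rather than re-deriving intent from the textual summary."""
    call = copy.deepcopy(prev_call)   # never mutate the stored history
    for dim in add_group_by:
        if dim not in call.setdefault("group_by", []):
            call["group_by"].append(dim)
    if add_where:
        call.setdefault("where", {}).update(add_where)
    return call

prev = {"metric": "gmv", "where": {"time_range": "2026-04-23/P1D"}}
# User: "now break that down by city"
print(merge_followup(prev, add_group_by=["city"]))
```

Because the previous call is structured, "that" resolves unambiguously to the gmv query; a bot that kept only the natural-language answer would have to guess which metric "that" meant.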

Why "ask the warehouse anything" is the wrong product

The early product instinct in 2023 was to position the LLM as a universal warehouse interface — "ask anything, get an answer". This consistently produced a worse product than positioning it as a metric-aware assistant. Why? Because "ask anything" implies the LLM can answer questions outside the registered metrics ("why did Tuesday's GMV dip?"), which the metric API cannot answer — that is a diagnostic question requiring root-cause analysis, not a metric query. Mixing the two products in one chat interface confuses users. The 2024–2025 shift was to split them: a metric-query bot that answers numerical questions against the registry, and a separate diagnostic agent (often built on top of the same metric API plus a lineage graph) that handles "why" questions. Cred separated these in mid-2024; user satisfaction (measured weekly) jumped 18 points.

The cache layer is what makes this affordable

Inference at GPT-4o pricing is ~₹2.5 per call (for a typical 2k-token system prompt + 500-token response). At 600 internal users and 5 calls/user/day, that is ₹7,500/day or ~₹2.3 lakh/month — affordable for a Razorpay, painful for a Series A. The cost halves if you cache the metric API response (not the LLM response — different users will phrase the same question differently, but they will resolve to the same query_metric call). Cube and dbt Semantic Layer both cache compiled metric responses by call signature, so the second user to ask "GMV yesterday" pays only for the LLM inference, not for the warehouse query. Why caching the metric call (not the user prompt) is the right cache key: a thousand users will phrase "GMV yesterday" a thousand ways, but they all collapse to the same {metric: 'gmv', time_range: '...yesterday'} call. The cache hit rate on the metric API is consequently 80–90% in production; on raw user prompts it would be <5%.
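The cache-key idea fits in a few lines. This is a sketch with an in-memory dict standing in for the real cache and a counter standing in for the Snowflake round trip; the canonical-JSON line is the load-bearing one.

```python
import json

_cache: dict = {}
warehouse_hits = 0

def cache_key(call: dict) -> str:
    # Canonical JSON: key order and whitespace normalised, so two
    # differently-phrased questions that resolve to the same structured
    # call collide on the same key.
    return json.dumps(call, sort_keys=True, separators=(",", ":"))

def run_metric(call: dict):
    global warehouse_hits
    key = cache_key(call)
    if key not in _cache:
        warehouse_hits += 1            # stand-in for the warehouse query
        _cache[key] = {"total": 71_284}
    return _cache[key]

# "GMV yesterday" and "yesterday's gmv pls" resolve to the same call:
call = {"metric": "gmv", "where": {"time_range": "2026-04-23/P1D"}}
run_metric(call)
run_metric(dict(reversed(list(call.items()))))  # same call, different key order
print(warehouse_hits)  # 1
```

Keying on the raw user prompt instead would make the second, differently-worded question a cache miss, which is exactly why the hit rate collapses from 80–90% to under 5%.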

What this looks like at GST scale

GSTN processes filings for ~1.4 crore businesses; an internal LLM agent that answers "how many GSTR-3B filings were submitted yesterday in Maharashtra?" is the same architecture as Razorpay's bot, just on a public-sector data scale. The semantic layer at that scale has to solve two extra problems: (1) PII / regulatory access — the LLM must never see merchant-level PII, only aggregates above k-anonymity thresholds; (2) audit trails — every LLM call must be logged with the resolved metric call and the SQL that ran, because public-sector queries are subject to RTI requests. The standard pattern is to add a row-level security layer on top of the metric API that checks the caller's role before compiling the SQL. The LLM never knows about the RLS rules; it only sees the metrics it is authorised to query (the registry is filtered per-role before being passed to the model).
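The per-role registry filtering can be sketched as follows. The role map and tool shape are illustrative; a real deployment derives the authorisation from warehouse RLS policies, not a hard-coded dict.

```python
import json

# A cut-down tool schema in the METRIC_TOOLS shape from earlier.
TOOL = {"name": "query_metric",
        "parameters": {"properties": {"metric": {"enum": [
            "gmv", "transactions", "p99_settlement_lag_minutes"]}}}}

# Illustrative role -> metric authorisation map.
ROLE_METRICS = {"ops": {"transactions", "p99_settlement_lag_minutes"}}

def registry_for_role(role: str, tool: dict) -> dict:
    """Filter the metric enum per role before the schema reaches the model:
    the LLM cannot ask for a metric it was never shown."""
    t = json.loads(json.dumps(tool))  # deep copy; master schema untouched
    allowed = ROLE_METRICS.get(role, set())
    enum = t["parameters"]["properties"]["metric"]["enum"]
    t["parameters"]["properties"]["metric"]["enum"] = [
        m for m in enum if m in allowed]
    return t

ops_tool = registry_for_role("ops", TOOL)
print(ops_tool["parameters"]["properties"]["metric"]["enum"])
```

Because the filtering happens before the schema is serialised into the prompt, the audit log only ever contains calls the caller's role was entitled to make.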

Where this leads next

This chapter closes Build 13 — the semantic layer arc. The metric is the contract; the renderer (Looker, Hex, React, LLM agent) is interchangeable. The next build — Build 14, real-time analytics (/wiki/wall-batch-metrics-arent-fresh-enough) — confronts the limit of every batch-warehouse-backed semantic layer: the analyst wants the GMV from five minutes ago, but the warehouse-backed metric is from last night. The architectural answer is to add a low-latency OLAP tier (ClickHouse, Pinot, Druid) underneath the semantic layer, so the same metric definition can resolve to either Snowflake (yesterday's number, batch-fresh) or ClickHouse (this minute's number, stream-fresh) depending on the freshness the consumer asks for.
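That freshness-based routing can be previewed with a sketch; the threshold and backend names are illustrative stand-ins, not any particular product's API.

```python
def resolve_backend(freshness_seconds: int) -> str:
    """Route one metric definition to a batch or streaming tier based on
    how fresh the consumer needs the answer. Names are illustrative."""
    BATCH_FRESHNESS = 24 * 3600    # assumed nightly warehouse load
    if freshness_seconds < BATCH_FRESHNESS:
        return "clickhouse"        # low-latency OLAP tier, stream-fresh
    return "snowflake"             # batch warehouse, batch-fresh

print(resolve_backend(300))        # "GMV five minutes ago"
print(resolve_backend(24 * 3600))  # "GMV as of last night"
```

The contract stays the same on both paths: one metric definition, one typed call, two compilation targets.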

What carries forward from Build 13 into Build 14, and into the real-time and feature-store builds beyond it, is the wire-protocol-first instinct: every tier in the modern data stack now ships a typed API the next tier consumes, and every typed API can be the substrate for an LLM agent if you want it to be. The semantic layer was the first tier where this clicked. It will not be the last.
