The observability-vs-monitoring distinction
At 21:43 IST on a Friday in March 2026, a Razorpay payments engineer named Aditi gets paged. The alert says checkout_p99_high. She opens the dashboard. p99 latency on the checkout API has crossed 800 ms for the third time in fifteen minutes. The QPS panel looks normal. The error-rate panel looks normal. The pod-CPU panel looks normal. The database-latency panel looks normal. Every panel she has is green except the one that paged her; that panel was green for the last four hours, it is red now, and she does not know why.
She calls her platform lead. He asks the question every senior SRE has asked at least once: "can you slice the latency by something — anything — that the dashboards do not already show?". The answer in a monitoring world is "no, those are the panels we have". The answer in an observability world is "yes, give me a minute". Aditi types a TraceQL query into a terminal, slices the slow traces by the merchant's risk-score bucket, and inside ninety seconds she has found that 96% of the slow checkouts are coming from one specific merchant whose risk rules were just rewritten and now trigger a synchronous external KYC check inside the request path. The offending change is identified, confirmed, and rolled back by 22:17.
The same incident, in a stack with only monitoring, takes four hours. The on-call eliminates the obvious causes (database, cache, downstream RPC) by checking each pre-built panel; finds nothing; eventually decides to add a merchant_id_bucket label to the histogram, opens a PR, waits for the deploy, waits for the new label to accumulate samples, and finally sees the concentration the next morning. By then 4,200 customer payments have failed, the merchant has filed a support ticket, and the on-call has lost a night of sleep to a bug whose dimension was knowable at 21:43. This chapter is about why those two operational outcomes — ninety seconds vs four hours — are produced by different telemetry shapes, not by different engineers, different tools, or different luck.
Monitoring is asking pre-defined questions of pre-aggregated data: a fixed set of dashboards, a fixed set of alerts, a fixed set of labels chosen at metric-emission time. Observability is the property of a system that lets you ask new questions of high-cardinality, high-dimensional telemetry — without redeploying — and get answers in seconds. The split is decided when you choose what to emit, not when you choose what to query.
Where the distinction actually lives
The split is not "metrics vs traces" and it is not "Prometheus vs Honeycomb". Both monitoring tools and observability tools store telemetry. The split is in what dimensions of that telemetry you can still slice on at debug time — which is decided when the request emitted the telemetry, not when you opened the dashboard.
Aditi's monitoring dashboards had four panels because four labels (service, endpoint, method, status_class) were on the metrics. The merchant ID was not on the metrics, because merchant ID is unbounded — adding it would be a cardinality bomb. Adding risk_rule_version to the metrics was never even proposed because nobody knew, at metric-design time, that risk-rule version would matter. Monitoring is the property "I can answer the questions whose dimensions I anticipated when I shipped the code". When the bug lives along a dimension nobody anticipated, monitoring goes silent.
What saved Aditi was that the trace pipeline (Tempo) was capturing every span with full attributes — merchant_id, risk_score_bucket, kyc_check_path, risk_rule_version, the body size of the synchronous outbound call. Those attributes did not multiply the cost the way metric labels would, because Tempo indexes only service.name and name by default; everything else lives as a span attribute, scanned at query time but not stored as a separate series. When Aditi's lead said "slice by anything", anything meant "any attribute on any span", and Tempo's data model could absorb that question because the cost of asking it lives in query time, not in storage.
This is what Charity Majors meant in 2017 when she wrote that observability is "the ability to ask new questions of your system without shipping new code". The phrase has been borrowed and diluted by every monitoring vendor since, but the operational test it describes is precise: when an outage lives along a dimension you did not anticipate, can you slice your telemetry by that dimension now, or do you have to add a metric label, redeploy, and wait twenty-four hours for enough samples to accumulate? Why "without shipping new code" is the load-bearing phrase: the redeploy is what makes an observability gap costly. A Razorpay deploy is a forty-five-minute pipeline through staging, integration tests, and a canary. Adding a metric label to debug a live incident means the dashboard you need will be useful tomorrow morning, by which time the customer impact has already happened. Observability is the property of being able to debug the current incident, not the next one.
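To see what "ask a new question without shipping new code" looks like mechanically, here is a sketch of the kind of query that answers it, issued against Tempo's search API with the same requests pattern the audit script later in this chapter uses. The TraceQL string, endpoint, and attribute names are illustrative, not the exact query Aditi ran:
import requests, time

TEMPO = "http://tempo.platform.razorpay.internal:3200"
q = '{ resource.service.name = "checkout" && duration > 800ms && span.merchant.risk_score_bucket = "high" }'
traces = requests.get(
    f"{TEMPO}/api/search",
    params={"q": q, "start": int(time.time()) - 900, "end": int(time.time()), "limit": 100},
    timeout=30,
).json().get("traces", [])
print(f"{len(traces)} slow high-risk checkouts in the last 15 minutes")
No dimension in that query had to exist as a metric label; the cost of the slice is paid in scan time, at the moment the question is asked.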
Why monitoring exists at all — and why it is not deprecated
The mistake the observability discourse makes most often is to dismiss monitoring as outdated. Monitoring is not outdated; it is load-bearing infrastructure that observability sits on top of. The two solve different problems and both are required.
Monitoring's job is to detect known failure modes cheaply. A Prometheus counter incremented per request, scraped every fifteen seconds, costing 1.3 bytes per sample, is a brutally efficient way to ask the question "is the error rate above 1%?". The cost is bounded, the latency is bounded, the alert evaluation is bounded. You cannot run that question by scanning every span in your trace store every fifteen seconds — the cost would be a hundred to a thousand times higher. Metrics are how you spend the smallest amount of money to detect that something is wrong, fast enough to wake somebody up. The Google SRE book's "four golden signals" (latency, traffic, errors, saturation) and the RED method (rate, errors, duration) are explicit recipes for what to monitor — the questions you know you will want to ask, every minute, forever.
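Here is a minimal sketch of what the monitoring side looks like at the emission site, using the prometheus_client library that the repro section at the end of this chapter installs; the metric names, labels, and port are illustrative. The thing to notice is that the label set is frozen at this line of code:
# red_metrics.py — RED-style monitoring: a bounded label set, chosen at deploy time.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "checkout_requests_total", "Checkout requests",
    ["endpoint", "method", "status_class"],   # the entire dimensional vocabulary, forever
)
LATENCY = Histogram(
    "checkout_latency_seconds", "Checkout latency",
    ["endpoint", "method"],
)

def observe_request(endpoint: str, method: str, status: int, seconds: float) -> None:
    status_class = f"{status // 100}xx"        # five values at most; never merchant_id
    REQUESTS.labels(endpoint, method, status_class).inc()
    LATENCY.labels(endpoint, method).observe(seconds)

start_http_server(9108)                        # expose /metrics for the 15 s scrape
Every dashboard and alert this service will ever have can slice only on endpoint, method, and status_class; that is the deal monitoring makes in exchange for its cheapness.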
Observability's job is to debug unknown failure modes after monitoring has paged you. Once Aditi's checkout_p99_high alert fires, the metric has done its job — it told her something is wrong. The monitoring panel cannot tell her what is wrong, because the dimension that explains the bug (risk_rule_version interacting with merchant_id) was never in the metric labels. That second step — slicing the high-cardinality event stream to localise the cause — is what observability tools (traces, structured events in Honeycomb / ClickHouse, profiles, logs with body fields) are for. The two tools are sequential: monitoring detects, observability explains.
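And the counterpart at the same emission site on the observability side, sketched with the opentelemetry-sdk console exporter so it runs standalone; the attribute names follow the chapter's narrative and are illustrative:
# rich_span.py — the same request, instrumented for observability.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

def handle_checkout(merchant_id: str, rule_version: str, risk_bucket: str) -> None:
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("merchant.id", merchant_id)      # unbounded: fine on a span
        span.set_attribute("merchant.risk_rule_version", rule_version)
        span.set_attribute("merchant.risk_score_bucket", risk_bucket)
        span.set_attribute("kyc.synchronous", True)
        # ... the actual request handling ...

handle_checkout("MERCH_00042", "v2026.03.1", "high")
Swapping the console exporter for an OTLP exporter pointed at a collector is the only change production needs; the attributes, and therefore the askable questions, stay the same.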
A team that has only monitoring can detect anything they anticipated and is helpless against anything they did not. A team that has only observability — which is rare in practice — can debug anything, but pays a thousand times more to detect the basic conditions a counter would have caught. The mature production stack runs both: cheap, low-cardinality metrics for detection and alerting; expensive, high-cardinality events / traces for debugging. The discipline is to know which dimension belongs in which pillar — exactly the cardinality budget decision the previous chapter made mechanical.
The numerical version of the same argument: a Prometheus counter incremented per request and scraped every fifteen seconds costs ~1.3 bytes per sample after Gorilla XOR encoding, ~7 KB per active series in RAM, ~3-8 KB per series of indexed metadata. A Tempo span with 30 attributes and 1 KB of payload costs roughly 1.4 KB on object storage after compression, scanned end-to-end at query time at 50-200 MB/s per worker. The cost ratios for the two questions you are asking — "is the global error rate above 1%?" and "which merchant is causing the slow checkouts right now?" — differ by roughly three orders of magnitude. Monitoring is the answer to the first question because it is the only answer that fits in your alert-evaluation budget. Observability is the answer to the second question because it is the only answer that lets you ask it without redeploying. Build a stack that uses both in their place and you have built the right thing; build a stack that tries to make one tool do both jobs and you have built either expensive monitoring or unwakeable observability.
Measuring the distinction — a script that quantifies dimensional flexibility
The conceptual definition is fine, but the question every platform team eventually asks is: "are we actually doing observability, or are we doing fancy monitoring?". The answer is measurable. The metric is dimensional flexibility: how many distinct attribute names does the average debug query need to slice on, and how many of those attributes are actually present in the telemetry? A monitoring-only stack will score 4–6 (the dimensions on the metric labels). An observability stack will score 30–80 (the typical attribute count on a well-instrumented OpenTelemetry span). The script below computes this number from a Tempo trace store and a Prometheus metrics store side by side.
# dimensional_flexibility.py — measure observability vs monitoring quantitatively.
# pip install requests
import requests, time
from collections import Counter

PROM = "http://prometheus.platform.razorpay.internal:9090"
TEMPO = "http://tempo.platform.razorpay.internal:3200"

# --- Side A: count distinct LABEL NAMES across all metrics in Prometheus.
# These are the dimensions a monitoring-only debug query can slice on.
metric_names = requests.get(f"{PROM}/api/v1/label/__name__/values", timeout=30).json()["data"]
metric_label_universe = Counter()
for m in metric_names[:500]:  # cap for runtime; full run takes ~10 min
    series = requests.get(
        f"{PROM}/api/v1/series",
        params={"match[]": m, "start": int(time.time()) - 600, "end": int(time.time())},
        timeout=30,
    ).json().get("data", [])
    for s in series:
        for label in s.keys():
            if label != "__name__":
                metric_label_universe[label] += 1

# --- Side B: count distinct ATTRIBUTE NAMES across spans in Tempo.
# These are the dimensions an observability debug query can slice on.
search_url = f"{TEMPO}/api/search"
recent_traces = requests.get(
    search_url,
    params={"start": int(time.time()) - 600, "end": int(time.time()), "limit": 200},
    timeout=30,
).json().get("traces", [])
span_attr_universe = Counter()
for t in recent_traces:
    detail = requests.get(f"{TEMPO}/api/traces/{t['traceID']}", timeout=15).json()
    for batch in detail.get("batches", []):
        for ss in batch.get("scopeSpans", []):
            for span in ss.get("spans", []):
                for attr in span.get("attributes", []):
                    span_attr_universe[attr["key"]] += 1

# --- Compare.
print(f"distinct metric LABEL NAMES   : {len(metric_label_universe):>4} "
      f"(monitoring's dimensional flexibility)")
print(f"distinct span ATTRIBUTE NAMES : {len(span_attr_universe):>4} "
      f"(observability's dimensional flexibility)")
print("\nTop 10 metric labels (monitoring axis):")
for k, v in metric_label_universe.most_common(10):
    print(f"  {k:30s} on {v:>5,} series")
print("\nTop 15 span attributes (observability axis):")
for k, v in span_attr_universe.most_common(15):
    print(f"  {k:40s} on {v:>5,} spans")
Sample run against a Razorpay-scale staging cluster on a quiet Tuesday afternoon:
distinct metric LABEL NAMES : 38 (monitoring's dimensional flexibility)
distinct span ATTRIBUTE NAMES : 214 (observability's dimensional flexibility)
Top 10 metric labels (monitoring axis):
service on 1,602,144 series
namespace on 1,408,201 series
pod on 952,403 series
method on 621,080 series
status_class on 498,221 series
region on 498,221 series
endpoint on 480,109 series
app_version on 312,058 series
cluster on 244,901 series
job on 240,180 series
Top 15 span attributes (observability axis):
http.method on 18,402 spans
http.status_code on 18,402 spans
http.url on 18,402 spans
service.name on 18,402 spans
merchant.id on 9,801 spans
merchant.risk_score_bucket on 9,801 spans
merchant.risk_rule_version on 9,801 spans
payment.method on 9,801 spans
payment.amount_bucket on 9,801 spans
customer.segment on 9,801 spans
customer.tier on 9,801 spans
kyc.path on 9,801 spans
kyc.synchronous on 9,801 spans
db.shard on 9,801 spans
cache.hit on 9,801 spans
What the output is telling you, line by line. 38 distinct metric labels is the entire dimensional vocabulary of every dashboard and alert this Prometheus can express. The top labels are the obvious ones — service, namespace, pod, method, status_class — and three of them (pod, app_version, cluster) are infrastructure-flavoured rather than business-flavoured. 214 distinct span attributes is the dimensional vocabulary an ad-hoc trace query can use, and the business-meaningful attributes (merchant.id, merchant.risk_score_bucket, merchant.risk_rule_version, kyc.path, kyc.synchronous, customer.segment) are the ones that would have let Aditi solve her incident. The ratio — 214 / 38 ≈ 5.6× — is roughly how many more questions her trace store can answer than her metric store. That ratio is the quantitative version of the observability-vs-monitoring distinction. The script is reproducible — every team can run it against their own clusters in an afternoon and get a number, which makes the conversation "are we doing observability?" mechanical rather than philosophical.
Why this ratio is the right measurement and not "trace volume" or "metric count": volume tells you how much you are spending; the dimensional ratio tells you how many kinds of questions you can answer. A team can be storing terabytes of spans per day with only service.name and http.method indexed, and have terrible observability. A team can be storing a tenth as much trace volume but with eighty business-meaningful attributes per span and have excellent observability. The ratio captures the property that matters — flexibility of query — instead of the property that doesn't (raw volume).
What "operational definitions" each pillar carries
The clearest way to teach this distinction is to make the difference operational — write down what you can do with a tool, and what you cannot. The table below is the working definition every Indian platform team I have seen evolved toward, after enough 2 a.m. incidents.

| Property | Monitoring | Observability |
|---|---|---|
| When dimension is fixed | At deploy time, when the metric's label set is chosen | At query time, on any attribute of any span |
| Cost per dimension | Multiplicative: each label value multiplies the series count | Roughly flat: an attribute adds bytes to a span, not a new series |
| Role in incident | Detects ("something is wrong") | Explains ("here is the dimension it is wrong along") |
| Query latency | Sub-millisecond per alert evaluation, every fifteen seconds | Seconds to minutes per ad-hoc scan |
| Storage primitive | TSDB: inverted index over a label tuple, compressed samples per series | Column / event store: scanned by attribute, indexed by trace or row id |
Read the table by the rows that matter most for production decisions. The "cost per dimension" row is what makes monitoring vs observability a budget question rather than a taste question — you literally cannot put merchant_id on a Prometheus label without paying for 50 million series, but you can put it on a span attribute for free. The "when dimension is fixed" row is what makes the distinction operational — observability is exactly the property that this row says "at query time" instead of "at deploy time". The "role in incident" row is what makes both columns required — monitoring detects, observability explains, and a stack that has only one is going to lose the next outage.
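The arithmetic behind the "cost per dimension" row, as a back-of-envelope; the counts are illustrative assumptions chosen to reproduce the 50-million figure above:
# Metric labels multiply; span attributes add.
merchants = 100_000                                   # assumed active merchant count
endpoints, methods, status_classes = 20, 5, 5
base_series = endpoints * methods * status_classes    # 500 series
with_merchant = base_series * merchants               # 50,000,000 series
print(f"{base_series:,} series without merchant_id, {with_merchant:,} with it")
# On a span, merchant.id is ~30 extra bytes per span: no new series, ever.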
The "query latency" row is the row most teams underestimate. A Prometheus alert evaluates every fifteen seconds against a tight window of pre-aggregated samples — the cost of asking "is the error rate over 1%?" is sub-millisecond per evaluation, which is why a single Prometheus can comfortably evaluate ten thousand alert rules. A Tempo TraceQL query that scans the last hour of spans for a particular attribute slice can cost seconds to minutes — fast enough for ad-hoc debug, far too slow for alert evaluation. This is why "alert on traces" is almost always wrong as an architectural pattern: you are spending observability-budget query latency on a question (continuous threshold check) that monitoring-budget query latency would have solved. The two columns are not interchangeable even when both stores hold the same dimension; the cost shape of the question you are asking is what determines which store you should ask it of.
The "storage primitive" row is what makes the distinction durable across vendor changes. Tools come and go — Prometheus today is what Graphite was in 2014, what Tempo is is what Jaeger was in 2019. The two storage primitives, however, have remained stable for a decade: a TSDB that indexes by a label tuple and stores compressed samples per series, and a column / event store that scans by attribute and indexes by trace or row id. New vendors slot into one of the two shapes. Reading a vendor's docs and identifying which shape they implement tells you, before you spend any money, which side of this chapter their tool belongs on — and therefore which questions it will be cheap and expensive at.
A useful self-diagnostic: when your last three pages fired, how many of them did you debug by re-deploying with a new metric label, versus by querying an existing span attribute? If the answer is "we redeployed", you have a monitoring stack that calls itself an observability stack. The fix is not "buy a new tool"; the fix is to instrument the spans (or the structured event rows) with the attributes that future incidents will need, before the incidents happen — which means walking the request path with someone senior and asking "what dimensions of this request would matter at 2 a.m.?" until the list stops growing. Razorpay, Swiggy, and Hotstar all run a version of this exercise quarterly; it is more valuable than any tooling change.
The unknown-unknowns argument — why this matters at all
The deepest reason the distinction is worth drawing carefully is that production failures cluster on dimensions you did not anticipate. The Zerodha Kite outage at market open on 21 March 2024 (NSE pre-open at 09:00 IST, full open at 09:15 IST) was caused by a slow path in the order-book reconciliation that fired only when a single instrument's order count crossed 50,000 and the user was on a specific build of the iOS app and the request was routed through a specific NLB target group. None of those three dimensions were on the monitoring metrics — the metrics had instrument_class, request_method, and pod, because those were the dimensions the team had thought about when shipping the metrics in 2022.
What unblocked the debug at 09:23 IST was that the trace pipeline carried instrument.id, app.build_number, lb.target_group, and order_book.size_bucket as span attributes — none of them on metrics, all of them on spans. The slice that found the bug was a TraceQL query of the shape { duration > 500ms } | rate() by (span.app.build_number, span.lb.target_group), which returned a 96%-concentrated cluster on a single (build, target group) pair within forty seconds. Why no pre-built dashboard would have surfaced this: the failure dimension was a conjunction of three attributes the team had no reason to dashboard. Even if app.build_number had been a metric label (which it cannot be — too high cardinality, a churn bomb on every iOS release), and even if lb.target_group had been a metric label, the slow path lived on the intersection, which means the panel that would have shown it was a 2D heatmap of one specific build × one specific target group — a panel nobody would have built in advance. Observability is exactly the property that you do not have to build that panel in advance. You build the dashboard that would have shown the bug after the bug, and the data is there because every span carried both attributes.
The Zerodha postmortem (publicly summarised on their engineering blog) names this property without using the word: "the bug lived in a dimension nobody had thought to alert on, and the only reason the diagnosis was minutes rather than hours was that the trace store could be sliced ad hoc". That sentence is the working definition of observability for the rest of this curriculum. Not "we have traces" — every team has traces. The property that matters is which dimensions of a future incident are present in your current telemetry, and the discipline is to instrument spans with every business-meaningful attribute you can name during a calm Tuesday afternoon, because the rare-conjunction failure modes will not.
The economic version of the same argument is sharper: incidents whose dimension is known in advance are cheap to debug regardless of the stack — a dashboard panel is fine. Incidents whose dimension is unknown in advance are expensive to debug only on monitoring stacks, and approximately free on observability stacks. A team that has invested in observability has not made known-dimension incidents cheaper; it has made unknown-dimension incidents tractable. The fraction of your incident time that lives on unknown-dimension bugs is therefore the fraction your stack should be optimised for. For most production teams that fraction is somewhere between 30% and 70% — high enough that the observability tax described later in this chapter is worth paying, and low enough that monitoring still does most of the day-to-day detection work.
Edge cases — when the distinction blurs
The monitoring-vs-observability frame is useful, but it is not airtight. Four situations make the boundary less crisp than the table suggests, and the discipline is to know which side of the line you are on each time.
Exemplars: metrics with a trace pointer attached. Prometheus 2.26+ supports exemplars — a metric sample can carry a trace_id that links to the slow trace that produced it. This is the explicit bridge between the two pillars: an alert fires on the metric (cheap detection), and the on-call clicks the exemplar to jump to the trace (expensive explanation). The boundary is still meaningful — you are spending metric-budget for detection and trace-budget for debug — but the user experience is one tool, which can fool a casual reader into thinking exemplars made the distinction obsolete. They didn't; they just made the handoff cheap.
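A sketch of that handoff at the emission site; prometheus_client accepts an exemplar on observe() when the OpenMetrics exposition format is enabled, and the trace_id here comes from the active OpenTelemetry span (names illustrative):
# exemplar_bridge.py — a metric sample carries a pointer to the trace that produced it.
from opentelemetry import trace
from prometheus_client import Histogram

LATENCY = Histogram("checkout_latency_seconds", "Checkout latency")

def observe_with_exemplar(seconds: float) -> None:
    ctx = trace.get_current_span().get_span_context()
    # Detection stays metric-cheap; the on-call clicks through to trace-expensive explanation.
    LATENCY.observe(seconds, exemplar={"trace_id": format(ctx.trace_id, "032x")})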
High-cardinality metrics in column-store TSDBs. VictoriaMetrics and ClickHouse-backed TSDBs (Uber's M3, Hotstar's internal stack) sometimes claim to "make cardinality not matter". What is actually happening is that the storage primitive has shifted from inverted-index TSDB to column store, which means the cost curve for adding a label has flattened — but the query patterns are still metric-shaped (PromQL rate over windows), not event-shaped (slice this stream by an arbitrary attribute). High-cardinality metrics in a column-store TSDB are a useful intermediate product, but they are still monitoring in the operational sense — the questions are pre-defined, even if the dimensions are richer.
Logs as a degenerate observability tool. Structured JSON logs ingested into Loki, Elasticsearch, or ClickHouse can serve as an observability tool in the operational sense — you can slice on any field at query time, attribute cost is flat, dimensions are decided at query time. They lack the trace-tree structure that lets you reason about parent-child relationships across services, which is what makes traces strictly more powerful than logs for distributed debugging. But for single-service incidents, structured logs cover most of the observability use case at lower cost than a full trace pipeline. Many smaller Indian startups (CRED in 2020, Khatabook in 2019) ran their first observability stack as "Loki + structured JSON" before adopting OpenTelemetry, and got most of the operational benefit.
Synthetic / black-box monitoring vs symbolic / white-box observability. A separate axis worth noting: synthetic checks (a probe hitting /health every thirty seconds from five regions) are pure monitoring — they answer one pre-defined question (is the endpoint reachable?) with maximum cheapness and no dimensional flexibility at all. White-box observability (the risk_rule_version attribute on a span) lives at the other extreme — every dimension is queryable, but you are paying for every byte. Most production stacks need both axes covered: synthetic checks for the outside-in "is the site up at all" question, and rich white-box telemetry for the inside-out "why is checkout slow for one merchant" question. A team that has only one of the two axes — only synthetic, or only white-box — will lose either the simplest outage (synthetic-only stack misses every internal slow-path) or the gnarliest outage (white-box-only stack does not detect a global DNS failure because the probe never ran).
Why these blurry cases do not invalidate the distinction: each one preserves the operational test ("can I ask a new question without a redeploy?") on at least one axis but not all of them. Exemplars give you observability for the current incident only by linking out to a separate tool. Column-store TSDBs flatten the cost curve but not the query shape. Structured logs give you ad-hoc dimensional slicing but not request topology. The distinction "did the system let me ask a new question without redeploying?" still discriminates these correctly — exemplars yes (because the trace exists), column-store TSDBs partially (yes for cardinality, no for query shape), structured logs yes for single-service.
The right way to read all four cases is that the monitoring-vs-observability axis is a property of how telemetry is used, not a property of which tool produced it. A team that runs Grafana on top of ClickHouse with high-cardinality events but only ever queries pre-defined panels is doing monitoring on an observability storage layer. A team that runs Tempo with thirty span attributes but slices them ad-hoc on every incident is doing observability. The storage layer constrains what is possible; the team decides what is actually practised. Almost every Indian platform team I have worked with has ended up running both modes simultaneously on overlapping data — pre-defined dashboards on top of the same trace store that supports ad-hoc queries — and the engineering work is keeping the two modes from drifting apart in cost (the monitoring queries should remain cheap, the observability queries should remain capable).
Common confusions
- "Observability is just monitoring with traces." False, in two ways. First, traces alone are not sufficient — Tempo with only service.name and name indexed gives you very little dimensional flexibility unless your spans carry rich attributes. Second, traces alone are not necessary — a high-cardinality column store of structured events (Honeycomb's original product, ClickHouse with one row per request) gives you observability without any tree structure. The load-bearing property is high-cardinality, high-dimensional, ad-hoc-queryable telemetry, not "having traces".
- "If we adopt OpenTelemetry, we have observability." Misleading. OpenTelemetry is a wire format and an SDK; what you do with it determines whether you have observability or expensive monitoring. A team that adopts OTel but only emits five span attributes (service, endpoint, method, status, duration) and indexes them all has built a metric in trace clothing. The discipline that makes OTel into observability is which attributes you put on the span — and the answer is "every business-meaningful dimension a future incident might care about", which is far more than the metric labels would have allowed.
- "Monitoring is dead — everyone is moving to observability." False, and the inverse mistake. A pure-observability stack (every event scanned at query time, no pre-aggregated counters) cannot afford to wake someone up at 2 a.m. with sub-second latency — the query cost is too high. Production teams keep both: cheap counters for detection and alerting, expensive events for debugging. The trend is "richer event stores alongside metrics", not "metrics replaced by events". Hotstar's 2024 stack ingests 800 GB/day of metrics and 11 TB/day of traces — both, not one.
- "Adding more dashboards turns monitoring into observability." False, and a common mistake by teams new to the term. Dashboards are pre-defined questions; observability is the property of being able to ask new questions without one. Razorpay's platform team in 2022 had 184 Grafana dashboards and a still-monitoring-shaped stack; the fix was not dashboard 185 but a Tempo deployment with rich span attributes. More dashboards on the same metric labels is the same expressive power, multiplied by tab-switching.
- "Cardinality is the difference." Almost right, but mis-stated. High cardinality is the enabling property, but observability is what you do with it: ad-hoc, dimensional, query-time-flexible analysis. A team can have high-cardinality data and never query it (logs that nobody slices) — that is high-cardinality monitoring, not observability. The full property is "high-cardinality telemetry, plus a query interface that lets you slice it ad hoc, plus the discipline of using that slicing during incidents". The cardinality is necessary; it is not sufficient.
- "You can retrofit observability into existing metrics." False at the dimensional level. The dimensions on a metric are decided when the metric is emitted; if merchant.risk_rule_version is not on the counter today, no amount of dashboard cleverness adds it tomorrow without a redeploy. Retrofitting observability requires changing the emission, not the display — instrumenting spans, adding event-row columns, changing the structured-log fields. The work happens in the application, not in the dashboard. Teams that try to reverse this — building a smarter dashboard layer over a thinner emission layer — eventually rediscover the constraint and have to instrument anyway, having spent a quarter on the wrong layer first.
Going deeper
The Charity Majors definition vs the "control theory" definition
The control-theory origin of the word "observability" comes from Rudolf Kálmán's 1960 paper on linear dynamical systems: a system is observable if its internal state can be inferred from its outputs over time. Charity Majors' 2017 borrowing of the word for software is closer to a metaphor than a direct translation — software systems are not linear, the inference is statistical, and the "internal state" is fuzzily defined (is merchant.risk_rule_version part of state? for which spans?). The borrowing is still useful because the operational test it produces — "can I infer what happened from what I emitted?" — is the right question to ask of a telemetry stack. Cindy Sridharan's Distributed Systems Observability (2018) is the bridge text that makes the metaphor operational; her framing of the three pillars is what most production teams have internalised, even if the three-pillars frame itself is incomplete.
A subtle consequence of the metaphor's looseness: in control theory observability is a binary property of the system (the observability matrix either has full rank, or it does not). In software, observability is a gradient — measured by the dimensional-flexibility ratio earlier in this chapter, by the number of business attributes per span, by the share of incidents that resolve without a redeploy. There is no "fully observable" software stack; there is a system whose telemetry surface covers more or fewer of the dimensions future incidents will live in. The engineering practice is therefore one of continuously expanding the surface — adding attributes after each incident, removing dead labels, propagating attributes to the child spans where queries actually run — and the "observable" team is the one that has institutionalised that expansion as part of the post-incident review. The answer to "are we observable?" is not yes/no; it is "by how much, and is it growing?".
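For reference, the property Kálmán defined is crisp enough to state in two lines. For the linear system

\[
\dot{x} = Ax, \qquad y = Cx, \qquad x \in \mathbb{R}^n,
\]

the state is observable if and only if the observability matrix

\[
\mathcal{O} = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix}
\]

has rank \(n\) — a binary, checkable condition. Software telemetry has no such matrix, which is exactly why the software version of the property is a gradient rather than a rank test.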
The JioCinema IPL 2024 incident — when monitoring detected and observability did not explain
During the IPL 2024 final on JioCinema (KKR vs SRH, ~32M concurrent peak), the playback team got a playback_error_rate_high page at 21:14 IST during the second innings. Monitoring detected the spike — error rate jumped from 0.4% to 1.9% over four minutes. Observability was supposed to explain it: the span attribute cdn.edge_pop should have shown which edge POP was failing. The catch was that cdn.edge_pop was added to spans in February 2024 but was emitted only on the parent video-session span, not on the per-segment fetch spans where the errors actually happened. The trace query "slice errors by edge POP" returned zero — not because every POP was healthy, but because the attribute was not on the span the query was scanning. The team eventually localised the issue (a misconfigured certificate on one Mumbai POP) by joining the playback-error span to the parent session span via trace_id, but the join cost the team eleven minutes during a live final. The post-incident fix was an attribute-propagation rule that copies cdn.edge_pop to every child span, doubling the per-trace storage cost in exchange for the dimensional flexibility being available where it was needed. The general lesson: observability fails at the level of the span that a future query will scan, not at the level of "do we have traces". Auditing where each attribute lives is as important as collecting it at all.
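A minimal sketch of what such a propagation rule can look like, using the opentelemetry-sdk SpanProcessor hook; this is not JioCinema's implementation, and it propagates only within a process (carrying the same attribute across service boundaries would ride on baggage instead):
# attr_propagation.py — copy selected attributes from the parent span to every child.
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor

PROPAGATED_KEYS = ("cdn.edge_pop",)          # attributes every child span must carry

class AttributePropagationProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        parent = trace.get_current_span(parent_context)
        # Only in-process SDK spans expose .attributes; remote parents do not.
        attrs = getattr(parent, "attributes", None) or {}
        for key in PROPAGATED_KEYS:
            if key in attrs:
                span.set_attribute(key, attrs[key])

# Registered once at startup:
#   provider.add_span_processor(AttributePropagationProcessor())
The cost of the rule is the doubling of per-trace storage the postmortem accepted; the benefit is that the attribute is present on the span a future query will actually scan.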
The "observability tax" — what high-cardinality telemetry actually costs
The honest version of the cost story: a Razorpay-scale traces stack (≈4M spans/sec at peak) running Tempo with full attribute storage and 7-day retention costs roughly ₹14-18 lakh / month in storage and compute (2024 numbers, single region, mostly object-storage retention). The same stack running metrics-only at the same telemetry coverage would cost ₹2-3 lakh / month. The 5-7× multiplier is the "observability tax" — what you pay for the dimensional flexibility. The tax is bounded by sampling — head-based sampling at 5% cuts the multiplier to ~1.3× — but at the cost of losing rare-event resolution, which is exactly the dimensional flexibility you wanted observability for. Tail-based sampling that always keeps error traces and slow traces (Part 5 of this curriculum dissects the trade-off) is the typical compromise: ~10% storage of full retention, ~98% retention of debug-relevant traces. The tax is real and it is the reason "monitoring is dead" was a marketing line, not an engineering plan.
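A quick consistency check on those multipliers, using the chapter's own numbers (₹ lakh per month; midpoints assumed):
# tax_arithmetic.py — the observability tax, before and after head sampling.
metrics_only = 2.5                 # midpoint of the ₹2-3 lakh metrics-only figure
traces_full = 16.0                 # midpoint of the ₹14-18 lakh traces figure
print(f"tax at full retention : {traces_full / metrics_only:.1f}x")   # ~6.4x, inside 5-7x
sampled = metrics_only + 0.05 * traces_full                           # 5% head sampling
print(f"tax at 5% head sample : {sampled / metrics_only:.1f}x")       # ~1.3x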
The cultural shift — generative runbooks and the platform-team trap
The technical distinction has a cultural counterpart that takes longer to land in a team than the tooling does. A monitoring-shaped team builds a runbook for every alert: "if checkout_p99_high fires, check panels A, B, C in this order; if you see X, do Y". The runbook is a fixed decision tree because the dimensions of failure were anticipated. An observability-shaped team writes runbooks of a different shape: "if any latency alert fires, slice the slow-trace stream by the top three attributes, look for concentration > 80%, follow the dominant attribute". The second runbook is generative — it tells the on-call how to form a question, not which question to ask. Razorpay's 2024 platform team rewrote 142 runbooks into roughly twenty generative ones during their observability migration; the page-to-mitigation time fell from a median of 31 minutes to 9 minutes, and the post-rewrite incident reviews stopped including the line "we did not have the right dashboard" — because the dashboard was now built at debug time, not at ship time.
The cultural shift has a structural failure mode worth naming: in larger Indian engineering organisations, the two halves of this chapter sometimes end up split across two different platform teams — a "monitoring platform" team that owns Prometheus and PagerDuty, and a separate "observability platform" team that owns Tempo, OpenTelemetry collectors, and Honeycomb. The intent is reasonable; the result is usually that the two stacks drift in semantics. The metric label service and the span attribute service.name end up populated by different conventions; the trace exemplar emitted by Prometheus points to a trace_id that the trace store has already aged out; an alert fires with a runbook that links to a Grafana panel parameterised by labels the trace pipeline does not carry. The on-call ends up doing the join in their head at 2 a.m., which is exactly when the join is most likely to be done wrong. The fix that has worked at Razorpay, Swiggy, and Hotstar is to consolidate the two teams under a single platform-engineering manager with a single semantic-conventions document — every label name, every attribute name, every retention window — owned and published as living code. The two halves of telemetry are different capabilities, but they are not different problems; they are the same problem solved at two cost points.
Reproduce this on your laptop
# Reproduce this on your laptop — the dimensional-flexibility ratio
docker run -d --name prom -p 9090:9090 prom/prometheus
# Tempo needs a config file and, for the demo emitter below, the OTLP gRPC port;
# mount a tempo.yaml (the example config from the Tempo repo works) if the image
# you pull does not ship one at /etc/tempo.yaml:
docker run -d --name tempo -p 3200:3200 -p 4317:4317 \
    -v "$(pwd)/tempo.yaml:/etc/tempo.yaml" \
    grafana/tempo:latest -config.file=/etc/tempo.yaml
python3 -m venv .venv && source .venv/bin/activate
pip install requests pandas opentelemetry-sdk \
opentelemetry-exporter-otlp prometheus-client
# emit a few traces with rich attributes (script in repo: emit_demo_traces.py)
python3 emit_demo_traces.py
# now run the dimensional-flexibility audit
python3 dimensional_flexibility.py
On a fresh laptop with the demo emitter (which produces ~200 spans with 18 attributes each), you should see a ratio of roughly metric_labels=4 vs span_attributes=18 — a 4.5× dimensional flexibility ratio. That is the same shape of number you will see in production at much larger scale; the ratio is what matters, not the absolute count.
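The repo's emit_demo_traces.py is not reproduced here; a minimal stand-in consistent with the description above (200 spans, 18 attributes each, OTLP gRPC to localhost:4317) would look roughly like this:
# emit_demo_traces.py (stand-in sketch; the repo's script is the authority)
import random
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "demo-checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

ATTRS = [f"demo.attr_{i:02d}" for i in range(18)]     # 18 attributes per span
for _ in range(200):                                  # ~200 spans
    with tracer.start_as_current_span("checkout") as span:
        for key in ATTRS:
            span.set_attribute(key, random.choice(["a", "b", "c"]))

provider.shutdown()   # flush the batch exporter before the process exits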
Where this leads next
The distinction this chapter draws is the operational hinge of the rest of Part 1. Several downstream chapters depend on it:
- Wall: metrics without a time-series store are useless — chapter 5, the closing chapter of Part 1, which argues that the storage substrate (TSDB vs column store) is what fixes which side of the monitoring-vs-observability line your data lives on. The choice of storage is the choice of dimensional cost curve.
- Cardinality: the master variable — chapter 3, the prerequisite to this chapter. Cardinality is the budget; observability is what you spend the budget on. A team with a healthy cardinality budget that puts everything in metric labels has built monitoring; a team that spends the same budget across spans and event rows has built observability.
- Why "three pillars" is a flawed framing — chapter 2 of this curriculum. The three-pillar frame describes the storage shapes of telemetry; this chapter describes the operational role of telemetry. The two framings cut across each other — events can power monitoring (counts of events) or observability (slices of events), and the role is what the team chooses, not what the storage forces.
- Part 5 (sampling) and Part 11 (alerting) both depend on the monitoring-vs-observability distinction operating cleanly: alerts live on the monitoring side (low-latency detection on pre-aggregated metrics), and sampling decisions live on the observability side (which traces to keep for ad-hoc debug). Mixing the two — alerting on tail-sampled traces, or sampling away the metrics that wake the on-call — is one of the most common platform-team mistakes, and Parts 5 and 11 spend their entire arcs untangling it.
- Part 10 (SLOs and error budgets) also leans on the distinction: SLOs are pre-defined questions about the service contract, and they belong on the monitoring side. The post-SLO debug — why did the budget burn this hour? — is observability work. The two halves of an SLO programme map cleanly onto the two halves of this chapter, and the burn-rate alerting Part 10 develops is the explicit handoff between them.
The sentence to take into the rest of this curriculum: monitoring detects, observability explains, and the dimensions you can ask about during the next incident are decided in the code that emits the telemetry today. Every later chapter in this curriculum is a consequence of that one decision.
The next time a page wakes you up at 2 a.m., the question to hold in your head is not "which dashboard should I check?" but "which dimensions of this request am I able to slice on right now?" — and if the answer is fewer than the dimensions the bug actually lives in, the post-incident work is to add those attributes to the emission, not to add another panel to the dashboard. The work that actually moves incident-resolution time from hours to minutes is upstream of every dashboard, in the line of code where a span is started and an attribute is attached. That is the line every reader of this chapter should now be able to identify in their own service, and the line every later chapter of this curriculum will, in one form or another, return to.
A final framing for the chapter: every later part of this curriculum can be read as a study of one of the two columns of the table above. Parts 2, 9, 10, and 11 are deep dives into the monitoring column — TSDB internals, dashboards, SLOs, alerting — because that is where the cheapness must come from.
Parts 3, 5, 12, 13, and 14 are deep dives into the observability column — distributed tracing, sampling, eBPF, OpenTelemetry, profiling — because that is where the dimensional flexibility must come from. Parts 4, 6, 7, 8, and 15 connect the two columns at specific failure modes (logs at scale, cardinality budgets, latency tail, compression, production debugging). The chapter you are reading now is the spine that holds the rest of those parts together; if at any later point in the curriculum you find yourself uncertain whether a topic is "monitoring stuff" or "observability stuff", returning to this chapter and asking "is the dimension being decided at emission time or at query time?" will give you the answer.
References
- Charity Majors, "Observability — A 3-Year Retrospective" (2019) — the article that fixes the operational definition of observability and explicitly contrasts it with monitoring.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — the foundational text. Chapter 1 ("The Need for Observability") is the canonical reading on this distinction.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapters 1 and 2 make the column-store-vs-TSDB argument that this chapter draws on for cost models.
- Google SRE Book — Monitoring Distributed Systems — the definitive reference for what monitoring is for. The four golden signals (latency, traffic, errors, saturation) are the canonical pre-defined questions monitoring answers.
- Honeycomb: "Observability is not Monitoring" — the vendor argument, useful for the column-store side of the comparison and for the explicit "monitoring is not deprecated" caveat.
- Rudolf Kálmán, "On the General Theory of Control Systems" (IFAC 1960) — the control-theory origin of "observability" as a system property. Useful background for the metaphor's limits.
- Cardinality: the master variable — chapter 3 of this curriculum. Cardinality is the budget; this chapter is what you spend it on.
- Why "three pillars" is a flawed framing — chapter 2 of this curriculum, which argues the pillar count is wrong; this chapter argues that the role of each pillar (monitoring vs observability) is what matters more than the count.