Metrics, logs, traces: what each is good at
At 21:47 IST during the IPL final, Aditi at Hotstar gets paged: checkout-api p99 latency: 4.8s. The Grafana panel confirms it — the line that lived under 240ms all evening is now hairline-thin and pointing at the ceiling. She has 25 million concurrent viewers and roughly four minutes before the next over starts and the second wave of subscription upgrades hits. The metric tells her something is wrong. It does not tell her which of the eighty microservices in the request path is the cause, and it does not tell her what the failing requests have in common. For those two questions she needs the other two pillars — and if she reaches for the wrong one, she will spend her four minutes reading the wrong dashboard.
This chapter is about why those three signals exist, what each one is actually good at, and which question each one answers cheaply. The hardest part is not learning their definitions; it is internalising that they are not interchangeable, and that cardinality — how many distinct label combinations you ask the system to remember — is the budget that decides which pillar you can afford at which fidelity.
A metric is a number-over-time you decided in advance to keep; a log is a structured event you wrote when something interesting happened in one process; a trace is the span tree of a single request as it crossed processes. Metrics are cheap and answer "is the system healthy"; logs are mid-price and answer "what did one process do"; traces are expensive and answer "where in eighty services did this request break". Cardinality is the variable that decides which of the three blows up your bill first.
The three pillars, side by side
The reason the three signals exist is that no single shape stores everything you want to know cheaply. A metric throws away every per-event detail in exchange for being aggregable across time and dimensions. A log throws away cross-process structure in exchange for keeping every per-event detail in one place. A trace keeps the cross-process structure but only for requests you decided to sample. Each is a lossy projection of the same underlying truth — your service did some work — chosen for a different question.
The temptation, when you first meet the three, is to ask "which one is best?" — as if you would pick one and discard the others. You will not. You will run all three, and the engineering question is how to size each pillar's budget so the combined bill is survivable. Hotstar's checkout-api emits roughly 80 metric series per pod across 400 pods (32K active series), about 4 GB/day of structured logs at peak, and keeps 1% of traces tail-sampled (plus 100% of error traces). Razorpay's UPI payments cluster runs the same shape with different constants — fewer metrics, more logs (every UPI transaction is auditable), comparable trace volume. The shape is universal; the constants are negotiable.
Why the three pillars exist as separate stores rather than one unified store: aggregation has a price. To answer "what was p99 latency for /checkout in ap-south-1 between 21:45 and 21:48 IST?" from raw events would require scanning every event in that window — too slow at 25M concurrent viewers. To answer it from a metric requires reading 3 numbers out of a 1.3-byte-per-sample compressed time series — milliseconds. The metric paid the aggregation cost in advance by deciding, at write time, which dimensions to keep (route, region) and which to discard (user_id, request_id). That decision is irreversible — you cannot ask the metric "which user_id had the worst latency?" because the metric never stored user_id. Logs and traces are the inverse trade: keep per-event detail, pay the scan cost at read time.
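To see that write-time decision in miniature, here is a toy sketch (hypothetical events, not Hotstar's schema): aggregate a handful of request events into a counter keyed only by route and status, and the per-user question becomes unanswerable from the aggregate.
# write_time_aggregation.py — why the metric cannot answer per-user questions.
# Toy events for illustration; the point is what survives aggregation, not the numbers.
from collections import Counter

events = [
    {"route": "/checkout", "status": 200, "user_id": "u_47812", "ms": 64},
    {"route": "/checkout", "status": 500, "user_id": "u_18204", "ms": 4830},
    {"route": "/checkout", "status": 200, "user_id": "u_90122", "ms": 71},
]

# write time: keep only the dimensions we decided on (route, status); drop everything else
series = Counter((e["route"], e["status"]) for e in events)
print(series)   # Counter({('/checkout', 200): 2, ('/checkout', 500): 1})

# read time: "what is the error rate?" is one cheap lookup...
print(series[("/checkout", 500)] / sum(series.values()))   # 0.333...
# ...but "which user_id was slowest?" is unanswerable — user_id never made it into
# the series; only the raw events (logs, traces) still know.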
What a metric is good at — and what it cannot do
A metric is a numeric measurement attached to a set of labels and recorded over time. The metric http_requests_total{service="checkout-api", method="POST", status="500"} is one time series: a sequence of (timestamp, value) pairs. The labels — service, method, status — are the dimensions you decided in advance to keep. The value is whatever you measured (a counter increment, a current value, a histogram bucket count). A Prometheus server stores millions of such series, each compressed to roughly 1.3 bytes per sample using Gorilla XOR encoding (chapter 8 derives why).
The pillar's strength is aggregability. "p99 of checkout-api latency over the last 5 minutes, split by region" is a query that touches a few thousand samples and returns in 50ms. "Error rate as a fraction of total request rate, alerting if it stays above 1% for 10 minutes" is the canonical SLO query and runs against a metric. You can graph a metric, alert on a metric, and compute burn rates against a metric — all of those are arithmetic on numbers-over-time, which is exactly what a metric is.
The pillar's weakness is per-event identity. The metric http_requests_total knows it counted 142,318 requests; it does not know who made them. If Aditi asks "which customer_id had the most timeouts during the IPL final?", the metric cannot answer — by design. Adding customer_id as a label would turn 32K active series into 32K × (number of distinct customers) = a number large enough to OOM the Prometheus server within minutes. This is the cardinality trap, and chapter 6 in Part 6 dedicates itself to keeping you out of it.
# emit_metrics.py — the simplest end-to-end metric demo with prometheus-client.
# pip install prometheus-client requests
from prometheus_client import Counter, Histogram, start_http_server
from prometheus_client.parser import text_string_to_metric_families
import requests, random, time, threading
REQUESTS = Counter(
"http_requests_total",
"HTTP requests by route, method, status",
["route", "method", "status"],
)
LATENCY = Histogram(
"http_request_duration_seconds",
"HTTP request duration in seconds",
["route"],
buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
def synthetic_load():
# simulate Hotstar checkout-api traffic: ~95% fast, ~4% medium, ~1% slow
while True:
roll = random.random()
if roll < 0.95: latency, status = random.uniform(0.020, 0.120), "200"
elif roll < 0.99: latency, status = random.uniform(0.300, 0.800), "200"
else: latency, status = random.uniform(2.0, 5.0), "500"
REQUESTS.labels(route="/checkout", method="POST", status=status).inc()
LATENCY.labels(route="/checkout").observe(latency)
time.sleep(0.005)
start_http_server(8000)
threading.Thread(target=synthetic_load, daemon=True).start()
time.sleep(3) # let some traffic accumulate
scrape = requests.get("http://localhost:8000/metrics").text
for family in text_string_to_metric_families(scrape):
if family.name in ("http_requests", "http_request_duration_seconds"):
for sample in family.samples:
if sample.name.endswith(("_total", "_count", "_sum")) or "+Inf" in str(sample.labels):
print(f"{sample.name}{sample.labels} = {sample.value}")
Sample run on a laptop:
http_requests_total{'route': '/checkout', 'method': 'POST', 'status': '200'} = 595.0
http_requests_total{'route': '/checkout', 'method': 'POST', 'status': '500'} = 6.0
http_request_duration_seconds_count{'route': '/checkout'} = 601.0
http_request_duration_seconds_sum{'route': '/checkout'} = 38.412...
http_request_duration_seconds_bucket{'route': '/checkout', 'le': '+Inf'} = 601.0
What you just saw, line by line. Counter(...) with ["route", "method", "status"] labels declares the dimensions you will keep — every distinct combination becomes its own time series. Histogram(...) with buckets=(...) declares the latency bucket boundaries; Prometheus stores one counter per bucket per label-set, and histogram_quantile() later interpolates p99 from those bucket counts (the Part 7 chapter shows why this interpolation is a lie when buckets are too wide). start_http_server(8000) opens the /metrics endpoint that Prometheus scrapes — your metrics never leave the process until somebody pulls them. text_string_to_metric_families(scrape) is the Python client's parser for the Prometheus/OpenMetrics text format — the same exposition format Prometheus ingests on every scrape. The output shape — {labels} = value — is the entire metrics-pillar API distilled into one line: this is what aggregability looks like on the wire.
Why a histogram and not just an average: at 21:47 IST your average latency might be 60ms while your p99 is 4.8s. The mean is dominated by the 95% of fast requests; the tail is what wakes you up. A histogram preserves the bucket distribution so you can extract any quantile later, at the cost of N counters per histogram (where N is the bucket count) instead of one. Average-only telemetry is the most common observability anti-pattern in junior teams — Part 7 pulls it apart.
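To make the bucket mechanics concrete, here is a rough sketch of the interpolation histogram_quantile() performs — not PromQL's actual implementation, just the same idea applied to cumulative bucket counts shaped like the synthetic load above (the counts are made up for illustration):
# quantile_from_buckets.py — rough sketch of histogram_quantile()-style interpolation
# over cumulative bucket counts (illustrative; not PromQL's actual implementation).
def quantile_from_buckets(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending with float('inf')."""
    total = buckets[-1][1]
    rank = q * total                          # which observation we are looking for
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound             # give up interpolating into the +Inf bucket
            # assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# made-up cumulative counts shaped like the synthetic load above:
# ~95% under 120ms, ~4% between 300–800ms, ~1% between 2–5s
buckets = [(0.025, 30), (0.05, 180), (0.1, 480), (0.25, 571), (0.5, 585),
           (1.0, 595), (2.5, 595), (5.0, 601), (float("inf"), 601)]
print(quantile_from_buckets(0.50, buckets))   # ~0.07 — the median sits around 70 ms
print(quantile_from_buckets(0.99, buckets))   # ~1.0  — the true p99 is closer to 0.8s;
                                              #         the wide 0.5–1.0s bucket smears it up
With wide buckets the estimate lands on a bucket boundary rather than on the true tail value — the distortion Part 7 dissects.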
What a log is good at — and what it makes you pay for
A log is a structured event emitted from one process when something interesting happened. "Riya's checkout for ₹4,820 succeeded after one retry, trace_id 7f3a..., took 312ms" is a log. Modern logs are JSON, not free-text, because grep on free-text scales to maybe a million events per day; structured logs scale to a billion. The structure is what lets the log store index the fields without parsing every line.
The pillar's strength is per-event ground truth, in one process. When Aditi opens the trace and finds the 4-second gap is on payments-api, her next move is to grep payments-api logs for that trace_id. The log will tell her: "downstream Razorpay sandbox returned 503 at 21:47:03, retried after 1.2s, succeeded after 2.8s, total 4.0s." No metric carries that level of per-event detail. No trace carries the content — what the response body said, what the request payload looked like, what the user-agent string was. Logs are where the actual story of one process's evening lives.
The pillar's weakness is scale. Hotstar at peak emits about 4 million log lines per minute across the fleet — at 600 bytes per JSON line, that is 2.4 GB/minute, 144 GB/hour, 3.4 TB/day at the IPL peak. Storing that for 30 days at S3-warm prices is meaningful money; storing it for 30 days with a full-text index is an order of magnitude more. The control knobs are label cardinality (Loki indexes only labels, never the log body — the Part 4 chapter unpacks why) and retention tiers — keep the last 24 hours hot, the last 7 days warm, the last 90 days cold, and the rest only as a compliance archive.
# loki_log_demo.py — emit structured logs to Loki, query with LogQL via logcli.
# pip install loguru python-logging-loki requests
import json, time, uuid, random
from loguru import logger
import logging_loki
handler = logging_loki.LokiHandler(
url="http://localhost:3100/loki/api/v1/push",
tags={"service": "checkout-api", "env": "ipl-prod", "region": "ap-south-1"},
version="1",
)
logger.add(handler, serialize=True)
# simulate a flurry of checkout events with structured fields
for i in range(200):
trace_id = uuid.uuid4().hex[:16]
user_id = f"u_{random.randint(10000, 99999)}"
amount = random.choice([149, 299, 499, 4820]) # ₹ subscription / pack
failed = random.random() < 0.02
status = "error" if failed else "ok"
logger.bind(
trace_id=trace_id, user_id=user_id, amount_inr=amount, retry_count=int(failed),
).log("ERROR" if failed else "INFO", f"checkout {status}")
time.sleep(0.01)
# query Loki: only the failed checkouts in the last 5 minutes
import subprocess
out = subprocess.run(
["logcli", "query", '{service="checkout-api", env="ipl-prod"} |= "checkout error"',
"--limit=10", "--quiet"],
capture_output=True, text=True,
)
print(out.stdout[:600])
Sample run after Loki accepted the pushes:
2026-04-25T16:17:54+05:30 {service="checkout-api", env="ipl-prod", region="ap-south-1"}
{"trace_id":"a4f1...","user_id":"u_47812","amount_inr":4820,"retry_count":1,
"level":"ERROR","message":"checkout error"}
2026-04-25T16:17:51+05:30 {service="checkout-api", env="ipl-prod", region="ap-south-1"}
{"trace_id":"7c93...","user_id":"u_18204","amount_inr":299,"retry_count":1,
"level":"ERROR","message":"checkout error"}
3 rows in 18 ms
Walk through the load-bearing lines. The tags dict — service, env, region — is the labelled stream identity in Loki; only these fields are indexed. Everything else (trace_id, user_id, amount_inr) lives in the JSON body and is searched at query time. logger.bind(trace_id=...) attaches the structured fields to this log line; serialize=True emits the JSON form Loki expects. {service="checkout-api"} |= "checkout error" is LogQL — the part inside {} is a label selector (uses the index), and |= is a content filter (scans, but only over the streams the label selector matched). The cost split — index-narrow first, scan-narrow second — is what lets Loki query 3.4 TB/day at single-digit-second latency without an Elasticsearch-shaped bill.
Why Loki indexes only labels and not log content: indexing every word of every log line is what made Elasticsearch expensive — the index is often larger than the data. Loki bets that operators almost always know which service they want to look at first, so it indexes service/env/region/instance and pays the linear-scan price for everything else. The bet pays off when label cardinality is bounded (~thousands of streams) and breaks when somebody puts request_id or user_id into a label, blowing cardinality into the millions and ruining the index. The Part 4 chapter of this curriculum tells the full story.
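If logcli is not installed, the same LogQL query can be sent straight to Loki's HTTP API; a minimal sketch, assuming the local Loki container from the reproduce section at the end of this chapter:
# loki_http_query.py — the same LogQL query as above, via Loki's HTTP API instead of logcli.
import time, requests

now_ns = int(time.time() * 1e9)
resp = requests.get(
    "http://localhost:3100/loki/api/v1/query_range",
    params={
        "query": '{service="checkout-api", env="ipl-prod"} |= "checkout error"',
        "start": now_ns - 5 * 60 * 10**9,   # last 5 minutes, in nanoseconds
        "end": now_ns,
        "limit": 10,
    },
    timeout=5,
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    print(stream["stream"])                  # the indexed labels (service, env, region)
    for ts, line in stream["values"]:        # each value is [timestamp_ns, raw log line]
        print(" ", ts, line[:120])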
What a trace is good at — and why it is the most expensive
A trace is the span tree of a single request as it traversed processes. Each span is a unit of work — edge-api received POST /checkout, checkout-api called payments-api, payments-api called Razorpay sandbox — with a start time, an end time, a parent span ID, and a bag of attributes. Glue together every span sharing the same trace_id and you get a Gantt chart of the request, with parent spans on top and child spans indented underneath. Hotstar's checkout request at the IPL final fans out into ~80 spans across 12 services; the canonical UPI payment at PhonePe fans out into ~25 spans crossing NPCI hops.
The pillar's strength is cross-process structure. No metric tells you that payments-api was the slow one; the metric only tells you that something upstream of edge-api was slow. No log tells you the relationship between the slow payments-api call and the user-facing /checkout request — the log lives in one process. Only the trace, with its parent-child span structure, makes "which of the 80 calls took 4 seconds?" a one-glance answer.
The pillar's weakness is cost-per-request-kept. A trace at 80 spans × ~200 bytes/span = 16 KB. At Hotstar's IPL peak (200K req/s on edge-api), keeping every trace would mean 3.2 GB/s of trace ingest — Tempo / Jaeger storage costs that nobody has the budget to absorb. So you sample. Head-based sampling (decide at the front door, ~1% kept) is cheap but loses 99% of error traces; tail-based sampling (decide at the collector, after spans are aggregated) keeps all errors plus a sample of successes but requires a stateful collector that holds spans for a few seconds before deciding. Part 5 of this curriculum is dedicated to sampling.
# trace_demo.py — emit a two-span checkout trace to Tempo via OTLP.
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import time, random, requests
resource = Resource.create({"service.name": "checkout-api", "deployment.env": "ipl-prod"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
def call_payments(trace_amount):
with tracer.start_as_current_span("call_payments") as sp:
sp.set_attribute("amount_inr", trace_amount)
sp.set_attribute("provider", "razorpay")
time.sleep(random.uniform(0.05, 0.30)) # synthetic downstream latency
if random.random() < 0.05:
sp.set_status(trace.Status(trace.StatusCode.ERROR, "razorpay 503"))
return False
return True
with tracer.start_as_current_span("POST /checkout") as root:
root.set_attribute("user.id", "u_47812")
root.set_attribute("amount.inr", 4820)
ok = call_payments(4820)
root.set_attribute("checkout.outcome", "ok" if ok else "failed")
provider.force_flush()
trace_id = format(root.get_span_context().trace_id, "032x")
print(f"emitted trace_id={trace_id}")
# fetch the trace back from Tempo to prove round-trip
time.sleep(2)
r = requests.get(f"http://localhost:3200/api/traces/{trace_id}")
batches = r.json().get("batches", [])
spans = sum(len(b["scopeSpans"][0]["spans"]) for b in batches)
print(f"tempo returned {spans} spans for trace_id={trace_id[:16]}...")
Sample run with a local Tempo running:
emitted trace_id=a4f1b2c8e7d9f01234567890abcdef01
tempo returned 2 spans for trace_id=a4f1b2c8e7d9f01...
The load-bearing pieces. Resource.create({"service.name": ...}) is the OpenTelemetry resource — every span emitted from this process carries these attributes, which Tempo uses to index by service. BatchSpanProcessor + OTLPSpanExporter is the standard SDK pipeline: spans are buffered in memory, batched every ~5 seconds, sent to the OTel collector or directly to Tempo over gRPC (port 4317). start_as_current_span is what makes the span tree happen — the with block pushes the new span onto a thread-local stack, so any further start_as_current_span inside call_payments automatically becomes a child. provider.force_flush() drains the batch buffer before exit; without it, the script ends before the spans leave the process and the trace never lands in Tempo. The GET /api/traces/{trace_id} round-trip is the Tempo HTTP API — proof that you can go from "I just emitted this trace" to "Tempo has it" in code, which is the foundation chapter 3 in Part 3 builds on.
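One correlation detail worth seeing in code before the cost discussion: the metric → trace → log handoff this chapter keeps referring to only works if log lines carry the trace_id. A minimal sketch of that pattern — log_event is a hypothetical helper, not an OTel API, and the TracerProvider here has no exporter because only the IDs matter for the demo:
# trace_log_correlation.py — stamp the active trace_id onto structured log lines.
# A sketch of the correlation pattern; log_event is a hypothetical helper, not an OTel API.
import json, logging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())   # no exporter needed for this demo
tracer = trace.get_tracer(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-api")

def log_event(message, **fields):
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")   # same hex form Tempo indexes
        fields["span_id"] = format(ctx.span_id, "016x")
    log.info(json.dumps({"message": message, **fields}))

with tracer.start_as_current_span("POST /checkout"):
    log_event("checkout ok", user_id="u_47812", amount_inr=4820)
    # -> {"message": "checkout ok", "user_id": "u_47812", "amount_inr": 4820,
    #     "trace_id": "a4f1...", "span_id": "..."}
Every log store in this chapter can then filter on that trace_id field, which is the handoff the common-confusions list below insists on.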
Why traces are the most expensive pillar despite carrying less detail than logs: a single user request becomes 80 separate span records, each with parent-pointer and attribute baggage. The metric for that same request is 1 increment of 1 counter (∼0.5 bytes after compression). The log might be 5 lines × 600 bytes = 3 KB. The full trace is 80 × 200 bytes = 16 KB. The cost ratio per request, roughly, is 1 : 6,000 : 32,000 (metric : log : trace). That ratio is why every production trace pipeline samples — keeping 100% of traces at non-trivial scale is the signature failure of teams that haven't run the math.
When to reach for which — the decision in production
Aditi at 21:47 IST should reach for the metric first (already on her dashboard — confirms the alarm), then the trace (which service is slow?), then the log on the slow service (what was the actual error?). Reaching for them in any other order wastes minutes she does not have. This is the pillar-ordering most production runbooks codify, and it follows from the cost ratio above: cheapest signal first, most expensive signal last, each one narrowing the search space for the next.
The reverse pattern — log-first triage — is the most common failure mode in teams that grew up on a single-binary application. They open Kibana, grep for "ERROR" across all services, get 60,000 results, and start reading. By the time they have narrowed to the right service, the alert has either auto-resolved or the next over has started. Logs answer the question "what did one process do" cheaply only after something else has told you which process to look at. Use them in that order.
There is also a pillar-respecting pattern for the prevention side, not just the triage side: design alerts on metrics, design dashboards on metrics + occasional traces, design audit trails on logs. Mixing these — alerting on a log line, dashboarding from raw traces — works at small scale and breaks at IPL scale. Chapter 11 (Part 11, alerting) and chapter 9 (Part 9, dashboards) make this rigorous.
Common confusions
- "Logs are just a metric you forgot to aggregate." No. A metric throws away per-event identity at write time and cannot be reconstructed from itself; a log keeps per-event identity at the cost of being aggregated at read time. You can derive a metric from logs (chapter 4 in Part 4 covers Loki recording rules), but you cannot derive a log from a metric — the per-event detail was discarded the moment you incremented the counter.
- "Distributed tracing replaces logs." No. A trace span has a 200-byte attribute bag; a log line has 600 bytes of structured content. Real teams emit both — the span is the skeleton of the request (which service called which, in what order, how long each took), and the log is the body on each service (what the response said, what the request payload looked like). Tying them together via trace_id in the log line is the standard pattern, and it is in chapter 3 of Part 3.
- "High cardinality is a problem only for metrics." False but understandable — Prometheus's cardinality cliff is the most famous version. Logs have content-cardinality (an unbounded request_id in the log body bloats the index in Elasticsearch but not in Loki, which is why Loki bets on label-cardinality only). Traces have fan-out cardinality — adding 5 more services to the request path turns a 75-span trace into an 80-span trace, and at 200K req/s with 1% sampled, that is roughly 2 MB/s of extra trace ingest. Cardinality is a budget for all three pillars; only the unit changes.
- "OpenTelemetry is metrics + logs + traces in one." OpenTelemetry is a spec and a wire format (OTLP) for all three signals, so you can emit them through one SDK. But the storage for each pillar is still separate (Prometheus / Mimir for metrics, Loki for logs, Tempo / Jaeger for traces). OTel unifies the producer side; the consumer side stays three databases by design, because the three signals have three different access patterns.
- "If I have all three pillars, I have observability." Telemetry is necessary, not sufficient. Observability is whether you can answer a new question about the system without redeploying — Charity Majors's definition. A team that emits all three pillars but has no trace_id in the log and no service.name in the metric still cannot do the metric → trace → log handoff. The pillars are useless unless their identifiers cross-reference, which is why every chapter of this curriculum cares about correlation — the connective tissue — at least as much as the pillars themselves.
- "You can compute p99 from a sample of logs." Only with coordinated-omission discipline. If your sampler drops slow requests (which most head-based samplers do, because slow requests are statistically rarer), the p99 you compute is the p99 of fast requests — useless. The Part 7 chapter on coordinated omission and the Part 5 chapter on tail-based sampling are the two reads that fix this. The short version: keep 100% of slow / error events even if you sample the rest at 0.1% — a sketch of that keep-rule follows this list.
Going deeper
The cardinality budget across the three pillars
Cardinality is the master variable for all three pillars, but the unit differs. For metrics, cardinality is the count of distinct label-value combinations — an active series count, typically billed by your TSDB at ~1.3 bytes/sample × samples-per-day × series-count. Hotstar runs about 32K active series per checkout-api fleet; doubling that doubles the Mimir bill. For logs, cardinality is two things: stream cardinality (Loki labels — service × env × region × instance, typically thousands) and content cardinality (distinct values inside the JSON body, which Loki does not index). For traces, cardinality is spans-per-request × kept-fraction × request-rate — the same engineering knob as metrics' active-series count, just phrased per request.
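Those formulas compress into a few lines of arithmetic; a back-of-envelope sketch using the chapter's own constants (approximations, not billing-grade numbers):
# pillar_sizing.py — back-of-envelope daily telemetry volume per pillar,
# using this chapter's constants; illustrative, not billing-grade numbers.
DAY_S = 86_400

# metrics: active series × samples/day × bytes/sample (Gorilla-style compression)
active_series, scrape_interval_s = 32_000, 15
metric_bytes = active_series * (DAY_S / scrape_interval_s) * 1.3

# logs: lines/minute × bytes/line, holding the IPL-peak rate all day (worst case)
log_lines_per_min, bytes_per_line = 4_000_000, 600
log_bytes = log_lines_per_min * 60 * 24 * bytes_per_line

# traces: request rate × spans/request × bytes/span × kept fraction
req_per_s, spans_per_req, bytes_per_span, kept = 200_000, 80, 200, 0.01
trace_bytes = req_per_s * DAY_S * spans_per_req * bytes_per_span * kept

for name, b in [("metrics", metric_bytes), ("logs", log_bytes), ("traces", trace_bytes)]:
    print(f"{name:8s} ~{b / 1e9:10,.1f} GB/day")
# metrics land around a quarter of a GB/day; logs ~3,456 GB; 1%-sampled traces ~2,765 GB
The log figure lines up with the 3.4 TB/day quoted earlier; the trace figure is what survives 1% sampling — without sampling it would be roughly a hundred times larger, which is the 3.2 GB/s nobody can afford.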
The non-obvious insight: you can move budget between pillars. If your team is paying ₹X/month for metrics and ₹10X/month for logs because nobody is sampling, the right move is often to drop log retention from 30 days to 7, raise metric cardinality by adding a tenant_id label (now affordable), and reach for logs only when the metric narrowed the search space. Razorpay made exactly this trade in their 2024 platform-team rewrite — they describe it in the engineering blog cited in references.
The "fourth pillar" — continuous profiling
In the last few years, continuous profiling has emerged as a serious candidate for fourth pillar — Pyroscope and Parca are the canonical implementations. A profile is "where in the code (which functions, which lines) did the time go", aggregated continuously at low overhead (~1% CPU). It complements traces: a trace tells you which span was slow; a profile tells you which line within the span was slow. Part 14 of this curriculum is dedicated to continuous profiling. We do not call it a pillar in this chapter only because the production tooling is younger and most teams have not adopted it yet — but it will land in your stack in the next two years if you work on latency-sensitive systems.
Why the pillar boundary is fuzzy at the bottom
If you squint hard enough, all three pillars are append-only event streams with different read indexes. A metric is an event stream where the events are pre-aggregated into time-bucketed counters; a log is an event stream where events are kept verbatim and indexed by labels; a trace is an event stream where events are spans tied together by trace_id. The Prometheus TSDB internally is just an event log with an inverted index on labels. Loki is the same shape with a chunkier compression scheme. Tempo is the same shape with trace_id as the primary index. Charity Majors's "events all the way down" thesis (Honeycomb's design) takes this seriously — store one wide event with many fields, and derive metrics, logs, and trace summaries at read time. It is technically beautiful and operationally hard to bill at petabyte scale, which is why most teams still run three stores. Chapter 17 (the discipline chapter) revisits this debate.
The Indian-scale data point
Razorpay's 2024 SRE blog reports that their alert noise dropped from 1,200 alerts/day to ~70/day after they rebuilt their alerting on multi-window burn-rate (Part 10) — but the first step in the rewrite was disambiguating which of the three pillars each alert was rooted in. Roughly half of the noise was alerts written against logs ("ERROR appeared in payments-api"), which fired on every transient retry; the SLO-burn-rate rewrite re-rooted those on metrics with for: 5m, which automatically dampens transient blips. Same telemetry, different pillar of root, two-orders-of-magnitude difference in pager noise. The pillar choice is not aesthetic; it is on-call sanity.
Reproduce this on your laptop
docker run -d --name prom -p 9090:9090 prom/prometheus
docker run -d --name loki -p 3100:3100 grafana/loki
docker run -d --name tempo -p 3200:3200 -p 4317:4317 grafana/tempo \
-config.file=/etc/tempo.yaml
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-otlp opentelemetry-instrumentation-flask \
loguru python-logging-loki requests flask
python3 emit_metrics.py # then: curl localhost:8000/metrics
python3 loki_log_demo.py # then: logcli query '{service="checkout-api"}'
python3 trace_demo.py # prints a trace_id; fetch from localhost:3200
After running the three scripts, you will have produced one metric scrape, one log stream, and one trace — the three pillars from one laptop, end-to-end, in roughly twelve minutes of setup time. The numbers in your output will not match the article's exactly (your randomness is yours), but the shape should be identical: counters with label-tagged values, JSON log lines with trace_id fields, span trees with parent-child structure. That shape is the thing this whole curriculum unpacks.
Where this leads next
The next four chapters take the same shape and dig deeper into each pillar.
- Why "more telemetry" is not the same as observability — chapter 2, where we draw the line between emitting signals and being able to answer a new question. Charity Majors's distinction, made concrete with examples from Razorpay's pre- and post-rewrite alert lists.
- Cardinality, the master variable — chapter 3, where we make the cost equation for each pillar precise and walk through Flipkart's Big Billion Days cardinality blow-up when somebody added pincode as a Prometheus label.
- What "events" really are: the unit of telemetry — chapter 4, where we examine the Honeycomb thesis that all three pillars are projections of one underlying event stream and what that means for SDK design.
- The cost equation: what each pillar bills you for — chapter 5, the back-of-envelope sizing exercise for a 100-service fleet at IPL-final scale.
Part 1 ends at chapter 9 with a concrete sizing model for a Hotstar-scale telemetry pipeline; Parts 2 through 4 then drill into each pillar individually (metrics, logs, traces). By the time you finish Part 4 you will have written a meaningful amount of code against prometheus-client, loguru, and opentelemetry-sdk — the article you are reading is the orientation, not the destination.
References
- Distributed Systems Observability — Cindy Sridharan, O'Reilly 2018. Chapter 3 is the canonical statement of the three-pillar model; this article's framing draws directly from it.
- Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda, O'Reilly 2022. Chapters 1–2 argue that "three pillars" is the wrong frame and that events are the underlying unit; the going-deeper section above engages with their thesis.
- Prometheus design: data model and storage — official Prometheus docs. The reference for how http_requests_total{...} is stored as a time series and what cardinality actually means in the TSDB.
- Loki: like Prometheus, but for logs — Grafana Labs, the design blog post that explains why Loki indexes labels and not content.
- OpenTelemetry tracing concepts — official OTel docs. The reference for spans, trace context, and the span-tree shape.
- Scaling payments data engineering pipelines from 1M to 1B events/day — Razorpay engineering blog. Real Indian-scale numbers for the kind of telemetry volume this chapter sketches.
- Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015) — Pelkonen et al., Facebook. The paper that derives the 1.3-bytes-per-sample number used in this chapter; Part 8 of this curriculum unpacks it line by line.
- What makes a data pipeline different from a script — the cross-domain gold-standard chapter on what "running unattended forever" demands. The framing is the same one this observability curriculum extends to telemetry.