Why "three pillars" is a flawed framing (profiles, events, SLOs)
It is 23:14 IST on a Tuesday and Karan, an SRE at Swiggy, is staring at a checkout-api dashboard that is lying to him: the metric line is flat at p99 = 180ms (under the 250ms SLO), the logs show no errors, and the one slow trace he pulled from Tempo is a single 1.8-second span called serialize_response with no children — a black box. The three pillars are doing exactly what the conference talks promised, and he is no closer to the bug than when he was paged. The bug is in a regex inside a JSON serialiser that has been quietly catastrophic-backtracking on one merchant's address field for two weeks; a continuous-profile flamegraph would have shown it in five seconds, and Karan does not have one running in production.
This chapter is about why the "three pillars" mental model — metrics, logs, traces, full stop — is the most over-cited and under-examined idea in modern observability, and what the fourth and fifth pillars (profiles, events) and the organising principle that sits above all of them (SLOs) actually buy you. The framing is not wrong; it is incomplete in a way that bites you on exactly the kinds of incidents you most need to triage cleanly.
"Three pillars" is a useful starting taxonomy but a poor design principle: it tells you which signal shapes exist, not which questions you must be able to answer. Profiles answer "where in the code did time go", events answer "what actually happened to one request end-to-end", and SLOs decide which of the answers you should be paged for. A team that ships metrics + logs + traces and stops there has the inventory but not the discipline.
Why the framing exists, and what it leaves out
The three-pillar model came out of the early-2010s post-microservices crisis, when teams realised that single-binary monitoring tools (Nagios, Munin, statsd-graphite) did not survive contact with a service mesh. Cindy Sridharan's 2017 talk and 2018 O'Reilly book named the three signals — metrics, logs, traces — that any service-of-services would need to emit. The framing did its job: it gave a generation of engineers vocabulary for the SDKs they had to install. OpenTelemetry, Prometheus, Loki, and Tempo are the institutional answer to that talk.
But "vocabulary for the SDKs you install" is not the same thing as "design principle for an on-call system that has to survive the next IPL final". The flaws show up in three places — the code-level blind spot that only profiles can see, the high-cardinality questions that only events can answer, and the alert-prioritisation problem that only SLOs can solve — and teams that only think in three pillars consistently get burned at all three.
Why this matters before we even define the new pillars: each of the three failures above corresponds to a real outage pattern that has cost real Indian companies real money. Karan's regex backtracking at Swiggy cost roughly six hours of degraded checkout before a senior engineer attached py-spy by hand. Razorpay's 2024 alert-noise rewrite (the canonical case study in Part 11 of this curriculum) reduced 1,200 alerts/day to ~70 by re-rooting them as SLO burn-rate alerts. Hotstar's 2023 IPL incident where a single dropped trace correlation made the metric → log handoff impossible cost an estimated 14 minutes of mean-time-to-detect because the team had no event abstraction. None of these three are fixable by adding more metrics, more logs, or more traces.
The fourth pillar: continuous profiling
A profile is "where in the code — which functions, which lines, which call paths — was the CPU spending its time". A flamegraph is the standard rendering: the y-axis is call-stack depth, the x-axis spans the whole sample population (frames are sorted alphabetically — it is not a time axis), and the width of each rectangle is the fraction of samples that passed through that function. Brendan Gregg's flamegraph is the canonical visualisation; Pyroscope and Parca are the canonical production implementations.
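The data structure under a flamegraph is simpler than the rendering suggests: a profiler emits sampled call stacks, and the collapsed-stack format that flamegraph tooling consumes is each stack joined with semicolons plus a sample count. A minimal sketch, using invented stack frames:

```python
from collections import Counter

# Each sample is one captured call stack, leaf frame last. These stacks
# are invented for illustration, not taken from a real profiler.
samples = [
    ("main", "handle_request", "serialize_response", "re_match"),
    ("main", "handle_request", "serialize_response", "re_match"),
    ("main", "handle_request", "serialize_response", "json_encode"),
    ("main", "handle_request", "db_query"),
]

# Collapsed-stack format: "frame1;frame2;... count" — one line per unique
# stack, which is what flamegraph.pl-style renderers consume.
counts = Counter(";".join(stack) for stack in samples)
for stack, n in counts.items():
    print(stack, n)

# A rectangle's width in the rendered flamegraph is the fraction of all
# samples whose stacks pass through that frame.
total = len(samples)
width = sum(n for s, n in counts.items() if "re_match" in s.split(";")) / total
print(f"re_match width: {width:.0%}")  # re_match appears in 2 of 4 samples
```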
The reason profiling becomes a fourth pillar — not just a development-time tool — is that modern profilers are cheap enough to run continuously in production. Pyroscope's Python agent samples about 100 stack traces per second per process at roughly 1% CPU overhead. At that overhead you can leave it on across the whole fleet, store the profiles for 14 days, and ask "which function consumed CPU on the checkout-api between 23:13 and 23:15 IST" the same way you would ask a metric "what was p99 latency".
A trace tells you which span was slow. A profile tells you which line within the span was slow. The two answer adjacent questions, and there is no projection of one onto the other — a span attribute saying regex_compile_count: 14000 is a custom-instrumentation guess; a profile is the ground truth, captured by sampling the running interpreter.
# pyroscope_demo.py — attach a continuous profiler to a Flask checkout endpoint
# and demonstrate that the profile catches a CPU hotspot the metric misses.
# pip install pyroscope-io flask requests
import random
import re
import threading
import time

import pyroscope
import requests
from flask import Flask, request

pyroscope.configure(
    application_name="swiggy-checkout-api",
    server_address="http://localhost:4040",
    sample_rate=100,  # 100 stack samples/sec
    tags={"env": "demo", "region": "ap-south-1"},
)

# The pathological regex: catastrophic backtracking on input with many 'a's.
BAD_RE = re.compile(r"^(a+)+$")
GOOD_FAST_RE = re.compile(r"^a+$")

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # 99% of requests: a benign address that matches quickly.
    # 1% of requests: a Swiggy merchant whose address triggers backtracking.
    addr = request.args.get("addr", "abc123")
    matched = bool(BAD_RE.match(addr) or GOOD_FAST_RE.match(addr))
    return {"ok": matched, "addr_len": len(addr)}, 200

def synthetic_load():
    time.sleep(1)  # give the Flask server a moment to come up
    while True:
        if random.random() < 0.01:
            addr = "a" * 28 + "!"  # forces catastrophic backtracking
        else:
            addr = "abc" + str(random.randint(100, 999))
        try:
            requests.get(f"http://localhost:5000/checkout?addr={addr}", timeout=5)
        except requests.RequestException:
            pass
        time.sleep(0.005)

threading.Thread(target=synthetic_load, daemon=True).start()
app.run(port=5000, threaded=True)
Sample run after letting the script execute for 60 seconds and opening http://localhost:4040 in a browser:
Top 5 functions by self-CPU (last 60s, swiggy-checkout-api):
re.Pattern.match 71.4%
re._compile 6.8%
json.encoder.JSONEncoder.iterencode 4.1%
flask.Flask.dispatch_request 3.2%
werkzeug.serving.WSGIRequestHandler.handle 2.7%
Total samples: 5,847 Sample rate: 100/sec Overhead: ~0.9% CPU
What you just saw, line by line. pyroscope.configure(application_name=..., sample_rate=100) registers the process with the Pyroscope server and starts a background sampler that captures the Python interpreter's call stack 100 times per second — the same mechanism py-spy uses, but pushed continuously to a server instead of dumped to a one-off SVG. tags={"env": "demo", "region": "ap-south-1"} are the equivalent of metric labels; queries on the Pyroscope side filter by them. The pathological regex r"^(a+)+$" is the canonical Python catastrophic-backtracking demo; on an input of 28 a's followed by a non-matching character, the engine explores on the order of 2^28 paths before concluding the match fails. The synthetic-load thread mixes 99% benign and 1% malicious inputs — exactly the production shape where the metric (p99 latency) is dominated by the 99% and silent about the 1%. The flamegraph the server renders shows the re.Pattern.match rectangle eating 71% of CPU even though only 1% of requests trigger it — the metric does not see this; the profile does.
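The exponential blow-up in that regex is easy to verify directly. A quick timing sketch (absolute times are machine-dependent; the doubling shape is not):

```python
import re
import time

BAD_RE = re.compile(r"^(a+)+$")  # the same pathological pattern as above

# Each extra 'a' roughly doubles the number of backtracking paths the
# engine must explore before concluding that the match fails.
timings = {}
for n in (16, 18, 20, 22):
    s = "a" * n + "!"  # the trailing '!' forces the failure path
    t0 = time.perf_counter()
    BAD_RE.match(s)
    timings[n] = time.perf_counter() - t0

# Two more 'a's means two doublings, so consecutive rows grow roughly 4x.
for n, t in timings.items():
    print(f"n={n}: {t * 1000:8.2f} ms")
```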
Why a profile catches what a metric and a trace miss: the metric is an aggregate that has already discarded per-request distinguishability — a histogram bucket count cannot tell you that 1% of requests took 1.8s; you only see the bucket boundary they crossed. The trace, even when sampled tail-based to keep the slow ones, shows a single span serialize_response because there is no instrumentation point inside the regex engine. A profile is sample-based on the interpreter, not on your instrumentation — it sees every Python frame whether you instrumented it or not. This is the property that makes profiles the right tool for "the bug is somewhere I forgot to add a span".
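To make the histogram argument concrete, here is a sketch of what bucket aggregation does to the opening scenario's workload: 99% of requests at ~80ms, 1% stuck at ~1.8s. The bucket boundaries and the tiny histogram_p interpolator are illustrative stand-ins for a Prometheus histogram and its histogram_quantile() function:

```python
import bisect

# Illustrative Prometheus-style bucket upper bounds, in milliseconds.
bounds = [50, 100, 250, 500, 1000, 2500, float("inf")]

# Deterministic stand-in for the opening scenario: 99% of requests at
# ~80ms, 1% stuck in catastrophic backtracking at ~1800ms.
latencies = [80] * 99_000 + [1800] * 1_000

counts = [0] * len(bounds)
for ms in latencies:
    counts[bisect.bisect_left(bounds, ms)] += 1

def histogram_p(q):
    """Linear interpolation inside the target bucket — the same idea as
    Prometheus's histogram_quantile(), sketched here for illustration."""
    rank = q * len(latencies)
    cum, lo = 0, 0
    for ub, c in zip(bounds, counts):
        if cum + c >= rank:
            return lo + (ub - lo) * (rank - cum) / c
        cum += c
        lo = ub

exact_p99 = sorted(latencies)[int(0.99 * len(latencies))]
print("exact p99:", exact_p99)               # 1800 — the 1% is visible
print("histogram p99:", histogram_p(0.99))   # 100.0 — the 1% has vanished
```

The deterministic 99,000/1,000 split makes the failure stark: the interpolated answer lands at the benign bucket's upper bound, a full 18x below the true p99.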
The fifth pillar: events (the wide-row foundation)
An event is one wide structured record describing one unit of meaningful work — typically one request, one job execution, or one user action. A canonical event for Karan's checkout request might have 80 fields: trace_id, user_id, merchant_id, cart_total_inr, payment_method, device_type, region, app_version, ab_test_variant, latency_ms, outcome, retry_count, plus dozens of business-domain fields. Honeycomb's argument — Charity Majors's "events all the way down" — is that this wide row is the underlying data structure, and metrics, logs, and traces are read-time projections of it.
The reason events deserve to be called a pillar separately from logs is that the access pattern is different. A log is read by service and time-window, then grep. An event is read by arbitrary high-cardinality dimensions — "what does p99 look like for merchant_id=swiggy_genie AND region=blr-east-2 AND app_version=4.7.1 in the last 30 minutes?". That query is unanswerable on a metric (cardinality cliff), painful on logs (no efficient index on body fields), and natural on a column-oriented event store. Honeycomb, ClickHouse, AWS CloudWatch Logs Insights, and BigQuery's partitioned tables all serve this access pattern.
# events_demo.py — emit one wide event per request to a column store (DuckDB
# stands in for ClickHouse / Honeycomb here so the demo runs on a laptop).
# pip install duckdb pandas
import random
import uuid
from datetime import datetime

import duckdb

con = duckdb.connect("events.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS checkout_events (
        ts TIMESTAMP, trace_id VARCHAR, user_id VARCHAR, merchant_id VARCHAR,
        region VARCHAR, app_version VARCHAR, payment_method VARCHAR,
        cart_total_inr INTEGER, latency_ms INTEGER, outcome VARCHAR,
        retry_count INTEGER, ab_variant VARCHAR
    )
""")

def synth_event():
    # Naive local timestamps keep the now() comparison in the query simple
    # (a UTC-aware datetime would be re-interpreted in the session timezone).
    return {
        "ts": datetime.now(),
        "trace_id": uuid.uuid4().hex[:16],
        "user_id": f"u_{random.randint(10000, 99999)}",
        "merchant_id": random.choice(["swiggy_genie", "instamart", "dineout", "minis"]),
        "region": random.choice(["blr-east-2", "del-north-1", "mum-west-3"]),
        "app_version": random.choice(["4.7.0", "4.7.1", "4.8.0"]),
        "payment_method": random.choice(["upi", "card", "wallet", "cod"]),
        "cart_total_inr": random.randint(99, 4999),
        "latency_ms": int(random.lognormvariate(4.2, 0.8)),  # tail-heavy
        "outcome": "ok" if random.random() > 0.02 else "error",
        "retry_count": 1 if random.random() < 0.05 else 0,
        "ab_variant": random.choice(["control", "v2_layout"]),
    }

# Emit 50,000 events — one wide row per request.
rows = [synth_event() for _ in range(50_000)]
con.executemany(
    "INSERT INTO checkout_events VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
    [tuple(r.values()) for r in rows],
)

# The query a "metric" cannot answer cheaply: p99 latency by merchant × region × app_version.
df = con.execute("""
    SELECT merchant_id, region, app_version,
           COUNT(*) AS n,
           quantile_cont(latency_ms, 0.99) AS p99_ms,
           SUM(CASE WHEN outcome = 'error' THEN 1 ELSE 0 END) AS errs
    FROM checkout_events
    WHERE ts > now() - INTERVAL '5 minutes'
    GROUP BY merchant_id, region, app_version
    HAVING COUNT(*) > 50
    ORDER BY p99_ms DESC
    LIMIT 10
""").df()
print(df.to_string(index=False))
Sample run (one execution on a laptop):
 merchant_id      region app_version    n  p99_ms  errs
swiggy_genie  mum-west-3       4.8.0 1402   468.0    31
   instamart  blr-east-2       4.7.1 1377   455.0    29
swiggy_genie del-north-1       4.7.0 1415   449.0    26
   instamart  mum-west-3       4.7.1 1388   441.0    28
swiggy_genie  blr-east-2       4.8.0 1361   436.0    24
     dineout  blr-east-2       4.7.0 1429   430.0    33
   instamart del-north-1       4.8.0 1394   424.0    27
       minis  mum-west-3       4.7.1 1352   419.0    25
swiggy_genie  blr-east-2       4.7.1 1407   413.0    30
       minis  blr-east-2       4.8.0 1370   408.0    26
The load-bearing pieces. One row per request, with twelve dimensions — a metric with this many high-cardinality labels would explode the active-series count into the millions; a column store reads only the columns the query needs and ignores the rest. quantile_cont(latency_ms, 0.99) over a GROUP BY merchant × region × app_version is the query that proves the point — you slice by three high-cardinality dimensions simultaneously, get a real per-slice p99 (not interpolated from a histogram), and the response time is sub-second on 50,000 rows. HAVING COUNT(*) > 50 is the discipline of not reporting p99 from samples too small to be statistically meaningful — a discipline metric dashboards routinely ignore. The same query on a Prometheus-shape store would require pre-declaring all merchant × region × app_version × … combinations as labels — which is the cardinality trap chapter 3 of Part 1 dedicates itself to.
Why "events" is a pillar separate from "logs", even though both look like JSON records: a log is what one process said happened — typically one log line per noteworthy thing, and many lines per request. An event is what happened to one request, end to end — one wide row aggregated across processes, with all relevant dimensions in one place. Logs are emitted unbidden during execution; events are emitted deliberately at request completion. The query patterns differ accordingly: logs are read by stream + time + content-grep; events are read by arbitrary dimension + aggregate function. Loki is built for the first; Honeycomb / ClickHouse for the second.
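The shape difference is easiest to see side by side: the same checkout request as the log stream records it versus as the event store records it. All field values below are invented for illustration:

```python
# What the log stream receives: many narrow records per request, each a
# point-in-time statement from one process. (Illustrative field values.)
log_lines = [
    {"ts": "2025-01-07T17:44:03.120Z", "svc": "checkout-api", "level": "INFO",
     "msg": "cart validated", "trace_id": "a1b2c3d4"},
    {"ts": "2025-01-07T17:44:03.410Z", "svc": "payment-svc", "level": "INFO",
     "msg": "upi intent created", "trace_id": "a1b2c3d4"},
    {"ts": "2025-01-07T17:44:04.980Z", "svc": "checkout-api", "level": "WARN",
     "msg": "slow serialize", "trace_id": "a1b2c3d4"},
]

# What the event store receives: one wide record for the whole request,
# assembled deliberately at completion, with every dimension you might
# later want to slice by sitting in the same row.
event = {
    "trace_id": "a1b2c3d4", "merchant_id": "swiggy_genie", "region": "blr-east-2",
    "app_version": "4.7.1", "payment_method": "upi", "cart_total_inr": 749,
    "latency_ms": 1860, "outcome": "ok", "retry_count": 0,
}

print(f"{len(log_lines)} log lines vs 1 event with {len(event)} dimensions")
```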
The organising principle: SLOs above all pillars
The fourth and fifth pillars patch the data side of the framing. The deeper flaw — the one that decides whether the whole observability stack is paying for itself — is at the alerting and on-call layer, and patching it requires a different abstraction: the SLO.
A Service Level Objective is a contract. "The p99 latency of /checkout will be under 250ms for 99.9% of the time, measured over a rolling 30-day window" is an SLO. The complement of the target — the 0.1% of time you are allowed to be over — is the error budget. The rate at which you are consuming that budget is the burn rate. A multi-window burn-rate alert pages you when both a 1-hour burn rate and a 6-hour burn rate exceed thresholds, which is mathematically equivalent to "at this rate, the monthly budget will be gone in about two days". This is not a metric threshold; it is a forecast on a metric threshold, which is a categorically different thing.
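The budget-and-burn arithmetic is small enough to sketch in a few lines. The 14.4 and 6.0 thresholds are the Google SRE Workbook's standard constants for a 30-day window; the observed error rates fed in are invented inputs:

```python
WINDOW_DAYS = 30
TARGET = 0.999              # the 99.9% SLO
BUDGET = 1 - TARGET         # fraction of requests allowed to fail

# Burn rate = observed error rate / allowed error rate. Burn rate 1.0
# means the budget lasts exactly the 30-day window; 14.4 means it is
# gone in 30 / 14.4 ≈ 2.1 days.
def burn_rate(error_rate: float) -> float:
    return error_rate / BUDGET

def should_page(err_1h: float, err_6h: float) -> bool:
    """Two-window, two-burn-rate page condition (SRE Workbook constants):
    both the fast and the slow window must be burning hot, which filters
    out one-off spikes that the 1h window alone would page on."""
    return burn_rate(err_1h) > 14.4 and burn_rate(err_6h) > 6.0

budget_minutes = BUDGET * WINDOW_DAYS * 24 * 60
print(f"monthly error budget: {budget_minutes:.0f} minutes of full outage")

# A transient blip: 1h average error rate 0.5%, 6h average 0.1%.
print(should_page(err_1h=0.005, err_6h=0.001))   # False — the budget absorbs it
# A sustained incident: 2% errors held for hours.
print(should_page(err_1h=0.02, err_6h=0.0065))   # True — this pages
```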
The reason SLOs deserve to sit above the pillars rather than alongside them is that the SLO decides which signal you trust at any given moment. If the SLO says "user-facing checkout p99", then serialize_response taking 1.8s matters; an internal cache miss on a non-critical-path service does not. The same metric — internal_cache_miss_total — is page-worthy in a system without SLOs ("threshold crossed!") and unactionable noise in a system with SLOs ("not in the SLI numerator, ignore"). Razorpay's 2024 rewrite found that ~94% of their pre-rewrite alerts fell into the second category once SLOs were drawn.
The practical consequence: a team operating with this frame writes alerts on the SLO layer (multi-window burn rate on the SLI), uses metrics to compute the SLI, and reaches for logs / traces / profiles / events only during triage — never as the source of an alert. This is the discipline Part 10 (SLOs) and Part 11 (alerting) of this curriculum spend twenty chapters on. The single most common failing of a "we have three pillars" team is alerting on raw metric thresholds — cpu_usage > 80%, error_count > 10 — which fires regardless of whether any user is actually feeling pain.
Edge cases the framing still gets wrong
Even the five-pillar-plus-SLO frame leaves three real gaps that production teams hit. Naming them up front saves the chapter from sounding triumphal.
The first gap is client-side telemetry. Every pillar above lives in your backend. The user tapping Pay on the Hotstar mobile app at 21:47 IST is producing telemetry too — render times, JS errors, crash reports, network timing — and that telemetry is what actually reflects what the user feels. Real User Monitoring (RUM) is not a sixth pillar so much as the same five pillars on the client side, with separate cost and privacy constraints (you cannot keep PII; you have a 4G upload budget). Every Indian app-first team — Flipkart, Swiggy, Cred, Dream11 — runs a parallel RUM stack and stitches it to the backend by trace_id. The OpenTelemetry JavaScript SDK is stable for traces (browser instrumentation is still maturing); treat client-side telemetry as part of the same observability program, not a separate one.
The second gap is business metrics. Razorpay cares about payment_success_rate per merchant_id per payment_method; Zerodha cares about orders_placed_per_user per instrument_segment; Swiggy cares about cart_to_order_conversion_rate per city × cuisine. None of these are infra metrics; all of them are SLI candidates. The frame above accommodates them inside the events pillar (wide rows with business dimensions) but does not by itself force you to instrument them. The discipline of treating business outcomes as first-class SLIs is what Part 17 of this curriculum eventually argues for.
Why these edge cases matter even with the five-pillar frame: the gap is not that the pillars are missing the data, it is that the organisational habit of treating "observability" as an SRE-team concern leaves client-side and business-domain telemetry orphaned. The Indian teams who got this right (Razorpay's payment-method dashboards, Zerodha's market-hour SLIs) did so by making the SLO list itself include business outcomes — at which point the existing pillars carry the data automatically.
The third gap is emergent dependencies you did not instrument. eBPF (Part 12 of this curriculum) is the answer: kernel-level observability that requires no application change, and that catches the cross-process and cross-syscall behaviour your SDK-installed pillars cannot see. eBPF is sometimes called the sixth pillar; it is more honest to call it the substrate on which the other five increasingly run, because BCC / bpftrace / Cilium Hubble / Pixie are how the other pillars get cheaper and lower-overhead over time. The frame absorbs eBPF as an implementation detail of how the pillars are gathered, not as a separate question to be answered.
Common confusions
-
"Profiles are just a development tool." False — they were until ~2021 and they are now production-grade. Pyroscope and Parca run at ~1% CPU overhead, store profiles for weeks, and are queryable by service × time × tags the same way Prometheus is queryable. If your stack is Java / Go / Python / Node and you do not have a continuous profiler, you are flying blind on every CPU-bound bug. Part 14 of this curriculum covers the production deployment.
-
"Events are the same as logs." They overlap, but the access pattern differs. A log is many records per request, indexed by service and time, grep-able by content. An event is one wide record per request, queryable by any dimension at any cardinality. Honeycomb, ClickHouse, and BigQuery serve events; Loki and Elasticsearch serve logs. A team can choose to store everything as events and project logs out of them at read time (the Honeycomb thesis), but the underlying store has to be column-oriented for the projections to be cheap.
-
"SLOs are just metrics with thresholds." Misleading. A metric threshold (p99 > 250ms) fires on every transient blip; an SLO is a budget that allows transient blips and pages only when the rate of consumption threatens to exhaust the budget over the SLO window. Multi-window, multi-burn-rate alerts use two burn rates over two time windows specifically to filter out one-off spikes. The Google SRE book chapter on practical alerting derives this from first principles; Part 10 of this curriculum reproduces the math.
-
"OpenTelemetry is for the three pillars." The OpenTelemetry spec covers metrics, logs, and traces explicitly. Profiles are an in-progress signal type (the OTel profiles data model reached experimental status in 2024). Events have no first-class signal in OTel — they are typically encoded as wide spans with many attributes, or as logs with rich JSON bodies, depending on the destination store. The framing is catching up with the practice; the practice is ahead of the spec.
-
"Five pillars is just three pillars plus two more." No — the organising principle changes. Three pillars is a shopping list; five pillars + SLOs is a frame where SLOs decide which signal you trust for which question, and the additional two pillars (profiles, events) close real blind spots. Adding more pillars without the SLO discipline produces a team that has more telemetry and the same on-call pain.
-
"Three pillars is wrong, so I should ignore it." Also no. Metrics, logs, traces are still the three SDK shapes you ship, and Parts 2–4 of this curriculum are dedicated to them. The three-pillar framing is incomplete, not incorrect — the chapters of this curriculum that follow this one fix the gaps without throwing away the foundation.
Going deeper
Where the "three pillars" name actually came from
The phrase predates Cindy Sridharan's 2017 talk by a few years — it shows up in early Honeycomb, Datadog, and New Relic marketing decks circa 2014–2015, where each vendor was selling whichever pillar they shipped most. By the time Sridharan codified it in the 2018 O'Reilly book, it had become the de facto vocabulary for the post-microservices observability conversation. Charity Majors's 2022 Observability Engineering book (with Liz Fong-Jones and George Miranda) explicitly argues that the framing was wrong from the start — Honeycomb's bet on events as the underlying primitive predates the three-pillar framing and was always its competitor. Reading Sridharan's book and Majors's book back-to-back is the fastest way to internalise the debate.
Why "telemetry" and "observability" are different words
Telemetry is the signal you emit. Observability is the property of being able to answer a new question without redeploying. A team that emits metrics, logs, and traces but cannot pivot to a question they did not pre-instrument has telemetry, not observability. The high-cardinality dimensional access that events enable is the operational reason the distinction matters: only a column-oriented event store lets you ask "p99 by merchant × region × app_version" without having pre-declared the cross-product as labels. If you cannot run that query right now on your production data, you have a telemetry problem masquerading as an observability program. The Razorpay 2024 platform-team rewrite quotes this distinction verbatim in their internal post — it is the most cited paragraph of Majors's book inside Indian SRE circles.
Profiles vs traces — a worked example from Zerodha
Zerodha Kite's order-placement endpoint is the canonical Indian-trading SLO example: p99 ≤ 200ms, measured across the 09:15–15:30 IST trading session, sub-millisecond clock discipline. In 2023 they hit a regression where p99 climbed from 180ms to 240ms over a fortnight without any new deploys. The trace showed a single span db_round_trip that was the slow one — but the database was healthy. The root cause was a hash-collision pattern, introduced by a CPython 3.11 security patch that changed dict hashing behaviour, which turned one O(1) lookup into a near-O(n) probe chain in a hot loop inside the database client library. No span instrumented the dict lookup; no log mentioned it; no metric counted it. The 30-second flamegraph that finally found it showed dict.__getitem__ eating 22% of CPU on the order-placement path. This is exactly the bug class profiles exist for.
Where SLOs come from — the Google SRE distillation
The SLO framework as Indian platform teams use it today is descended from the Google SRE practice codified in Site Reliability Engineering (2016) and The Site Reliability Workbook (2018), specifically the chapters on Service Level Objectives and Practical Alerting. The two-window two-burn-rate alert formula — error_rate > 14.4 × budget over 1h AND error_rate > 6 × budget over 6h pages immediately — comes directly from the SRE workbook and is the form Razorpay, Flipkart, and Hotstar have all converged on after their respective alert-noise-reduction rewrites. The numbers are not arbitrary: 14.4 = 2% of a 30-day budget burned in 1 hour; 6 = 5% of a 30-day budget burned in 6 hours. Part 10 of this curriculum derives the formula step by step.
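Both constants fall straight out of the question "what fraction of the budget, burned over what window". A two-line check:

```python
# Burn rate needed to consume a given fraction of a 30-day error budget
# within a given window: rate = fraction × (30 days / window).
def rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction * (30 * 24) / window_hours

print(f"1h page condition: {rate(0.02, 1):.1f}x burn")   # 14.4x
print(f"6h page condition: {rate(0.05, 6):.1f}x burn")   # 6.0x
```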
Reproduce this on your laptop
docker run -d --name pyroscope -p 4040:4040 grafana/pyroscope
python3 -m venv .venv && source .venv/bin/activate
pip install pyroscope-io flask requests duckdb pandas
python3 pyroscope_demo.py # then open http://localhost:4040
python3 events_demo.py # writes events.duckdb, prints the dimensional p99
After running both scripts, you will have produced one continuous-profile flamegraph and one column-oriented event table with 50,000 wide rows, each queryable by 12 dimensions. The flamegraph will visibly show re.Pattern.match as the widest rectangle in the graph — that is the bug a metric and a trace both miss. The event-table query will return per-merchant × region × app_version p99 values that no Prometheus deployment with reasonable cardinality limits could compute.
Where this leads next
The next chapters in Part 1 deepen the honest-framing thread this chapter started.
- Cardinality: the master variable — chapter 3, where we make the cardinality budget for metrics, logs, traces, profiles, and events concrete and per-pillar, then walk through the Flipkart Big Billion Days incident where adding pincode as a Prometheus label exploded the active-series count by 28,000×.
- The observability-vs-monitoring distinction — chapter 4, the line between "I can read my dashboards" and "I can answer a new question". This is where Charity Majors's definition lands rigorously, with worked examples from Indian payment systems.
- What "events" really are: the unit of telemetry — chapter 5 in a later section, taking the wide-row idea from this chapter and showing the Honeycomb / ClickHouse / BigQuery design choices that make it tractable at scale.
Part 14 of this curriculum, fourteen chapters from now, is dedicated to continuous profiling — Pyroscope internals, on-CPU vs off-CPU profiling, sampling rate vs overhead, and the production patterns Indian teams have converged on. Part 10 (SLOs) and Part 11 (alerting) together cover the burn-rate formulation that makes the SLO layer a real on-call discipline rather than a vocabulary upgrade.
References
- Distributed Systems Observability — Cindy Sridharan, O'Reilly 2018. Chapter 3 codifies the three-pillar framing this chapter critiques; reading it remains the right starting point.
- Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda, O'Reilly 2022. Chapters 1–2 argue that events are the underlying primitive and that "three pillars" was always incomplete; this chapter's framing is the synthesis.
- Site Reliability Engineering — Google, the canonical SLO chapter. Multi-window burn-rate alert math originates here.
- Pyroscope: continuous profiling for distributed systems — Grafana Labs documentation. The reference for the production-grade fourth-pillar implementation used in the worked example.
- Honeycomb: how we think about observability — Charity Majors's manifesto. The events-as-primitive thesis that gives us the fifth pillar.
- Brendan Gregg, Flame Graphs — the canonical flamegraph documentation. The visualisation every continuous profiler renders is descended from this work.
- Razorpay engineering: scaling payments observability — the engineering blog where the 1,200-alerts-per-day to ~70-alerts-per-day rewrite is documented; the case study Part 11 of this curriculum returns to repeatedly.
- Metrics, logs, traces: what each is good at — the previous chapter in this curriculum, which establishes the three-pillar baseline this one extends.