Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The four golden signals

It is 19:47 IST on a Tuesday and Karan, an SRE at a hypothetical Bengaluru-based food-delivery platform that we will call ZayikaGo, is staring at a Grafana dashboard with eighty-two panels on it. The dashboard is titled "checkout-api — full health", and it has a heatmap of JVM garbage-collection pauses, a histogram of TCP retransmits, a stacked area chart of thread-pool occupancy, a node_exporter panel for every one of forty-seven Kubernetes nodes, and a single small panel in the bottom-right corner labelled "5xx rate". The 5xx panel is dark red. It has been dark red for nine minutes. Nobody noticed because the eye lands on the heatmap first and the heatmap is calm. Customers have been getting "Order failed, please try again" since 19:38. By 19:51 the founder is in the war-room channel. By 20:10 the on-call team has rolled back the deploy that caused it. The post-mortem will conclude that the dashboard had every signal an engineer could want and gave no prominence to the four signals that would have let anyone find the problem in under sixty seconds.

The four golden signals — latency, traffic, errors, saturation — are the four user-visible numbers that, projected from any request-handling service, tell you whether the service is fine, broken, getting hammered, or about to fall over. The discipline is not adding them to a dashboard; it is removing the eighty-two other panels that drown them. A dashboard that shows only the four golden signals at the top, larger than everything else, produces faster diagnoses than one that shows everything an engineer might want, because the four signals are what the user feels and the rest are explanations of why.

What "golden" means and why these four

The four signals come from chapter 6 of Google's Site Reliability Engineering book, written by Rob Ewaschuk in 2016, and they are the SRE-team-lead distillation of a decade of running web services at planet scale. Ewaschuk's claim was sharper than the framing usually credited to him: if you can monitor only four things on a user-facing system, monitor these four, because every other signal is either a leading indicator of one of these four or a downstream consequence of one of them. The framing is not "here are good signals to add to your dashboard"; it is "here are the four signals that are load-bearing, and if your dashboard hides them under fifty panels of CPU and GC graphs, your dashboard is wrong".

Latency is how long a request takes when it succeeds. Traffic is how many requests are arriving — typically requests-per-second, or for non-request-driven systems, throughput in items-per-second. Errors is the rate of failed requests, broken down by failure mode (5xx, business-logic-rejected, timed-out without an HTTP code at all). Saturation is how full the service's most-constrained resource is, expressed as a fraction of capacity — the queue depth at 80%, the connection pool at 95%, the disk I/O at 70% utilisation. Each of these four answers a question a user, a product manager, or a CFO asks in plain language. Latency answers "is the app slow?". Traffic answers "is anyone using it?". Errors answers "is it broken?". Saturation answers "is it about to break?".

Illustrative — the four signals as Ewaschuk framed them in the Google SRE book, with the user-facing question each answers. Each signal is independent: latency does not subsume errors (a fast 500 is still a 500), traffic does not subsume saturation (a service can saturate at low traffic if a downstream is slow), and saturation does not subsume latency (a half-saturated thread pool can still produce p99 spikes if the work distribution is bad).

The "user-visible" qualifier is doing real work. The signals are framed from the user's perspective because the user is the only consumer whose experience the service exists to serve, and any signal that does not connect to the user's experience is, at best, a diagnostic input — useful for the engineer who has already noticed something is wrong, useless for the engineer who is trying to notice. JVM heap usage is a diagnostic input; the user does not feel heap usage directly. They feel the latency spike that happens when GC pauses lengthen because the heap is too full. The golden signals discipline says: alert on the latency spike, treat the heap as a drill-down. The eighty-two panels on Karan's dashboard reverse the relationship — they showcase the diagnostic inputs and bury the user-visible signal — and that reversal is the bug.

Why "user-visible" rules out CPU and memory as primary signals: a service running at 95% CPU is not a problem if its latency is fine and its error rate is zero — the user does not care about CPU, they care about whether their order went through and how long it took. CPU is at best a leading indicator of saturation, but only when the CPU happens to be the constrained resource. On a service whose constrained resource is the database connection pool, CPU at 95% is irrelevant and the connection-pool occupancy at 90% is the real saturation signal. Naming "saturation" instead of "CPU" forces you to identify which resource is actually the bottleneck per service, rather than reflexively graphing every infrastructure metric.
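
A minimal sketch of that idea, assuming each service declares which resources actually gate its requests. The resource names, the readings, and the saturation_ratio helper are illustrative assumptions, not part of any library.

```python
# Hypothetical readings for one service: resource -> (in use, capacity).
readings = {
    "cpu_cores": (7.6, 8.0),          # 95% busy, but not what requests wait on
    "db_connection_pool": (45, 50),   # 90% occupied, the real constraint here
}

# Assumption: the service owner names the resources that gate request handling.
constrained = ["db_connection_pool"]


def saturation_ratio(readings, constrained):
    """Return the highest used/capacity ratio among the declared constraints."""
    return max(readings[name][0] / readings[name][1] for name in constrained)


print(f"saturation = {saturation_ratio(readings, constrained):.2f}")
```

The design choice is that the list of constrained resources is declared per service rather than inferred, so a busy but non-limiting CPU does not masquerade as the saturation signal.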

There is a reason the list has exactly four entries and not five or three. Three is too few — Brendan Gregg's USE method (Utilisation, Saturation, Errors) is great for resources but misses traffic, which is what tells you a silent failure has occurred when requests stop arriving. The Tom Wilkie RED method (Rate, Errors, Duration) covers requests but misses saturation, which is the leading indicator that distinguishes "we are fine" from "we are about to not be fine in 90 seconds". The four golden signals are the union — they cover the request side (latency, traffic, errors) and the resource side (saturation), and the union is irreducible. Each subset misses something the others catch. Part 12 of this curriculum will spend a chapter on each method's strengths; this chapter is about why the four together are the floor.

Five is too many for a different reason. Each additional signal shrinks the attention each panel receives within a leadership dashboard's 15-second budget: four panels get 3.75 seconds each, while five panels get 3 seconds each, which is below the threshold at which a non-technical reader can form a stable interpretation of any panel. The book's authors did not arrive at four by counting; they arrived at four by removing every signal that did not earn its place against the others. Cost-per-request was considered and rejected because it is a downstream consequence of traffic and infrastructure, not a user-facing signal. Cache hit rate was considered and rejected because it is a downstream explanation of latency. Concurrency was considered and rejected because it is the same as saturation expressed differently. Goodput (successfully completed business transactions per second) was the closest contender for a fifth signal, and was ultimately folded into errors as (traffic - errors). Four is the largest set where each entry pulls its weight at the leadership boundary; adding a fifth dilutes the others without adding diagnostic capacity that the four together do not already provide.

The signals are also chosen for independence — moving one does not necessarily move the others, which is what makes them useful for diagnosis. A latency spike with no error spike usually points to a slow downstream or a thread-pool contention. An error spike with no latency change usually points to a deploy of broken business logic that fails fast. A traffic drop with no other signal moving is the worst kind — it usually points to a load balancer health-check that just decided your service is unhealthy and stopped sending traffic. A saturation rise with no immediate latency change is the calm before the storm — the queue is filling, the pool is filling, and you have ninety seconds before requests start timing out. The independence means each of the four panels is a different question the dashboard is asking.

Independence is not the same as orthogonality, and the difference matters. The four signals are operationally independent — knowing three of them does not let you predict the fourth — but they are causally coupled — moving one will eventually move the others if the underlying problem is not fixed. A saturation rise that goes unattended will eventually produce a latency spike (queues fill, requests wait); a latency spike that goes unattended will eventually produce errors (timeouts fire); an error spike that goes unattended will eventually produce a traffic drop (clients give up and stop retrying). The temporal sequence — saturation, then latency, then errors, then traffic — is the cascade order, and a four-signal dashboard read as a story across time tells you both what is happening now and how far through the cascade the service is. A team that sees only a saturation rise has 90 seconds; a team that sees latency and errors moving together has perhaps 30 seconds; a team that sees traffic dropping has already lost the user. The cascade order is what makes saturation the most diagnostically valuable of the four — it is the earliest signal, the only one that buys time.
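
The cascade order can be read mechanically. The sketch below is an illustration under assumptions: the function name and the idea of passing in a set of "abnormal" signal names are hypothetical, and a lone traffic drop (the load-balancer case described earlier) is not a cascade at all, so this only applies when the signals are being read as a sequence.

```python
# Cascade order as described in the text: earliest warning first.
CASCADE_ORDER = ["saturation", "latency", "errors", "traffic"]


def cascade_stage(abnormal):
    """Return the furthest-along abnormal signal in the cascade, or 'healthy'."""
    stage = "healthy"
    for signal in CASCADE_ORDER:
        if signal in abnormal:
            stage = signal  # keep overwriting so the latest stage wins
    return stage


print(cascade_stage({"saturation"}))                       # earliest stage
print(cascade_stage({"saturation", "latency", "errors"}))  # well into the cascade
```

The further along the returned stage is, the less warning remains; a result of "saturation" is the only one that still buys time.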

Before the SRE book named four signals, most production-engineering teams at Google, Amazon, and Yahoo were graphing somewhere between fifteen and forty signals on their dashboards — a mix of CPU, memory, GC pauses, network throughput, error rates, request rates, queue depths, thread counts, and business-specific health metrics. The dashboards worked in the sense that all the data was visible; they failed in the sense that no two engineers could agree on which panel to look at first when something went wrong, and post-mortems consistently identified "we missed it because we were looking at the wrong panel" as a contributing factor. The 2014–2016 internal SRE-book working group's contribution was not to invent new signals — every one of the four was already being measured at every large web operation — but to force a ranking. The ranking made the four "golden" because everything else was now explicitly secondary, and the secondary-status was the thing that bought the dashboard reading-time it had not previously had.

The ranking also produced a useful side-effect: it made dashboards teachable. Before the four-signal canon, on-boarding a new SRE to a service required walking them through the team's idiosyncratic dashboard panel-by-panel — often a 90-minute session covering every metric the team had decided was worth graphing. After the four-signal canon, the on-boarding session compressed to 15 minutes — "these four panels are the user-visible signals; everything below is a drill-down; here is which drill-down corresponds to which signal" — and the new SRE was operationally productive on day one rather than week three. The pedagogical tractability of the four-signal frame is a property the underlying signals do not have on their own; it emerges only from the act of ranking and naming.

Measuring all four from a Python service in under fifty lines

The signals are abstract until you emit them. Here is the smallest realistic Python service that exposes all four, instrumented with prometheus-client. The service is a stand-in for a checkout-API endpoint; the load generator hits it for thirty seconds with a 5% synthetic-error rate; the /metrics scrape at the end shows what the dashboard would see.

# golden_signals_demo.py — a Flask service exposing all four signals
# pip install flask prometheus-client
import random, time
from flask import Flask, jsonify, abort
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# 1. TRAFFIC + 3. ERRORS — a single counter, broken down by status
REQS = Counter("checkout_requests_total",
               "checkout-api requests by status and method",
               ["method", "status"])

# 2. LATENCY — a histogram of successful-request duration
LAT = Histogram("checkout_request_duration_seconds",
                "checkout-api success latency",
                ["method"],
                buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5])

# 4. SATURATION — gauge for in-flight requests vs the worker-pool capacity
WORKER_POOL_CAP = 32
INFLIGHT = Gauge("checkout_inflight_requests",
                 "requests currently being served")
SATURATION = Gauge("checkout_saturation_ratio",
                   "in-flight / worker-pool-capacity, 0..1")

@app.route("/checkout", methods=["POST"])
def checkout():
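    INFLIGHT.inc()
    # _value is a private prometheus-client attribute, read here only for brevity;
    # a production service would keep its own in-flight count instead.
    SATURATION.set(INFLIGHT._value.get() / WORKER_POOL_CAP)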
    start = time.perf_counter()
    try:
        if random.random() < 0.05:           # 5% synthetic 500s
            REQS.labels("POST", "500").inc()
            abort(500)
        time.sleep(random.expovariate(1/0.04))  # mean 40ms latency
        REQS.labels("POST", "200").inc()
        LAT.labels("POST").observe(time.perf_counter() - start)
        return jsonify(ok=True)
    finally:
        INFLIGHT.dec()
        SATURATION.set(INFLIGHT._value.get() / WORKER_POOL_CAP)

if __name__ == "__main__":
    start_http_server(8001)                   # /metrics on :8001
    app.run(port=8000, threaded=True)         # /checkout on :8000

After thirty seconds of wrk2 -t4 -c32 -R200 -d30s -s post.lua http://localhost:8000/checkout, scraping /metrics shows:

# Sample run output (curl localhost:8001/metrics | grep checkout_)
checkout_requests_total{method="POST",status="200"} 5687
checkout_requests_total{method="POST",status="500"} 313
checkout_request_duration_seconds_bucket{method="POST",le="0.05"} 4612
checkout_request_duration_seconds_bucket{method="POST",le="0.1"} 5421
checkout_request_duration_seconds_bucket{method="POST",le="0.25"} 5681
checkout_request_duration_seconds_bucket{method="POST",le="+Inf"} 5687
checkout_request_duration_seconds_count{method="POST"} 5687
checkout_request_duration_seconds_sum{method="POST"} 234.187
checkout_inflight_requests 27.0
checkout_saturation_ratio 0.84375
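
As a quick sanity check, three of the four signals can be recovered from the counters in the sample scrape by plain arithmetic. This is a sketch that assumes the counters started at zero and the load ran for exactly thirty seconds; the variable names are illustrative.

```python
# Values copied from the sample scrape.
ok_total = 5687
err_total = 313
duration_sum_s = 234.187
run_seconds = 30  # assumption: counters started at zero, run lasted 30 s

traffic_rps = (ok_total + err_total) / run_seconds
error_ratio = err_total / (ok_total + err_total)
mean_success_latency_s = duration_sum_s / ok_total

print(f"traffic  = {traffic_rps:.1f} req/s")
print(f"errors   = {error_ratio:.2%}")
print(f"mean lat = {mean_success_latency_s * 1000:.1f} ms")
```

The results land close to the configured load: roughly 200 requests per second, an error ratio a little above the 5% synthetic rate, and a mean success latency near the 40ms mean of the simulated sleep.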

Walking through the load-bearing lines:

REQS = Counter(... ["method", "status"]) is the single counter that produces both traffic (the sum across statuses) and errors (status="500" divided by the total). One counter, two signals: a common convention is to avoid splitting request-rate and error-rate into separate metrics, because the error rate is their ratio and you want PromQL to compute it without joining two differently named series.

LAT = Histogram(...) is the latency signal, but only on success: the LAT.observe() call sits inside the success branch, and the 500 branch never observes a duration. Why latency is success-only: a 500 that fails in 2ms is not a "fast request"; it is a fast failure. Mixing failures into the latency histogram makes p99 look better as your error rate climbs, which is the inverse of the truth. You want the p99 panel to deteriorate as the service breaks, not improve. Mixing the two is the mistake every team makes once and then never again.

INFLIGHT.inc() ... INFLIGHT.dec() in the try / finally is the saturation signal: the gauge tracks how many requests are currently in flight, and SATURATION divides by capacity to produce a 0..1 ratio that PromQL can alert on with a simple threshold.

buckets=[0.005, 0.01, ..., 5] are the histogram bucket boundaries; ten boundaries is a workable number for p99 estimation, and they are roughly logarithmic so the interpolation error stays bounded across the three orders of magnitude they span (5ms to 5s).

The wrk2 (not wrk) call is critical: the -R200 flag specifies a constant arrival rate of 200 requests per second, which avoids the coordinated-omission bias that wrk's closed-loop default produces, where a slow response delays the next request and the slow period ends up under-sampled.
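
The success-only rule is easy to check with made-up numbers. The sketch below uses a nearest-rank percentile over hypothetical durations (not measurements from the service above) to show p99 dropping when fast failures are mixed in.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of durations in seconds."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


# Hypothetical data: 100 successes with a slow tail, plus 900 fast failures.
successes = [0.040] * 95 + [0.500] * 5
fast_failures = [0.002] * 900

success_only = p99(successes)
mixed = p99(successes + fast_failures)

print(f"p99, successes only : {success_only * 1000:.0f} ms")
print(f"p99, failures mixed : {mixed * 1000:.0f} ms")
```

With failures included, the slow tail is pushed past the 99th percentile by sheer volume of fast 500s, so the panel appears to improve exactly while the service is breaking.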

A subtle property of the instrumentation: the four signals come from three instruments (one counter, one histogram, one gauge) plus a derived gauge, not from four unrelated metrics. Traffic and errors share the checkout_requests_total counter, distinguished only by the status label; latency uses the checkout_request_duration_seconds histogram with LAT.observe() called only on success; saturation uses the checkout_inflight_requests gauge with the checkout_saturation_ratio derived gauge tracking it in real time. The compactness is not aesthetic; it is operational. A service that emits the four signals from four independent metrics has four chances to drift, four chances for a renamed label to break a panel, four chances for a metric to be silently dropped. A service that emits them from a tightly coupled set of three instruments has fewer such chances, and the coupling is enforceable at code-review time. Most production services that have lived for two years end up with the compact instrumentation pattern even if they did not start there, because the alternative is operationally too fragile.

The PromQL queries that turn these into the four golden-signal panels are:

# 1. Traffic — requests per second
sum(rate(checkout_requests_total[1m]))

# 2. Latency — p99 of successful requests
histogram_quantile(0.99,
  sum(rate(checkout_request_duration_seconds_bucket[1m])) by (le))

# 3. Errors — error rate, expressed as a fraction
sum(rate(checkout_requests_total{status=~"5.."}[1m])) /
sum(rate(checkout_requests_total[1m]))

# 4. Saturation — most-recent ratio of in-flight to capacity
checkout_saturation_ratio

Why these four PromQL queries are short and the rest of your dashboard is long: a service's user-visible health is genuinely a four-dimensional summary. Everything else — the GC heatmap, the per-pod CPU graph, the connection-pool histogram by upstream — is a drill-down the engineer reaches for after one of the four panels turns red. The hierarchy of dashboards (golden-signals at the top, drill-downs one click away) is what makes the four signals load-bearing rather than just four-of-many.
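
To make the latency query less of a black box, here is a simplified sketch of the linear interpolation that histogram_quantile is documented to perform, applied to the buckets visible in the sample scrape. It is an illustration, not the Prometheus implementation; the Python function is hypothetical and it assumes the lowest bucket starts at zero.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0  # assumption: first bucket starts at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Only the buckets shown in the sample scrape are used here.
sample_buckets = [(0.05, 4612), (0.1, 5421), (0.25, 5681), (math.inf, 5687)]
p99 = histogram_quantile(0.99, sample_buckets)
print(f"estimated p99 = {p99 * 1000:.0f} ms")
```

The estimate lands inside the 0.1s to 0.25s bucket, which is why bucket boundaries matter: the answer can never be more precise than the bucket the target rank falls into.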

Why each signal has a failure mode the other three miss

The independence claim earlier deserves a real example. Consider four scenarios, each of which is a real-shape outage that one of the four signals catches and the other three miss.

Scenario one — the stuck downstream. Riya, an SRE at a hypothetical Mumbai-based travel booking site we will call YatraNow, is paged at 11:30 IST. The payments-api service has a p99 latency of 4.8 seconds, up from a baseline of 240ms. Error rate is zero — every request is succeeding. Traffic is steady at 800 rps. Saturation is at 35%, well within budget. The latency signal is the only one that moved. The cause is that NPCI's UPI endpoint, which payments-api calls synchronously for transaction validation, has been slow for the last 14 minutes; YatraNow's service is patiently waiting and eventually getting back valid responses. A dashboard without the latency panel — one that shows only error rate, traffic, and saturation — would have shown three calm panels and missed the user-facing 4.8-second checkout. The latency panel is the only one that catches "the service is technically working but the user is suffering".

Scenario two — the silent traffic drop. At 14:30 IST on a Wednesday, the merchant-onboarding service of a hypothetical Pune-based payments aggregator we will call NeoSetu drops from 120 rps to 4 rps over the course of three minutes. Latency is unchanged: the four remaining requests per second complete in 80ms. Error rate is zero: none of the four are failing. Saturation drops to 2% because there is almost no work. The traffic signal is the only one that moved. The cause is that the public-facing AWS Application Load Balancer in front of the service had its health-check timeout reduced from 5s to 2s as part of an unrelated terraform change, and on a slow-cold-start endpoint, most pods now fail health-checks and get pulled from rotation; the load balancer is silently rejecting roughly 97% of incoming requests (116 of the original 120 rps) at the edge before they ever hit the service's metrics. The service's own dashboard sees nothing wrong because, from its perspective, the requests it does serve are healthy. The traffic-drop alert is the only signal that catches the upstream-rejected-everything failure mode.

Scenario three — the fast-failing deploy. Karan, the ZayikaGo SRE from the lead, deploys a refactor of checkout-api at 19:38. The refactor introduces a null-pointer bug on a code path that fires for users without a saved address, about 8% of orders. Those 8% of requests now return 500 in 12ms, much faster than the typical 40ms successful path. Latency p99 actually improves from 110ms to 92ms: the failing requests never reach the success-only histogram, so whatever slow requests that segment used to contribute to the tail simply vanish from it. Traffic is steady. Saturation is fine. The error rate jumps from 0.1% baseline to 8%. The error-rate signal is the only one that catches "the deploy is broken in a specific code path". A dashboard showing only latency, traffic, and saturation would have shown a beautifully improving service while users in the affected segment churned away from the app.

Scenario four — the queue filling silently. Aditi, an SRE at a hypothetical Hyderabad-based logistics platform we will call CargoPath, watches the route-optimiser service drift from 60% queue occupancy to 92% queue occupancy over a 20-minute window. Latency is steady at p99 = 480ms — well within the 2-second SLO. Error rate is zero. Traffic is at 200 rps, baseline. Only saturation moved. The cause is that one downstream rerouting service is slowly degrading, taking longer per item, but each individual response still returns within the 2-second timeout; the queue is filling because the throughput-per-worker is dropping. Without the saturation panel, the team would not see anything wrong until queue occupancy hit 100%, at which point new requests would start timing out and latency and error rate would both spike together. The saturation signal is the leading indicator that buys 90 seconds of warning before the failure is visible to the user. With it, Aditi can re-route the optimiser to a fallback before the timeout cliff; without it, the dashboard is a calm green field until it is suddenly red.

The four scenarios are not edge cases; each maps to a category of outage that a real Indian-scale service team will see multiple times per quarter. The stuck-downstream pattern is what NPCI slowness produces on every UPI-dependent service in the country; the silent-traffic-drop pattern is what every team behind an Application Load Balancer encounters at least once when terraform changes touch health-check timings; the fast-failing-deploy pattern is what every team that ships features daily produces on roughly 1–2% of deploys; the queue-filling-silently pattern is what every queue-driven service produces when one of its downstreams degrades gradually rather than failing cleanly. A team that has lived through all four — most teams reach this milestone within their first eighteen months at scale — internalises the four signals not because the SRE book recommended them but because they have personally watched each subset of three lead to a missed outage. The list is canonical because the failure modes are canonical.

Illustrative — not measured data. Each row is a distinct failure mode where exactly one of the four signals goes red. The ALB-health-check row is the most counter-intuitive: traffic dropping is bad, but a service watching only its own metrics misses it because, from inside, the few requests that arrive look healthy. The four signals together force you to monitor the service's interface to the world (latency, traffic, errors) and its internal headroom (saturation), which is what makes them irreducible.

The pattern across the four scenarios is that each signal answers a different physical question. Latency answers "what is the user feeling on a successful request?". Traffic answers "is the service even being reached?". Errors answers "what fraction of attempts are not making it through?". Saturation answers "how much headroom is left before the next failure?". Drop any one of the four and a class of outage becomes invisible until it cascades into one of the others — which usually means until the user has already felt it.

A common failure mode in teams that have added the four-signal panels but not internalised the four-signal discipline is to treat the four panels as a status indicator rather than a diagnostic frame. The four panels are green; therefore, the service is fine. This reading misses the cascade structure entirely — saturation can be at 78% (green by the team's threshold) while the queue is filling toward 100% and the team has minutes rather than hours of headroom. The four signals are diagnostic trajectories, not point-in-time statuses; reading them as statuses produces the same false comfort as reading a thermometer that is rising at 0.5°C per minute and concluding "the temperature is fine because it is currently 38°C". A team that has read the SRE book chapter and not also internalised that the panels are read as derivatives — the rate of change matters as much as the absolute level — is operating at the next-floor-down of the discipline.
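
Reading a panel as a derivative can be made concrete with a linear extrapolation. The sketch below fits a least-squares slope to recent saturation samples and estimates the time left before the resource is full. The sample values are hypothetical, and real queues rarely fill linearly, so treat the result as a rough bound rather than a prediction.

```python
def seconds_until_full(samples, capacity=1.0):
    """samples: list of (t_seconds, ratio). Returns None if not rising."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return None  # all samples share one timestamp: no slope to fit
    slope = sum((t - mean_t) * (r - mean_r) for t, r in samples) / den
    if slope <= 0:
        return None  # flat or falling: no exhaustion in sight
    last_ratio = samples[-1][1]
    return max(0.0, (capacity - last_ratio) / slope)


# Hypothetical samples: 78% looks green, but it is climbing 4 points per 30 s.
recent = [(0, 0.70), (30, 0.74), (60, 0.78)]
eta = seconds_until_full(recent)
print(f"time to 100%: about {eta:.0f} s")
```

A point-in-time reading of 78% and a trajectory that reaches 100% in under three minutes describe the same panel; only the second one tells the on-call engineer how urgent it is.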

How the four signals shape the rest of the dashboard

The eighty-two-panel "checkout-api — full health" dashboard is not wrong because it has eighty-two panels. It is wrong because the four golden-signal panels are not at the top, are not visually larger than everything else, and do not anchor the reader's first 15 seconds. The fix is structural: a leadership-and-on-call dashboard puts the four signals at the top in a 4×1 row, each panel large enough to read from across the room, with the rest of the eighty-two panels arranged below as drill-downs and grouped by which of the four signals they explain. CPU, memory, GC pause durations, and thread-pool depth go under saturation. Per-endpoint latency histograms and downstream-call-duration heatmaps go under latency. 5xx breakdowns by error code, business-logic-rejection rates, and timeout counts go under errors. Per-pod request rates and per-region traffic breakdowns go under traffic.

The structural choice has a measurable consequence. Karan's team rebuilt the checkout-api dashboard after the 19:47 incident with the four signals at the top and the rest below; the post-deploy verification step now consists of "stare at the four panels for 30 seconds; if any of them moves more than 10% from its 24-hour baseline, roll back". The new dashboard's mean time to detect a bad deploy dropped from nine minutes to forty seconds across the next four post-mortem-eligible deploys, because the four panels were the first thing every on-call engineer's eye landed on. The eighty-two diagnostic panels are still there; they live one click away, on a "checkout-api — drill-down" dashboard linked from each golden-signal panel. The drill-downs are not lost; they are simply not the first thing a human reads.
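
The 10% rule can also be written down as a check. This is a sketch under assumptions: the function name, the dictionary keys, and the traffic and saturation values are hypothetical, while the latency and error figures echo the fast-failing-deploy scenario.

```python
def moved_signals(baseline, current, tolerance=0.10):
    """Return the signals that moved more than `tolerance` from baseline."""
    moved = []
    for name, base in baseline.items():
        now = current[name]
        if base == 0:
            changed = now != 0  # any movement off a zero baseline counts
        else:
            changed = abs(now - base) / abs(base) > tolerance
        if changed:
            moved.append(name)
    return moved


# 24-hour baseline vs. the reading 30 seconds after a deploy (hypothetical).
baseline = {"latency_p99_s": 0.110, "traffic_rps": 200.0,
            "error_ratio": 0.001, "saturation": 0.60}
current = {"latency_p99_s": 0.092, "traffic_rps": 198.0,
           "error_ratio": 0.080, "saturation": 0.58}

moved = moved_signals(baseline, current)
print("roll back" if moved else "hold", moved)
```

Note that the check is symmetric: a latency figure that improves by more than 10% right after a deploy is flagged too, which is deliberate, since a sudden improvement can be the footprint of requests failing fast.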

The structural choice also forces a useful question at panel-creation time: which of the four does this panel explain? If a proposed panel does not explain one of the four, it does not belong on the user-facing health dashboard. A graph of inbound TCP retransmits is a fine drill-down under latency (because retransmits cause latency spikes via TCP retransmit timeouts) but is not a fine top-level panel because no user reads "TCP retransmits" and decides whether the app is broken. A panel showing the JVM old-generation occupancy is a fine drill-down under saturation (because old-gen filling triggers full GC pauses, which trigger latency spikes) but is not a fine top-level panel because the user does not feel old-gen occupancy. Forcing every panel to declare its parent signal eliminates roughly 60–70% of typical-service-dashboard panels from the user-facing layer, which is exactly the reduction needed to bring the dashboard within the 15-second leadership-reading budget covered in the previous chapter.
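
One way to enforce the "declare your parent signal" rule is to make it data. The sketch below is hypothetical: the panel titles follow the examples in the text except the last one, which is a made-up panel with no parent, and split_panels is an illustrative helper rather than a Grafana feature.

```python
GOLDEN_SIGNALS = ("latency", "traffic", "errors", "saturation")

# Panel title -> the golden signal it explains (None if it explains none).
proposed_panels = {
    "Inbound TCP retransmits": "latency",
    "JVM old-gen occupancy": "saturation",
    "5xx breakdown by error code": "errors",
    "Per-region traffic breakdown": "traffic",
    "CI build duration": None,  # made-up example with no parent signal
}


def split_panels(panels):
    """Group panels under their parent signal; list those with no parent."""
    drill_downs = {signal: [] for signal in GOLDEN_SIGNALS}
    rejected = []
    for title, parent in panels.items():
        if parent in drill_downs:
            drill_downs[parent].append(title)
        else:
            rejected.append(title)
    return drill_downs, rejected


drill_downs, rejected = split_panels(proposed_panels)
print("not for the health dashboard:", rejected)
```

A check like this could run in code review for dashboard definitions, so a panel without a parent signal is questioned before it ships rather than after it has crowded the top row.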

A third consequence — less visible but more important — is that the four signals make dashboard reuse across services possible. A four-panel top row (latency, traffic, errors, saturation) is the same layout for every service: the metric names differ, the thresholds differ, but the layout, the panel ordering, and the colour conventions are identical. An on-call engineer who knows how to read the payments-api dashboard already knows how to read the inventory-api dashboard, the notifications-api dashboard, and any service the team will ship in the next two years. Without the four-signal convention, every service's dashboard is a bespoke artefact built by whoever owned the service when it shipped, and on-call engineers must re-learn the dashboard for each service the night they get paged. The convention is a coordination mechanism — it is what lets a 40-person SRE team support 200 services without 200 distinct mental models.

The reuse property compounds at the platform-team level. A platform team that ships a service-template — Helm chart, Terraform module, Grafana dashboard JSON — can bake the four-signal layout into the template such that every new service inherits a four-signal dashboard on day one without the service's owner having to think about it. The owner overrides the metric names, the SLO thresholds, and the per-tenant labels; the layout and the alerting taxonomy are inherited. A 200-service estate with this discipline produces 200 four-signal dashboards that look identical to an on-call engineer; the same estate without it produces 200 idiosyncratic dashboards that take a quarter to learn. The cost of the discipline is the platform team's effort to maintain the template; the benefit is a one-day-to-productivity onboarding curve that scales with the engineering team rather than the service count. The four signals are a small list, but the operational leverage is large because every additional service inherits the same reading habit, and the reading habit is what produces fast diagnoses at 02:47 IST when the page fires.

Another consequence is alerting consolidation. The four signals correspond to the four classes of alerts that the SRE book recommends on user-facing services: latency-SLO burn, traffic-anomaly, error-budget burn, and saturation-threshold. A dashboard organised around the four signals naturally produces an alert taxonomy with at most four primary alerts per service, plus a small set of drill-down alerts that fire only as severity-2 informational pages. Teams that organise their dashboards around CPU and memory tend to produce thirty alerts per service, most of which are noise. The four-signal discipline imposes alert hygiene by anchoring it to user-visible behaviour rather than infrastructure plumbing.

Why the four-signal alert count caps at four primaries plus a handful of secondaries: each of the four signals has exactly one primary alerting question — is latency outside SLO? is traffic outside its baseline window? is the error rate burning the budget faster than the budget allows? is saturation above the head-room threshold? Each question takes exactly one alerting rule to express in PromQL. The CPU-and-memory school produces dozens of alerts because each infrastructure metric gets its own threshold, with no enforced linkage to user experience — cpu > 80%, memory > 85%, disk_io > 70%, tcp_retransmits > 100/s, repeated per service per region. The four-signal school produces four alerts because the user has only four questions, and the engineering discipline aligns with the user's mental model rather than fragmenting it across the resource graph.
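
The error-budget question has a compact arithmetic form. The sketch below uses the usual burn-rate definition (observed error ratio divided by the error budget the SLO allows); the 99.9% SLO target is an assumption chosen for illustration, and the 8% error ratio echoes the fast-failing-deploy scenario.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / budget


# Assumption: a 99.9% availability SLO; 8% of requests currently failing.
rate = burn_rate(error_ratio=0.08, slo_target=0.999)
print(f"burning the error budget at about {rate:.0f}x the sustainable rate")
```

A burn rate of 1 means the budget lasts exactly the SLO window; anything far above that is the single alerting question for the errors signal, expressed as one number.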

A fourth consequence is reviewable post-mortems. A post-mortem that says "the service degraded at 19:38, was detected at 19:47, and was rolled back at 20:10" without saying which of the four signals moved first, second, and third is incomplete: the diagnostic value of the post-mortem comes from the cascade timeline, and the cascade timeline is denominated in the four signals. The Karan/ZayikaGo post-mortem template that emerged from the 19:47 incident now requires every post-mortem to include a four-signal trace. For a cascading incident, such a trace might read: at minute T+0, saturation rose from 60% to 84%; at T+2, latency p99 climbed from 110ms to 480ms; at T+4, error rate climbed from 0.1% to 8%; at T+6, traffic dropped 12% as clients gave up. The trace tells the team, in one paragraph, both what happened and how much warning the dashboard could have given if anyone had been looking at the right panel. Without the four signals as the unit of post-mortem analysis, the timeline becomes a narrative of system events ("the deploy went out", "the new code path fired", "the queue filled") that does not map cleanly onto what the user felt; with them, the timeline is the user-experience story translated into SRE language, and the lessons it produces are about dashboard ergonomics rather than about specific code paths.

A fifth consequence, often unstated, is that the four signals shape how the team talks about the service in meetings. A weekly service-health review that walks through latency, traffic, errors, and saturation in that order has a four-bucket structure that maps directly onto the four panels and onto the four user-facing questions. A review that has no canonical structure becomes a discursive recap of whatever incidents happened that week, with no shared frame for distinguishing trends from one-offs. The four-signal frame, used as the meeting's spine, lets the team observe trends across weeks — "saturation has been creeping up over the last three weeks even though the other three signals are fine; that is the kind of thing we want to investigate before it cascades" — that a free-form review would not surface. The four-signal discipline is, in this sense, the engineering team's communication protocol with itself, and the dashboard is the artefact that makes the protocol concrete every time anyone opens it.

Edge cases the four signals do not cover cleanly

The four-signal framing is the floor, not the ceiling. There are categories of service shape where the four signals leave gaps that a thoughtful team patches with a fifth or sixth signal — never by replacing the four, always by extending them. The categories are worth naming because they are where teams that have internalised the four-signal discipline still get surprised.

Streaming and queue-driven services. A Kafka consumer, a CDC pipeline, an event-driven worker: none of these has a request-response shape that maps cleanly onto latency. Their analogue of latency is processing lag, the difference between the wall-clock time a record was produced and the wall-clock time it was processed. A consumer that processes every record in 5ms but is 4 hours behind the producer is, from the user's perspective, broken; the per-record latency signal would say the service is healthy. The right four-signals adaptation is to substitute lag for latency on the latency panel (a time-based lag metric such as kafka_consumergroup_lag_seconds instead of request_duration_seconds; the exact metric name depends on the exporter you run) and to translate the other three signals into the consumer's terms. Traffic becomes records-consumed-per-second; errors becomes deserialisation-failures-per-second; saturation becomes consumer-thread-pool-utilisation. The four user-visible questions still apply, but the answer to "is it slow?" is now "are we behind?", which is a different question with the same shape.
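The gap between per-record latency and processing lag is simple arithmetic on three timestamps. The sketch below assumes each record carries a producer-side timestamp; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def per_record_latency(started: datetime, finished: datetime) -> timedelta:
    """How long the consumer spent on the record."""
    return finished - started


def processing_lag(produced_at: datetime, processed_at: datetime) -> timedelta:
    """How far behind the producer the consumer is, as the user experiences it."""
    return processed_at - produced_at


produced_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
started = produced_at + timedelta(hours=4)       # record waited 4 hours in the topic
finished = started + timedelta(milliseconds=5)   # then took 5 ms to process

latency = per_record_latency(started, finished)
lag = processing_lag(produced_at, finished)
print("per-record latency:", latency)
print("processing lag:    ", lag)
```

The first number is what a request-style latency panel would show; the second is what belongs on the latency panel of a consumer.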

Asynchronous APIs that return 202 Accepted. A service that immediately returns 202 and processes the request in the background has a latency signal that is misleadingly fast — the 202 returns in 8ms regardless of whether the background work succeeded. The four-signal discipline forces you to measure latency on the eventual outcome, not the synchronous response. The Razorpay UPI flow is an example: the client gets a 202 instantly, the actual transaction completes 12 seconds later when NPCI confirms, and the user-visible latency is the 12 seconds, not the 8ms. The right adaptation is to instrument the terminal-state latency — the time from request acceptance to the request reaching its terminal state (SUCCESS, FAILED, TIMEOUT) — as a separate histogram, and use that histogram for the latency panel. The synchronous-response latency is still measured but is shown only in drill-downs, because it is not what the user feels.
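One way to picture terminal-state latency is a small tracker that records two durations per request: the time to the 202 and the time to the terminal state. This is a sketch under stated assumptions: the class, its method names, and the explicit timestamps are hypothetical, and a real service would feed these values into its metrics library's histogram rather than into Python lists.

```python
class AsyncRequestTimer:
    """Tracks synchronous-response latency and terminal-state latency separately."""

    TERMINAL = {"SUCCESS", "FAILED", "TIMEOUT"}

    def __init__(self):
        self._received_at = {}
        self.sync_latencies = []      # drill-down only
        self.terminal_latencies = []  # what the latency panel should show

    def accepted(self, request_id, received_at, responded_at):
        """Called when the 202 has been sent."""
        self._received_at[request_id] = received_at
        self.sync_latencies.append(responded_at - received_at)

    def state_changed(self, request_id, state, at):
        """Called on every background state transition; only terminal ones count."""
        if state not in self.TERMINAL:
            return
        received_at = self._received_at.pop(request_id, None)
        if received_at is not None:
            self.terminal_latencies.append(at - received_at)


timer = AsyncRequestTimer()
timer.accepted("tx-1", received_at=0.0, responded_at=0.008)  # 202 after 8 ms
timer.state_changed("tx-1", "PENDING", at=3.0)               # not terminal, ignored
timer.state_changed("tx-1", "SUCCESS", at=12.0)              # terminal after 12 s
print(timer.sync_latencies, timer.terminal_latencies)
```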

Multi-step user journeys. A checkout flow that spans cart → payment → confirmation does not have one latency to graph; it has three, and the user feels the sum. A four-golden-signals dashboard at the per-service granularity will show three calm panels even when the end-to-end checkout is taking 18 seconds because the user crossed three slow services that each look fine in isolation. The mitigation is a journey-level dashboard above the per-service dashboards, computed via tracing — the trace's root span duration is the journey latency, and the four golden signals project from the journey rather than from any single service. The per-service four-signal dashboards still exist (they are the drill-downs); the journey four-signal dashboard is the leadership-facing surface.

Why journey-level dashboards do not replace per-service dashboards: when a journey-level latency panel goes red, you need the per-service panels to localise which step caused it. The journey panel answers "is the user feeling pain?"; the per-service panels answer "where is the pain coming from?". A team that has only one or only the other gets stuck — only journey-level means you cannot localise; only per-service means you do not know whether a healthy-looking service is actually contributing to a broken user experience.
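The division of labour between the two dashboard levels can be sketched with three span durations. The numbers, the budgets, and the assumption that the three steps run strictly in sequence are all hypothetical.

```python
# Span durations in seconds for one checkout journey (hypothetical values).
spans = {"cart": 5.5, "payment": 7.0, "confirmation": 5.5}

PER_SERVICE_BUDGET_S = 8.0   # each service judged in isolation
JOURNEY_BUDGET_S = 10.0      # what the user is willing to wait end to end

# Journey level: "is the user feeling pain?"
journey_latency = sum(spans.values())
journey_alert = journey_latency > JOURNEY_BUDGET_S

# Per-service level: "where is the pain coming from?"
service_alerts = [name for name, d in spans.items() if d > PER_SERVICE_BUDGET_S]
largest_contributor = max(spans, key=spans.get)

print("journey latency:", journey_latency, "alert:", journey_alert)
print("per-service alerts:", service_alerts)
print("largest contributor:", largest_contributor)
```

No per-service panel fires, yet the journey is 18 seconds; the per-service numbers are still what point at the payment step once the journey panel has gone red.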

Bursty workloads where saturation is meaningless on average. A trading service at Zerodha during the 09:15 IST market open or a ticketing service at IRCTC during the 10:00 IST Tatkal window has a saturation profile that looks calm at minute-granularity averages and is at 100% for the load-bearing 90-second window. The four-signals dashboard on a 1-minute average will miss it; the dashboard needs 5-second resolution on the saturation panel during the burst window. The general fix is to allow per-panel resolution overrides on the four-signal dashboard — saturation at 5s during predictable burst windows, latency at 1m during steady-state — rather than imposing a single dashboard-wide resolution that is wrong for one of the four panels.
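The effect of resolution on a bursty saturation signal can be reproduced with a list of per-second samples. The sketch uses a hypothetical 10-second burst inside one minute, shorter than the windows named above, purely to keep the arithmetic visible.

```python
def window_means(samples, width):
    """Average consecutive windows of `width` samples."""
    means = []
    for i in range(0, len(samples), width):
        window = samples[i:i + width]
        means.append(sum(window) / len(window))
    return means


# One minute of per-second saturation: calm, 10 s pinned at 100%, calm again.
per_second = [0.3] * 25 + [1.0] * 10 + [0.3] * 25

one_minute = window_means(per_second, 60)
five_second = window_means(per_second, 5)

print("1m average:      ", round(one_minute[0], 3))
print("max of 5s means: ", max(five_second))
```

The one-minute panel reports roughly 42% and looks calm; the five-second panel shows the resource pinned at 100% for two consecutive windows.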

Multi-tenant services where one tenant dominates. A shared notifications-api that serves both Razorpay-pattern fintech traffic and Swiggy-pattern delivery traffic has a four-signal dashboard at the service level that hides per-tenant pathology. If Razorpay traffic is 90% of the total and Swiggy traffic is 10%, a Swiggy-specific outage (say, a 50% error rate on Swiggy-only requests) looks like a 5% error rate at the service level: diluted tenfold, and with nothing on the panel to say that one tenant is carrying all of it. A smaller tenant or a milder failure dilutes below the aggregate alert threshold entirely, and the Swiggy on-call team will hear from Swiggy users before the shared-notifications team learns anything useful from its own dashboard. The mitigation is to graph the four signals both at the aggregate level (for the shared-platform team's dashboard) and at a per-tenant breakdown (for the per-tenant alerting). The four signals do not change; the cardinality of how they are projected does. Most multi-tenant teams discover this only after their first per-tenant outage that the aggregate dashboard missed.
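The dilution is plain weighted arithmetic. The sketch below uses the 90/10 split and the 50% tenant error rate from the paragraph; the tenant labels and request counts are illustrative.

```python
requests = {"fintech": 9000, "delivery": 1000}   # 90% / 10% of traffic
failures = {"fintech": 0, "delivery": 500}       # 50% of delivery requests fail

aggregate_error_rate = sum(failures.values()) / sum(requests.values())
per_tenant_error_rate = {t: failures[t] / requests[t] for t in requests}

print("aggregate: ", aggregate_error_rate)
print("per tenant:", per_tenant_error_rate)
```

The same counters produce both views; the only difference is whether the ratio is taken before or after summing across the tenant label.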

Internal services whose only callers are other services. A service like ledger-write that is called only by other internal services and never by an end user has a four-signal dashboard whose "user" is another service, not a person. The four signals still apply but the latency and error thresholds are tighter: a 200ms p99 that is fine on a user-facing API is unacceptable on an internal write path whose callers are themselves holding user requests open while they wait. The cascade order from earlier (saturation, then latency, then errors, then traffic) also compresses: an internal service has fewer retry layers absorbing slowness, so saturation and latency moving together produce errors in seconds rather than minutes. The four-signal dashboard for an internal service therefore needs tighter thresholds, faster scrape intervals, and more aggressive baseline-relative alerting than the same dashboard for a user-facing service handling the same volume.

Services with multiple distinct SLOs per endpoint. A merchant-api that serves both a user-facing GET endpoint (latency-sensitive, p99 SLO of 200ms) and a bulk-import POST endpoint (throughput-sensitive, no per-request SLO) has a four-signal dashboard at the service level that averages across both shapes and produces a misleading combined view. A latency p99 that aggregates a 10ms GET and a 4-second bulk-POST will show a number that does not represent either workload accurately. The mitigation is to graph the four signals per endpoint on services that have heterogeneous endpoint shapes — the dashboard becomes a small grid of four-signal rows, one row per endpoint class — and to keep the service-level four-signal dashboard only for the aggregate-volume view. This is one of the few cases where the four-signal frame produces more dashboard real estate rather than less, and the discipline is to recognise the heterogeneity before it produces a missed outage.
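How badly an aggregated p99 misrepresents two endpoint shapes depends on the traffic mix, which a few lines make concrete. The nearest-rank `percentile` helper and the request counts are assumptions for illustration; the 10ms and 4-second durations come from the paragraph.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


GET_MS, BULK_POST_MS = 10, 4000


def mixed_p99(get_count, post_count):
    """p99 of one histogram that mixes both endpoint shapes."""
    return percentile([GET_MS] * get_count + [BULK_POST_MS] * post_count, 99)


mostly_get = mixed_p99(995, 5)    # bulk POSTs are 0.5% of requests
some_bulk = mixed_p99(950, 50)    # bulk POSTs are 5% of requests
per_endpoint = {
    "GET": percentile([GET_MS] * 950, 99),
    "bulk POST": percentile([BULK_POST_MS] * 50, 99),
}

print("aggregate p99 at 0.5% POST:", mostly_get)
print("aggregate p99 at 5% POST:  ", some_bulk)
print("per-endpoint p99:", per_endpoint)
```

The aggregate p99 jumps from 10 to 4000 when the mix shifts, with neither endpoint having changed at all; the per-endpoint rows stay put.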

Common confusions

"The four golden signals are the same as RED or USE." RED (rate, errors, duration) is the request-side subset; USE (utilisation, saturation, errors) is the resource-side subset. The four golden signals are the union, and the union is what you need on a user-facing service. RED alone misses saturation as a leading indicator; USE alone misses traffic-drop as a silent failure. Wilkie's RED method is great for stateless web services where saturation is well-approximated by latency; Gregg's USE is great for resources where request-shape is unclear. Neither subset is a substitute for the four together on services that handle user requests and own internal queues.
"Latency means the average response time." Latency in the SRE-book sense is the distribution of successful-request durations, summarised by percentiles — typically p50, p99, and p99.9. The mean is unstable under heavy-tailed distributions and tells you nothing about the user who waited 4 seconds while the median user waited 40ms. Always alert on percentiles, never on the mean. The mean is a leading-indicator of the existence of a tail, not a measurement of it.
"Errors and latency overlap because a slow request might time out." They overlap on edge cases but conceptually they are independent: errors is a discrete count of failures, latency is a distribution over successes. A timeout is an error, not a slow request. Mixing them — by including timeouts in the latency histogram or by computing an error rate over both fast and slow failures together — destroys the diagnostic independence the four signals depend on. Keep them separate; they are separate panels and separate alerts.
"Saturation just means CPU utilisation." Saturation is the utilisation of the most-constrained resource on the service, which on most modern services is not CPU. On a connection-pool-bound service it is connection-pool occupancy. On a queue-bound service it is queue depth. On a thread-pool-bound web service it is in-flight-request count divided by worker capacity. Picking the wrong saturation metric is the most common four-signals mistake — the panel says "we are fine" because CPU is at 30% while the connection pool is at 100% and the service is queueing every request behind a database that is already saturated on its own.
"Traffic is just a vanity metric." Traffic is the only one of the four signals that catches upstream failures — load balancer health-check storms, DNS misconfigurations, and CDN-level rate-limiting all manifest as the service receiving fewer requests than it should. Without a traffic baseline alert, those failures are silent at the service level because the service's own metrics see only the requests that arrived. Traffic-drop alerts (firing when traffic falls below 50% of the 24-hour baseline) are how you catch the LB-decided-you're-unhealthy class of outage.
"You need separate dashboards per signal, or the four signals are only an SRE concern." Neither holds. The four belong together on the top row of the same dashboard because the diagnostic value comes from reading them simultaneously — a fingerprint, not four independent indicators. They are also the smallest shared vocabulary between SRE and product: a product manager who learns to read the four panels can answer their own question about whether a feature-flag rollout is hurting user experience without paging an engineer, because the latency and error panels will tell them within minutes whether the rollout is causing damage. The frame is cross-functional precisely because it is anchored to user-visible behaviour rather than to engineering implementation detail.

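The mean-versus-percentile confusion above is easiest to see on a small heavy-tailed sample. The sample and the nearest-rank helper below are hypothetical; the 40ms and 4-second figures echo the example in the list.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# 98% of requests take 40 ms; 2% take 4 seconds.
durations_ms = [40] * 980 + [4000] * 20

mean_ms = statistics.mean(durations_ms)
p50_ms = percentile(durations_ms, 50)
p99_ms = percentile(durations_ms, 99)

print("mean:", mean_ms, "p50:", p50_ms, "p99:", p99_ms)
```

The mean lands near 119ms, a duration that no request in the sample actually had; the p50 and p99 together describe both the typical user and the user in the tail.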
Going deeper

Why the SRE book uses "saturation" rather than "queue depth"

Ewaschuk's choice of "saturation" rather than the more specific "queue depth" or "thread-pool occupancy" is a deliberate generalisation. A service's most-constrained resource depends on its architecture: a Java web service is usually thread-pool-bound at moderate load and heap-bound at extreme load; a Go service with a fixed worker pool is goroutine-bound; a Python WSGI service under gunicorn is process-pool-bound; a database-backed API is often connection-pool-bound; a streaming pipeline is often Kafka-consumer-lag-bound. The right saturation metric for a service is not a fixed choice — it is a property of which resource hits 100% first under load. The discipline is to find that resource per service and graph its utilisation as a 0..1 ratio, rather than reflexively graphing CPU and hoping it is the bottleneck. The cost-of-getting-this-wrong is the queue-filling-silently scenario from earlier.
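Read as code, "the resource that hits 100% first" is a maximum over utilisation ratios. The resource names and the numbers below are hypothetical, picked to match the CPU-at-30%, pool-at-100% situation described earlier.

```python
def saturation(resources):
    """Return (name, ratio) for the most-constrained resource.

    `resources` maps a resource name to a (used, capacity) pair.
    """
    ratios = {name: used / capacity for name, (used, capacity) in resources.items()}
    name = max(ratios, key=ratios.get)
    return name, ratios[name]


resources = {
    "cpu_cores": (2.4, 8),
    "worker_threads": (150, 200),
    "db_connections": (64, 64),
}

bottleneck, ratio = saturation(resources)
print(f"saturation panel should graph {bottleneck}: {ratio:.2f}")
```

In practice the bottleneck is found once per service through load testing and then graphed directly; taking the maximum on every evaluation is only a way to show the idea.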

The interaction with SLOs and error budgets

The four signals are the measurement substrate underneath Service Level Objectives (covered in chapter 64). An SLO on availability is computed from the errors signal: 1 - (failed requests / total requests) over a 30-day window. An SLO on latency is computed from the latency signal: the fraction of requests completing under threshold T. The traffic signal is the denominator of both ratios, which is why a sudden traffic drop can flatter an SLO computation: requests that never reach the service are never counted as failures, so the ratio holds steady or even improves while users are being turned away upstream. Saturation does not appear directly in user-facing SLOs because it is an internal measurement, but it appears in capacity-planning SLOs such as "we will keep saturation under 80% across the 30-day window so that the service can absorb a traffic spike of up to 25% above current load" (the spare 20% of capacity is 25% of the 80% in use). The four signals are the source data for the SLO discipline; without them, SLOs are computed from secondary metrics that drift.
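The SLI arithmetic, and the way a traffic drop hides inside it, can be worked through in a few lines. The request counts are hypothetical; the 80% saturation figure is the one from the paragraph.

```python
def availability(failed, total):
    """Availability SLI: share of requests that did not fail."""
    return 1 - failed / total


def latency_sli(under_threshold, total):
    """Latency SLI: share of requests that completed under the threshold."""
    return under_threshold / total


def growth_headroom(saturation):
    """Relative traffic growth that fits before the resource reaches 100%."""
    return (1 - saturation) / saturation


normal_week = availability(failed=1_000, total=1_000_000)
# Upstream outage: 600,000 requests never arrive, so they are never counted.
upstream_outage = availability(failed=400, total=400_000)

print("normal week:    ", normal_week)
print("upstream outage:", upstream_outage)
print("latency SLI:    ", latency_sli(under_threshold=990_000, total=1_000_000))
print("headroom at 80%:", round(growth_headroom(0.8), 2))
```

Both weeks report the same availability, which is why the traffic signal has to be alerted on separately rather than trusted as a silent denominator.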

The Razorpay payment-route saturation signal — a real-shape pattern

A hypothetical Razorpay-pattern payments processor running on AWS in 2024 would face the saturation-of-the-right-resource problem acutely. Its payments-router service routes UPI requests across NPCI, ICICI, HDFC, and SBI gateways depending on routing rules and gateway health. The service is CPU-light — most of its time is spent waiting on downstream HTTPS calls — so CPU is not the constraint. The constraint is the per-gateway connection pool: the service holds a fixed pool of 64 keep-alive connections to each gateway, and when a gateway slows down, the connection pool fills before any other resource saturates. The right saturation panel for this service is not node_exporter's CPU graph; it is a per-gateway stacked bar chart of connection_pool_used / connection_pool_capacity, with each gateway as its own series. When the SBI gateway goes through a slow patch — a real-shape pattern that happens roughly twice a quarter on production UPI traffic — the SBI bar goes from 30% to 100% over six minutes, alert fires at 80%, and the routing rules shift load to the other three gateways before user-visible failures occur. The wrong saturation panel — CPU — would have stayed flat throughout and the user would have felt it before the team did.
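The per-gateway saturation panel reduces to one ratio per gateway and one threshold. The pool size of 64 and the 80% alert level come from the scenario above; the in-use counts and function names are hypothetical, and the routing shift itself is left out.

```python
POOL_CAPACITY = 64   # keep-alive connections held per gateway
ALERT_AT = 0.8       # alert threshold on pool occupancy


def pool_saturation(in_use):
    """Occupancy ratio per gateway, one series per gateway on the panel."""
    return {gateway: used / POOL_CAPACITY for gateway, used in in_use.items()}


def gateways_to_drain(in_use):
    """Gateways whose pool occupancy has crossed the alert threshold."""
    return sorted(g for g, ratio in pool_saturation(in_use).items() if ratio >= ALERT_AT)


in_use = {"NPCI": 20, "ICICI": 22, "HDFC": 18, "SBI": 52}
print(pool_saturation(in_use))
print("shift load away from:", gateways_to_drain(in_use))
```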

The Hotstar IPL traffic-as-leading-indicator pattern

A hypothetical Hotstar-pattern OTT service during an IPL final has a traffic signal that does not behave like a normal web service. Traffic spikes 50× in the 15 minutes before kickoff and stays elevated for 4 hours; the four-golden-signals dashboard for the play-stream service has to use a baseline-relative traffic alert rather than an absolute-threshold alert. The right configuration is a ratio of the form rate(requests_total[5m]) / rate(requests_total[5m] offset 7d), where requests_total stands in for the service's own request counter: current traffic divided by same-time-last-week, with alerts firing when the ratio falls below 0.5 or rises above 5. The absolute traffic signal alone would alert constantly during a final and stay quiet during a midweek dud. The traffic-signal discipline therefore generalises from "graph requests per second" to "graph requests per second relative to the appropriate baseline", and the appropriate baseline is service-specific. For a payment service, baseline is "this hour, last week"; for an e-commerce service running the Big Billion Days, baseline is "this hour, last year, plus a planned-uplift factor"; for a B2B service with weekday/weekend skew, baseline is "this hour, last weekday".
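The baseline-relative rule can be sketched as a small function. The 0.5 and 5 bounds are the ones given above; the function name, the return labels, and the sample rates are hypothetical.

```python
def traffic_alert(current_rps, baseline_rps, low=0.5, high=5.0):
    """Classify current traffic relative to a baseline taken from a comparable window."""
    if baseline_rps <= 0:
        return "no-baseline"          # cannot form a ratio; needs a human decision
    ratio = current_rps / baseline_rps
    if ratio < low:
        return "traffic-drop"
    if ratio > high:
        return "traffic-surge"
    return None


# Ordinary evening, upstream failure: traffic at 40% of last week.
print(traffic_alert(400, 1_000))
# Final night compared against an ordinary week: a 50x ratio.
print(traffic_alert(50_000, 1_000))
# Final night compared against a baseline that already includes the planned uplift.
print(traffic_alert(48_000, 50_000))
```

The second and third calls differ only in the baseline, which is the point: the ratio is only as meaningful as the window it is compared against.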

The IPL pattern also stresses the latency signal in a way that reveals a four-signals subtlety: latency p99 during a 50× traffic spike will usually rise even if the service is healthy. The cause is not the larger sample count (more samples sharpen the estimate of a quantile and raise the observed maximum, but they do not move the p99 of an unchanged distribution); it is that higher utilisation lengthens queueing delay on every shared resource, so the distribution itself shifts right. A latency-SLO alert that fires when p99 exceeds an absolute threshold of 200ms can therefore fire constantly during the spike and produce alert fatigue that masks the real outages. The fix has two parts: apply the same baseline-relative discipline to latency, alerting on latency_p99 / (latency_p99 offset 7d) > 1.5 rather than on an absolute number (latency_p99 here stands for a pre-recorded p99 series), and keep separate dashboards for steady-state and burst-window operations. The four signals stay the same; the thresholds on each signal become functions of the time of day and the day of week, and the dashboard exposes the thresholds as a visible reference line rather than hard-coding them in the alert rule. Teams that have not made this generalisation discover, around their first IPL final or their first Diwali Big Billion Days, that absolute thresholds are not a property of the service; they are a property of the service's expected workload, and the workload changes over time.

Coordinated omission, silent-success, and the two ways the latency signal lies

The latency signal can lie in two distinct ways that a four-signal dashboard would otherwise hide. The first is coordinated omission (named by Gil Tene): the measurement artefact where a slow response blocks the load generator from sending the requests it would have sent during the slowness, so those would-have-been-sent requests never appear in the histogram. The result is a p99 that looks much better than the user's actual experience because the worst-affected users (who would have arrived during the slow patch) are missing from the data. The fix is an open-loop generator that holds a constant arrival rate regardless of how the service responds: wrk2 -R200, vegeta attack -rate=200, or any equivalent tool. Closed-loop tools like wrk (which has no -R option), ab, and siege produce CO-biased histograms that flatter the service.
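Coordinated omission can be reproduced with a toy simulation. The model below is deliberately crude and every number in it is an assumption: one connection, one server, 200 requests intended every 10ms, a 1ms service time, and a single 1-second stall at request 50. The closed-loop run waits for each response before sending the next request and times from the actual send; the open-loop run times every request from the moment it was intended to be sent.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]


def service_time(i):
    """1 ms per request, except one request that stalls for a full second."""
    return 1.0 if i == 50 else 0.001


def simulate(n, interval, open_loop):
    latencies = []
    server_free_at = 0.0
    for i in range(n):
        intended = i * interval
        # Closed loop: the generator cannot send until the previous response is back.
        sent = intended if open_loop else max(intended, server_free_at)
        start = max(sent, server_free_at)
        finish = start + service_time(i)
        latencies.append(finish - sent)
        server_free_at = finish
    return latencies


closed = simulate(200, 0.01, open_loop=False)
opened = simulate(200, 0.01, open_loop=True)

print("closed-loop p99:", round(percentile(closed, 99), 3), "s")
print("open-loop p99:  ", round(percentile(opened, 99), 3), "s")
print("requests slower than 100 ms:",
      sum(l > 0.1 for l in closed), "closed vs", sum(l > 0.1 for l in opened), "open")
```

The closed-loop histogram contains exactly one slow sample, so its p99 stays at about a millisecond; the open-loop histogram records the queueing delay of every request that should have been sent during the stall.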

The second is silent-success, where the service returns 200 OK on requests it should have rejected. A payments-api that accepts a malformed UPI VPA and returns {"status": "success", "tx_id": "..."} with a fabricated tx_id will produce a healthy four-signal panel: latency fine, traffic fine, errors zero (the buggy code path returns 200), saturation fine. Users discover the bug days later when the transaction never appears in their statement. The four signals catch this only via the errors dimension on the immediate downstream: if the payment-settlement service that should receive the fabricated tx_id is suddenly receiving a steady stream of unrecognised IDs, its error rate goes up and the upstream's silent-success becomes visible. The discipline is to graph the four signals on a service alongside the four signals on its immediate downstream consumers, and to treat a downstream-error spike that does not correspond to an upstream-error spike as a silent-success indicator. The four-golden-signals dashboard is a per-service surface; the silent-success failure is detected only by reading two adjacent dashboards together.
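The "downstream spikes while upstream stays quiet" rule can be written as a single predicate. Everything here is an assumption for illustration: the function name, the baseline error rates, and the factor of 3 used to decide what counts as a spike.

```python
def silent_success_suspected(upstream_err, downstream_err,
                             upstream_baseline, downstream_baseline,
                             factor=3.0):
    """True when the downstream error rate spikes and the upstream one does not."""
    downstream_spiking = downstream_err > factor * downstream_baseline
    upstream_quiet = upstream_err <= factor * upstream_baseline
    return downstream_spiking and upstream_quiet


# Upstream reports 0.1% errors as usual; settlement errors jump from 0.2% to 6%.
print(silent_success_suspected(0.001, 0.06, 0.001, 0.002))
# Both spike together: an ordinary cascading failure, not silent-success.
print(silent_success_suspected(0.08, 0.06, 0.001, 0.002))
```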

Where this leads next

The four signals are the foundation; the next chapters layer on the discipline of measuring them well and converting them into actionable artefacts. Chapter 77 (/wiki/use-vs-red-dashboards) makes the USE-vs-RED-vs-four-golden taxonomy explicit and shows when each is the right floor for a given service shape. Chapters 78 and 79 cover dashboard hierarchy and the on-call-vs-leadership audience split that the wall-dashboards article opened. Chapter 80 introduces panel arithmetic — the discipline of pre-computing comparisons (burn rate, headroom-remaining, ratio-vs-baseline) so the reader does not have to do arithmetic in the meeting.

Part 13 (/wiki/wall-numbers-mean-nothing-without-targets) onwards converts the four signals into SLOs and error budgets — the four signals are the SLI inputs; the SLOs are the targets; the error budgets are the headroom; the burn-rate alerts are the page. Part 14 covers alerting hygiene; the four-signal discipline naturally produces the four-class alert taxonomy that Part 14 spends a chapter formalising. By the end of this curriculum, the four golden signals are the spine that connects the request-handling chapters (/wiki/wall-logs-alone-cant-stitch-a-request-across-services) to the user-experience chapters via SLOs and humane alerting.

The thread that ties all of these together is that the four signals are not a checklist; they are a frame. A team that adds latency, traffic, errors, and saturation panels to a sprawling pre-existing dashboard has technically satisfied the SRE-book recipe and has gained nothing — the four panels are buried among the eighty-two and the on-call engineer's eye still does not land on them first. The four-signal discipline is, in practice, a discipline of removal — strip the dashboard down to the four, then earn each additional panel by justifying which of the four it explains. Most teams discover that the dashboard is more useful with twelve panels than with eighty-two, and that the four golden-signal panels are doing 90% of the diagnostic work. That ratio — four panels carrying 90% of the load, eight panels carrying the remaining 10% — is the test of whether the discipline has been internalised or whether the dashboard is still inheriting the panels it had before.

One note on the saturation panel is worth pausing on, because it is the signal teams trip over most often. The saturation metric on most production services is the latest scrape value of a gauge, not a rate computed over a window. This makes saturation the only one of the four signals that is read as a point-in-time level rather than a per-second rate. Latency is histogram_quantile over a 1-minute rate; traffic is a 1-minute rate over a counter; errors is a ratio of two 1-minute rates. Saturation, by contrast, is just checkout_saturation_ratio, the most-recently-scraped value of a gauge. The asymmetry matters because saturation can change faster than the scrape interval, and a 15-second-default scrape can miss saturation excursions that complete inside the gap. Production-critical services usually drop the saturation scrape interval to 5 seconds and accept the extra sample volume (three times the samples for that series, with no change in cardinality), because the leading-indicator value of saturation depends on catching the rise before the cascade kicks in, and a 15-second scrape is too coarse for that property to hold.
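A gauge sampled at two intervals shows how an excursion can fall into the gap. The gauge function and the 12-second excursion are hypothetical; the 15-second and 5-second intervals are the ones discussed above.

```python
def gauge(t):
    """Saturation at second t: a 12-second excursion to 95% starting at t=16."""
    return 0.95 if 16 <= t < 28 else 0.60


def scrape(interval_s, duration_s=60):
    """Values a scraper would record at t = 0, interval, 2*interval, ..."""
    return [gauge(t) for t in range(0, duration_s, interval_s)]


every_15s = scrape(15)   # samples at t = 0, 15, 30, 45
every_5s = scrape(5)     # samples at t = 0, 5, 10, ..., 55

print("15s scrape max:", max(every_15s))
print("5s scrape max: ", max(every_5s))
```

The 15-second scraper records four samples of 60% and never sees the excursion; the 5-second scraper lands inside it twice.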

A final framing worth retaining: the four signals are not a technology; they are a language. The technology underneath could be Prometheus and Grafana, or Datadog, or CloudWatch, or hand-rolled SQL queries against a PostgreSQL events table; the language stays the same. A team that switches from Prometheus to Datadog mid-quarter does not have to retrain its engineers on what to look at first because the four-signal frame is portable across backends. A team that switches from a four-signal dashboard to a three-signal dashboard during the same migration loses substantially more institutional knowledge than the technology migration itself caused. The investment is in the frame, not in the tooling, which is why the SRE book's chapter on monitoring distributed systems has aged better than most chapters in most observability books written between 2016 and 2026. The four signals are the part of the curriculum that does not depreciate.

The same portability property holds across organisations. An engineer who has internalised the four-signal frame at one company carries it intact to the next; the new company's services have different metric names, different SLO thresholds, different infrastructure choices, but the four questions the dashboard has to answer are unchanged. The hiring leverage is small but real: a senior SRE who joins a Razorpay-pattern fintech from a Hotstar-pattern OTT background is operationally productive on the four-signal dashboard within their first week, because the frame is shared. The interview question "walk me through how you read this dashboard" produces broadly similar answers from senior engineers at any Indian unicorn that has internalised the discipline, and the answers diverge meaningfully only on the technology and the SLO numbers underneath. The frame is the engineering culture, not the toolchain — and that is what makes the four signals load-bearing rather than fashionable.

The saturation discipline is one of the few places where the four-signal frame demands an opinion about scrape interval per panel rather than per dashboard, and it is the place where teams who treat the four panels as four-of-equal-priority lose information that the other three signals would never have caught.

References

Site Reliability Engineering, Chapter 6: Monitoring Distributed Systems — Rob Ewaschuk's original four-golden-signals chapter, the canonical reference.
The RED Method — Tom Wilkie's request-side framing, a subset of the four golden signals.
"Systems Performance" — Brendan Gregg, Chapter 2.5: USE method — the resource-side framing, the other subset.
Gil Tene — "How NOT to Measure Latency" — coordinated omission, mandatory viewing for anyone graphing latency.
Charity Majors et al — Observability Engineering, Chapter 4 — the modern reframing of golden-signals via high-cardinality events.
/wiki/wall-numbers-mean-nothing-without-targets — internal: how the four signals become SLO inputs.
/wiki/wall-dashboards-are-where-observability-touches-leadership — internal: why panel ordering matters as much as panel content.
Prometheus histogram_quantile() documentation — the quantile interpolation function used in the latency PromQL.

# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
python3 -m venv .venv && source .venv/bin/activate
pip install flask prometheus-client requests
python3 golden_signals_demo.py &
# In another terminal, install wrk2 (brew install wrk2 / apt install wrk2)
wrk2 -t4 -c32 -R200 -d30s -s post.lua http://localhost:8000/checkout
curl -s http://localhost:8001/metrics | grep checkout_