Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

USE vs RED dashboards

It is 03:18 IST on a Saturday and Dipti, an SRE at a hypothetical Bengaluru-based payments platform we will call PaytraOne, is on her second cup of coffee staring at two dashboards on the same monitor. The left dashboard is the one her team built two years ago — payments-host-health — and it follows Brendan Gregg's USE method to the letter. CPU utilisation per node, disk I/O saturation, NIC errors, memory pressure, scheduler runqueue depth — every infrastructure metric on every one of forty-three Kubernetes nodes is a calm green. The right dashboard is the one her predecessor on the SRE team built last quarter — payments-api-red — and it follows Tom Wilkie's RED method. Request rate per endpoint, error rate per endpoint, p99 duration per endpoint. Three rows, six panels, all of them now an aggressive red. The /v1/charge endpoint has been at 18% error rate for eleven minutes and Dipti's PagerDuty alert fired at 03:07 because the RED dashboard's error-rate alert triggered. The USE dashboard caught nothing because the resources are fine — every node has CPU at 45%, every disk has saturation at 12%, every NIC has zero errors. The USE dashboard is technically correct and operationally useless for this particular outage; the RED dashboard caught the user-facing failure in 90 seconds.

USE (utilisation, saturation, errors) and RED (rate, errors, duration) are the two named dashboard methods, each a three-letter recipe for a service-health view. USE describes resources from the inside — useful for capacity planning, kernel-level pathology, and headroom diagnostics. RED describes requests from the outside — useful for user-facing service health, latency-SLO tracking, and traffic-anomaly detection. They are not interchangeable; a stateless web API needs RED at the floor, a memory-bound machine-learning training job needs USE, and a database backing a payments service needs both. The discipline is to pick the right floor for the service shape rather than picking one method as your team's house style.

The two recipes, what each one names, and where they came from

The USE method came out of Brendan Gregg's systems-performance work at Sun and Joyent (and was later carried into his Netflix work), written up as a named method around 2012. Gregg's framing is resource-centric: every resource on a system has utilisation (how busy it is, expressed as a fraction of time the resource was performing work over the measurement window), saturation (how much extra work is queued waiting for the resource), and errors (how many error events the resource has emitted — disk read errors, NIC carrier errors, ECC memory corrections). USE projects these three numbers across every resource you can enumerate — CPUs, disks, NICs, memory channels, GPU cards — and produces a long, structured table where each row is a resource and each column is one of U, S, E. Gregg's USE checklist for a Linux server lists 31 distinct resources, each with its three USE columns, producing a 93-cell health view. The discipline is a systematic enumeration — you have not finished the USE check until every resource on the system has been visited and its three numbers read.

The RED method came out of Tom Wilkie's work at Weaveworks in 2015, generalised from monitoring practice he had absorbed earlier in his career at Google. RED is request-centric: every request-handling service has rate (requests per second — not just successes, the total arrival rate), errors (the rate of failed requests, or the error fraction depending on convention), and duration (the latency distribution of those requests, summarised by percentiles or histograms). RED is concise — three numbers per service — and it scales by service rather than by resource: a 200-microservice estate produces 600 RED numbers (3 per service), whereas USE on the same fleet produces ~6,000 numbers (93 cells × ~64 nodes if each node is its own resource pool). The asymmetry is deliberate: RED is designed to be the same shape across every service in the estate, so an SRE who knows how to read one service's RED dashboard already knows how to read every other service's RED dashboard.

[Figure: USE and RED — two recipes, two viewpoints, the same word "errors". Left column, USE (Brendan Gregg, 2012), resource-centric: U, utilisation, fraction of time the resource was busy (CPU at 0.78, disk at 0.34); S, saturation, extra work queued or waiting (runqueue depth, await time); E, errors, resource-level error events (NIC carrier errors, ECC corrections); enumerate every resource. Right column, RED (Tom Wilkie, 2015), request-centric: R, rate, total requests per second arriving (success plus error); E, errors, failed requests per second or failed fraction (5xx, business-rejected, timeout); D, duration, success-latency distribution (p50, p99, p99.9 from a histogram); project per service. Centre note: "errors" in USE counts NIC packets dropped, in RED it counts user requests that failed. Different units, different alerts.]
Illustrative — USE is a per-resource enumeration discipline; RED is a per-service three-number summary. The shared word "errors" is a small linguistic trap: a NIC carrier error is not a 5xx HTTP response, and a dashboard that uses "errors" without qualifying which method it is in produces ambiguous on-call conversations.

The two methods also have different enumeration disciplines. USE is exhaustive — Gregg's recipe is "go through every resource you can name and check its three numbers; if you cannot enumerate the resource, you cannot apply USE to it". The discipline forces you to find resources you might otherwise forget — the kernel-slab allocator, the file-descriptor table, the per-CPU runqueue migration counter — that are real bottlenecks in the cases that go pathological. RED is selective — Wilkie's recipe is "for each service, three numbers; if you have the service, you have the three numbers". The discipline forces you to standardise across services rather than enumerate within one. The two enumeration disciplines produce dashboards that have very different shapes: a USE dashboard for one host has many rows and three columns; a RED dashboard for many services has one row per service with three columns. Multiplying USE across hosts produces a per-host-per-resource grid; multiplying RED across services produces a per-service grid; the two grids do not naturally fit on the same screen, which is one of the structural reasons why teams that tried to put both on one dashboard ended up with the platform-team-mixing-layers failure mode.

The genealogy matters because the two methods come from different operational pressures. Gregg was instrumenting kernels and JVMs at Joyent, where the failure modes were memory-bandwidth saturation, NUMA-locality misses, scheduler-runqueue pile-ups — failures that have no HTTP-request equivalent, that live entirely below the application layer. USE was the framework that let his team reason about these failures systematically without missing a resource. Wilkie was instrumenting microservices at Weaveworks during the Kubernetes-and-Prometheus boom, where the failure modes were broken deploys, slow downstreams, and traffic spikes — failures that are entirely about user-visible request behaviour and where the kernel almost never matters because the bottleneck has already moved up the stack. RED was the framework that let his team's customers reason about these failures concisely, with a small enough number of metrics that a 200-service dashboard fleet could be maintained without an army.

Why USE and RED both contain the word "errors" but mean different things: USE counts resource-level errors — NIC carrier errors, disk read errors, ECC memory corrections, dropped packets in the IP stack. RED counts request-level errors — HTTP 5xx, gRPC INTERNAL, business-logic rejections, timeouts that never produced an HTTP code at all. A service can have zero USE errors (every disk is healthy, every NIC is clean) and 30% RED errors (every third user request is failing) at the same instant, because the failure mode is in the application logic, not in the hardware. A dashboard that just says "errors" without specifying which kind invites confusion at 03:00 when the on-call engineer has to translate between two adjacent panels in their head. Always label which method's errors you are showing.
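
To see the distinction on actual panels, the two queries below sketch the contrast in PromQL; node_network_receive_errs_total is a standard node_exporter counter, and charge_requests_total is the application-side counter the demo script in the next section defines.

# USE errors: resource-level, from node_exporter, one series per NIC
sum(rate(node_network_receive_errs_total[5m])) by (instance, device)
# RED errors: request-level, from the application's own request counter
sum(rate(charge_requests_total{status="500"}[5m]))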

The two methods also disagree on what "saturation" means versus what "duration" means, in a way worth pausing on. USE saturation is resource queueing — the runqueue depth on a CPU, the number of in-flight I/O operations exceeding the disk's queue depth, the number of TCP connections in the SYN backlog. It is a unit of waiting work, measured at the kernel layer. RED duration is request latency — the wall-clock time a successful request took, measured at the application layer. These two are causally linked (a saturated CPU produces longer durations as requests queue) but they are not the same number, and they are not measured by the same instrumentation. A service that has a fully-saturated thread pool (USE saturation high) but where requests are returning fast because the work itself is cheap (RED duration low) is a real and common pattern; the converse — fast threads but slow requests — is also common when the slowness is in a downstream call rather than in local resources. The two methods do not subsume each other.

A second linguistic subtlety is the asymmetry in how each method handles time. USE numbers are point-in-time samples of resource state — utilisation now, saturation now, error count since last reset. Utilisation and saturation are read as gauges in PromQL terms, evaluated at the latest scrape; the error count is a cumulative counter, but it is still a reading of state at an instant rather than an intrinsically windowed quantity. RED numbers are intrinsically windowed — rate per second over the last minute, error fraction over the last 5 minutes, p99 over the last 10 minutes. RED has no concept of a point-in-time reading because a single request does not have a "rate". The window-vs-point asymmetry produces panel-design choices that look subtle but matter at incident time: a USE saturation spike that lasts 800 ms can be missed by a 15-second-default scrape interval, whereas a RED p99 spike that lasts 800 ms gets absorbed into the latency window and shows up as a small bump rather than a sharp spike. Sleep-deprived eyes read these two panel shapes differently — sharp spikes catch attention, gradual bumps do not — and that difference is built into how each method's metrics are defined, not into the dashboard's styling.

The four golden signals from the previous chapter are essentially the union of USE and RED on the dimensions that matter for user-facing services — latency from RED's duration, traffic from RED's rate, errors from RED's errors (with USE's resource-level errors as a drill-down), and saturation from USE's saturation (typically picking a single most-constrained resource rather than enumerating all of them). The four-signals frame is what you get when you take the diagnostic value of each method and project it onto the user's experience. USE on its own is too low-level for user-facing dashboards; RED on its own misses the saturation leading-indicator. The four-signals union captures both, which is why most modern SRE practice uses the four-signals frame as the leadership-facing surface and keeps USE and RED as drill-down regimes underneath.

Measuring USE and RED side-by-side from the same Python service

The cleanest way to internalise the difference between USE and RED is to instrument one service with both, run a single load test, and look at what each set of numbers does. The script below is a Flask payment-charge service that exposes both a RED-style metric set (request rate, error rate, success-duration histogram) and a USE-style metric set for its single most-constrained resource (the worker pool — utilisation as fraction-busy, saturation as queue depth, errors as worker exceptions). Both are emitted via prometheus-client to the same /metrics endpoint, scraped together, and we walk through what each panel would show under a synthetic load.

# use_vs_red_demo.py — one Flask service, both methods, same /metrics
# pip install flask prometheus-client requests
import random, time, threading
from flask import Flask, jsonify, abort
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# === RED metrics — request-centric ===
RED_RATE = Counter("charge_requests_total",
                    "total charge requests by status",
                    ["status"])  # 200, 500
RED_DURATION = Histogram("charge_request_duration_seconds",
                          "success-latency distribution",
                          buckets=[0.005, 0.01, 0.025, 0.05, 0.1,
                                   0.25, 0.5, 1, 2.5, 5])

# === USE metrics — resource-centric (the worker pool is the resource) ===
WORKER_CAPACITY = 16
USE_BUSY_TIME = Counter("worker_busy_seconds_total",
                         "cumulative seconds workers spent doing work")
USE_SATURATION = Gauge("worker_queue_depth",
                        "requests waiting for an idle worker")
USE_ERRORS = Counter("worker_exceptions_total",
                      "worker-side exceptions by type",
                      ["exception"])

semaphore = threading.BoundedSemaphore(WORKER_CAPACITY)
in_flight = [0]                  # requests admitted and not yet finished
lock = threading.Lock()

@app.route("/charge", methods=["POST"])
def charge():
    arrived = time.perf_counter()        # RED duration clock starts at arrival
    with lock:
        in_flight[0] += 1
        # saturation = work beyond what the 16-worker pool can absorb right now
        USE_SATURATION.set(max(0, in_flight[0] - WORKER_CAPACITY))
    semaphore.acquire()                  # blocks while all 16 workers are busy
    work_start = time.perf_counter()     # USE busy-time clock starts here
    try:
        if random.random() < 0.04:       # 4% simulated worker failures
            USE_ERRORS.labels("timeout").inc()
            RED_RATE.labels("500").inc()
            abort(500)
        time.sleep(random.expovariate(1/0.06))      # 60ms mean service time
        now = time.perf_counter()
        USE_BUSY_TIME.inc(now - work_start)         # only time a worker was occupied
        RED_RATE.labels("200").inc()
        RED_DURATION.observe(now - arrived)         # queueing + service, as the caller sees it
        return jsonify(ok=True)
    finally:
        semaphore.release()
        with lock:
            in_flight[0] -= 1
            USE_SATURATION.set(max(0, in_flight[0] - WORKER_CAPACITY))

if __name__ == "__main__":
    start_http_server(8001)
    app.run(port=8000, threaded=True)

After 60 seconds of wrk2 -t4 -c64 -R250 -d60s -s post.lua http://localhost:8000/charge, scraping /metrics produces output that splits cleanly into the two methods. The RED numbers and the USE numbers are both real, both correct, and they answer different questions:

# Sample run output (curl localhost:8001/metrics | grep -E '^(charge_|worker_)')
# === RED-side ===
charge_requests_total{status="200"} 14387
charge_requests_total{status="500"} 613
charge_request_duration_seconds_bucket{le="0.05"} 7912
charge_request_duration_seconds_bucket{le="0.1"} 12104
charge_request_duration_seconds_bucket{le="0.25"} 14201
charge_request_duration_seconds_bucket{le="+Inf"} 14387
charge_request_duration_seconds_count 14387
charge_request_duration_seconds_sum 1142.66

# === USE-side ===
worker_busy_seconds_total 879.42
worker_queue_depth 0.0
worker_exceptions_total{exception="timeout"} 613

Walking through the load-bearing lines: RED_RATE = Counter(... ["status"]) is the one-counter-two-signals pattern from the four-golden-signals chapter — sum(rate(charge_requests_total[1m])) gives R, sum(rate(charge_requests_total{status="500"}[1m])) / sum(rate(...)) gives E, and the success histogram gives D. USE_BUSY_TIME tracks cumulative seconds workers spent inside the work — utilisation in PromQL becomes rate(worker_busy_seconds_total[1m]) / 16 (busy-seconds-per-second divided by capacity, producing a 0..1 ratio). USE_SATURATION = Gauge("worker_queue_depth") is the saturation panel — it is positive only when more requests are arriving than the 16-worker pool can serve immediately. Why USE saturation is "queue depth minus capacity" not just "queue depth": a pool with 16 workers that has 16 in-flight requests is at 100% utilisation but zero saturation — there is no work waiting. Saturation only counts the excess — the queueing beyond what the resource can absorb. The asymmetry is what lets utilisation and saturation coexist as two different numbers; mixing them produces a metric that is neither.
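
A minimal sketch of the six floor-panel queries against the demo's metrics, assuming Prometheus scrapes port 8001; panel titles and windows are illustrative. Note that the duration histogram measures arrival-to-response, including any wait for a free worker, while worker_busy_seconds_total counts only the time a worker was occupied, which is why the busy-seconds total and the duration sum in the sample output differ.

# RED floor: R, E, D from the request counter and the duration histogram
sum(rate(charge_requests_total[1m]))                                 # R: total request rate
sum(rate(charge_requests_total{status="500"}[1m]))
  / sum(rate(charge_requests_total[1m]))                             # E: error fraction
histogram_quantile(0.99,
  sum(rate(charge_request_duration_seconds_bucket[5m])) by (le))     # D: p99 duration

# USE drill-down: the worker pool as the resource
rate(worker_busy_seconds_total[1m]) / 16                             # U: 0..1 utilisation
worker_queue_depth                                                   # S: queued beyond capacity
sum(rate(worker_exceptions_total[5m])) by (exception)                # E: resource error rate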

The interesting observation comes from running the script under four different load profiles and reading both sets of panels:

  • 100 RPS steady (well below capacity). RED: rate=100/s, errors=4%, p99=180ms. USE: utilisation=0.38, sat=0, errors=4. Reading: both calm; the service is healthy by both methods.
  • 250 RPS steady (near capacity). RED: rate=250/s, errors=4%, p99=240ms. USE: utilisation=0.94, sat=2-5, errors=10. Reading: RED says "fine, p99 a bit higher"; USE says "saturation rising, about to break". USE is the leading indicator.
  • 250 RPS but downstream slow. RED: rate=250/s, errors=4%, p99=2400ms. USE: utilisation=0.96, sat=18, errors=10. Reading: RED catches the latency degradation; USE catches that workers are blocked, not that they are doing more work.
  • Burst of 800 RPS for 10 seconds. RED: rate spikes to 800/s, errors spike to 22%, p99=4800ms. USE: utilisation=1.0, sat=42, errors=200. Reading: both go red simultaneously because the burst exceeds capacity by enough that the failures are visible at both layers.

The pattern: when the bottleneck is resource exhaustion at the local service, USE leads RED by 60–120 seconds — saturation rises before duration does, and a USE-aware team has time to add capacity or shed load before users see slowness. When the bottleneck is application logic or downstream slowness, RED leads USE — durations climb without local saturation rising, because the workers are blocked on a slow external call rather than busy with local work. The two methods have different temporal sensitivities, and their value depends on which class of failure your service tends to produce.

Why kernel-level USE can stay flat during a downstream-induced latency spike: a worker that is waiting for an HTTP response from a downstream service is not "doing work" in the kernel scheduler's sense — it is parked on a futex or epoll, contributing nothing to runqueue depth or CPU utilisation. The machine-level USE panels (CPU, memory, disk) read as healthy because, locally, the service is mostly idle; the downstream is the bottleneck. The worker-pool USE row in the demo does register the blockage (utilisation 0.96, saturation 18 in the table above), because a pool, unlike a CPU, counts a parked worker as occupied; that is exactly why the pool was chosen as the demo's resource. RED captures the failure either way because the request is slow regardless of where the slowness lives. A dashboard built only from machine-level USE misses entire classes of distributed-systems failures because USE was designed for single-machine resource exhaustion before microservices made downstream calls a dominant source of latency.

The script also demonstrates an important nuance: a single service often has both a USE story (its own worker pool, its own database connection pool, its own memory) and a RED story (the requests it serves), and the two are independent. A team that reflexively builds only one type of dashboard misses the other dimension. The right pattern, which we will cover in the next H2, is to build both — but to know which one is the floor (the dashboard you read first) versus which is the drill-down (the dashboard you reach for after the floor goes red).

When to use which: a decision rule by service shape

The two methods are not interchangeable, and the team that picks one as the house style and applies it to every service in the estate ends up with dashboards that are wrong for half of them. The decision rule is by service shape — what the service does, what its constrained resource is, and who reads the dashboard.

[Figure: decision tree — pick the floor by service shape. Start: what does the service do? User-facing HTTP/gRPC with a latency SLO: RED at the floor, USE as drill-down. Database or storage (Postgres, Redis, S3): both at the floor, two-row dashboard. Queue or stream processor (Kafka consumer, CDC): RED with lag standing in for duration, plus USE on the consumer pool. Batch job or ML training (no per-request latency): USE at the floor, throughput as a bonus. Kernel- or hardware-bound work (eBPF, GPU compute): USE at the floor, no request shape. Footer: picking the wrong floor produces a dashboard that stays green during the outage you needed it to catch. A Razorpay UPI router with a USE-only floor would miss every NPCI-slowness incident; a Hotstar transcoding cluster with a RED-only floor would miss every GPU-saturation incident. Illustrative, based on real-shape decision patterns observed across Indian unicorn SRE teams 2022-2026.]
Illustrative — the floor is the dashboard your on-call engineer reads in the first 10 seconds. The drill-down is what they reach for after the floor goes red. A wrong-floor dashboard stays calm during exactly the outages it needed to catch.

The first row of the decision tree — user-facing HTTP/gRPC with a latency SLO — covers most services in a typical Indian fintech or e-commerce estate. A Razorpay payments-router, a Flipkart cart-api, a Swiggy delivery-tracker, a Zerodha Kite order-router — all of these have an SLO denominated in user-visible latency, and all of them have failure modes (broken deploy, slow downstream, error spike) that are captured by RED panels and missed by USE panels. The team's instinct should be RED at the floor and USE as a drill-down underneath. A USE-only dashboard for a payments-router will go green during every NPCI-slowness incident because the local resources are fine while every user request is timing out — exactly the wrong-floor failure mode.

The second row — database or storage system — is the case where both methods belong at the floor. A Postgres database has both a RED story (queries arrive at a rate, queries fail, query latency is the duration) and a USE story (the buffer pool fills, the WAL queue saturates, IOPS errors accumulate at the disk layer). A Redis cluster has the same shape: RED on GET/SET requests, USE on memory utilisation and eviction queue. A team that puts only RED on the database dashboard misses memory pressure and IOPS saturation, both of which are leading indicators of latency degradation; a team that puts only USE on the database dashboard misses query-error spikes from broken application logic that the database could not have prevented. The right dashboard for payments-postgres-primary has a top row of RED panels (query rate by query type, error rate by error class, p99 latency by query class) and a second row of USE panels (buffer-pool fill ratio, WAL-write saturation, disk IOPS utilisation, replication-lag saturation) — both at the floor, side by side.

The third row — queue or stream processor — is the adaptation case from the four-golden-signals article. A Kafka consumer at Swiggy ingesting delivery-rider GPS pings, a CDC pipeline at PhonePe streaming Postgres wal2json events to Snowflake, an event-driven worker at Hotstar processing video-encoding jobs — none of these have a request-response shape, so the RED method's "duration" gets substituted with consumer lag (records-per-second behind the producer) and "rate" stays as records-consumed-per-second. The USE side stays meaningful: the consumer's worker pool has utilisation and saturation, and the consumer's broker connection pool has utilisation. A queue-driven service therefore needs both — a RED-with-lag panel at the floor for "are we keeping up?" and a USE panel below for "do we have headroom to catch up if we fall behind?". The team that builds only RED-with-lag will not see the worker pool saturating before lag visibly grows; the team that builds only USE will not see lag growing if the service is healthy enough that workers are not saturated.
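
A hedged sketch of the RED-with-lag floor for a consumer like delivery-event-processor: the consumed/failed counters are hypothetical application-side metrics, kafka_consumergroup_lag is the gauge exposed by the commonly used kafka_exporter (names differ across exporters, and lag-in-seconds needs a separate derivation), and the 8-worker pool capacity is an assumption.

# R: records consumed per second (hypothetical application counter)
sum(rate(delivery_events_consumed_total[1m]))
# E: records that failed processing, per second (hypothetical application counter)
sum(rate(delivery_events_failed_total[1m]))
# D-substitute: records behind the producer for this consumer group
sum(kafka_consumergroup_lag{consumergroup="delivery-event-processor"})
# USE on the consumer's worker pool, same shape as the demo script's pool
rate(consumer_worker_busy_seconds_total[1m]) / 8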

The fourth row — kernel or hardware-bound work — is the case where USE belongs at the floor and RED is either irrelevant or a low-priority drill-down. A GPU transcoding cluster at Hotstar that ingests video files and emits transcoded variants does not have an HTTP-request shape worth measuring per-request — the work is a stream of GPU jobs, and the constrained resource is the GPU cards. The USE panel set (GPU utilisation per card, GPU memory saturation, ECC error counts) is the floor; the RED-style "rate of jobs completed" panel exists but is a secondary view because it does not capture why the throughput is what it is. The decision rule here is simple: if the service has no per-request SLO, USE is the floor.

The fifth row — batch jobs and ML training — also goes to USE. A nightly Airflow DAG that runs an Apache Spark aggregation, a Vertex AI training job on a TPU pod, a daily Hadoop job at IRCTC computing booking statistics — none of these have a per-record latency SLO that the user feels. They have a job-completion deadline (the "report must be ready by 09:00 IST" property), and they have resources that determine whether they will hit the deadline (CPU, memory, disk I/O, network bandwidth). USE on the constrained resource is the floor; a coarse "throughput" panel (records-processed-per-second, training-step rate) is a useful drill-down but not the floor. A dashboard for a 2 TB Spark aggregation that shows only "records processed per second" at the floor will not help the on-call team answer "why is this 4× slower than yesterday's run" — that question is answered by the USE panels (executor memory saturation, shuffle-disk I/O saturation, network bandwidth utilisation), not by the throughput number.

The sixth implicit row — services that change shape over their lifetime — deserves a paragraph. A service that starts as a stateless API (RED at floor) and grows a database read-replica behind it (now needs both RED and USE on the database) and eventually adds a Kafka consumer for asynchronous notifications (now needs RED-with-lag for the consumer in addition to RED for the API and USE for the database) is a real trajectory for any service that ships features for two years. The team that built a RED-only dashboard at year zero and never revisited it is operating with a year-zero floor in a year-three service. The dashboard is a living artefact — its shape should track the service's shape, and the question "is RED still the right floor for this service?" is worth asking at every quarterly architecture review, not just at service-creation time.

A useful tactical addition: most teams discover, after operating for 18-24 months at scale, that the floor decision is not made once but is re-made every time the service grows a meaningful new dependency. A payments-router that was originally a stateless service calling a single Postgres database (RED at floor, USE on the database as drill-down) sprouts a Redis cache (now Redis needs USE drill-down too), then an SQS queue for asynchronous notifications (now needs RED-with-lag for the queue consumer), then a circuit-breaker against the NPCI rate-limit (now needs an external-resource USE row for the NPCI connection pool). After 18 months the service has 4 distinct operational personas — the stateless API, the database client, the Redis client, the queue consumer — each with its own floor decision. A single dashboard cannot serve all four cleanly; the discipline that has worked at multiple Indian fintechs is to build 4 dashboards linked from one landing page, with the landing page having a 4-tile summary (one per persona) where each tile shows the floor signals for that persona, and clicking a tile lands you on the full per-persona dashboard. The landing page is itself a fifth dashboard — a meta-dashboard whose floor signals are aggregations of the four sub-dashboards' floors.

A subtler property: the audience shifts the answer slightly even within a service shape. A leadership-facing dashboard for a user-facing API is RED at the floor (because leadership reads user-experience signals); the on-call dashboard for the same service might add USE alongside (because on-call cares about both), and the platform-team dashboard might lead with USE on the underlying Kubernetes nodes (because the platform team owns the resources). Three audiences, three dashboards, sometimes the same service — and the floor differs across them because the questions differ. The discipline is to recognise that "USE vs RED" is not a single answer per service; it is a per-audience-per-service decision, and the team that treats it as a single team-wide style decision underserves at least one audience.

What goes wrong when teams pick the wrong floor

The wrong-floor failure mode is operationally invisible until it produces an outage that the dashboard should have caught and didn't. The post-mortems read remarkably similarly across companies and across years, and the pattern is worth naming because it is the most-common dashboard-architecture mistake.

The USE-only fintech. A hypothetical Mumbai-based payment processor we will call PaiseDirect has an infrastructure team that came from a server-administration background — the team's dashboard tradition is per-host CPU, memory, and disk graphs, comprehensively built out for every one of 84 production nodes. When PaiseDirect ships a new microservice, the infrastructure team adds it to the existing USE-style dashboard and considers the dashboarding work done. At 17:14 IST on a Thursday, a deploy of merchant-onboarding introduces a regex bug that causes 23% of merchant signups to return 500. Every node's CPU is at 35%, every disk is at 18% saturation, every NIC has zero errors — the USE dashboard is calm green. The error rate is invisible because there is no RED panel. The merchant-success team discovers the issue 47 minutes later when a Slack channel from the sales team escalates that signup conversions have collapsed. The post-mortem identifies "no error-rate panel on the merchant-onboarding service" as the contributing factor and the team adds RED panels — but the underlying cause was the floor choice: USE was the wrong floor for a user-facing service, and the absence of RED was a symptom of that choice rather than an oversight.

The RED-only OTT platform. A hypothetical Bengaluru-based OTT platform we will call BingeWala has an SRE team that came from a microservices-and-Prometheus background — the team's dashboard tradition is RED panels per service, replicated across 130 services, with great hygiene around request-rate, error-rate, and latency-percentile panels. The team has no USE panels at all on its transcoding-pipeline service because "the transcoding service is a queue consumer, RED-with-lag covers it". At 23:48 IST on the night of an IPL final, the transcoding pipeline starts falling behind — lag climbs from 12 seconds to 9 minutes over 25 minutes. The RED-with-lag panel turns red and the on-call team is paged. They look at the dashboard and see lag rising, but they have no insight into why — there is no panel showing GPU utilisation, GPU memory saturation, or NVLink bandwidth saturation. The team spends 18 minutes SSHing into transcoding nodes and running nvidia-smi by hand to discover that one of the four GPUs in each node is at 100% utilisation while the other three are at 30%, because of a scheduling bug in the job dispatcher. The post-mortem identifies "no GPU USE panels on the transcoding-pipeline dashboard" as the contributing factor — the wrong-floor symptom again, RED was the wrong floor for a hardware-bound service and the absence of USE was the consequence.

The single-method-floor database dashboard. A hypothetical Pune-based SaaS company we will call DeshDataco runs a Postgres-based multi-tenant analytics service. Its dashboard for the primary database is RED-only — query rate, query error rate, p99 query latency. At 10:20 IST on a Monday, the buffer cache hit ratio drops from 97% to 62% over six hours because a new tenant's query pattern is full-table-scanning a 480 GB table that does not fit in the buffer pool. The RED panels show p99 query latency creeping up — from 80 ms to 240 ms — but the rise is gradual and the latency-SLO alert (set at p99 > 500 ms) does not fire. Meanwhile, every other tenant's queries are slower because they are competing for the buffer pool with the new tenant's full-scans. By 16:00 IST, three other tenants have escalated complaints about their dashboards being slow. The team eventually correlates the slowness to the new tenant after manually pulling pg_stat_database numbers — work that a USE panel set on the database (buffer cache hit ratio, shared-buffer eviction rate, query-by-tenant CPU utilisation) would have surfaced in 30 seconds. The post-mortem identifies "no USE panels on the primary database dashboard" — the database needed both methods at the floor, and RED-only was insufficient.

The queue-consumer team that read RED literally. A hypothetical Gurugram-based food-delivery platform we will call FastBite ran a Kafka consumer service delivery-event-processor that ingested rider GPS pings, restaurant order updates, and customer cancellations. The team had read the RED method writeup and built a textbook three-panel RED dashboard for the service: requests-per-second, error-rate, p99 duration. The "duration" panel measured wall-clock time per Kafka record processed — a number that was always 12-18 ms and almost never moved. At 14:30 IST on a Saturday during a lunchtime promotion, the consumer fell 47 minutes behind the producer because a downstream Snowflake write started rate-limiting; the consumer was retrying writes in a tight loop, producing high CPU but normal per-record duration (each retry took 12 ms). RED's duration panel stayed calm; the team's monitoring did not catch the lag because they had not adapted RED to the queue-consumer shape — there was no kafka_consumergroup_lag_seconds panel. The post-mortem identified that "RED with per-record duration was the wrong adaptation for a queue consumer; lag is the right duration analogue". The fix was to substitute lag for duration in the RED triplet and add a USE-style panel for the consumer's worker pool. The wrong-shape RED dashboard had every right panel except the one that mattered.

The platform-team dashboard that mixes layers. A hypothetical Hyderabad-based logistics platform we will call CargoConnect runs a Kubernetes-based microservices platform with about 90 services. The platform team builds a single "platform overview" dashboard that has node-level USE panels for the underlying Kubernetes nodes and per-service RED panels for the top-traffic services on the same dashboard. The dashboard is 56 panels. When the on-call platform engineer is paged, they have to mentally separate "is this a node problem or a service problem" by reading panels in two different conceptual frames simultaneously. The post-mortem from an outage where a node-level memory pressure issue cascaded into per-service latency degradation identifies that the on-call engineer wasted 12 minutes on the per-service panels before realising the root cause was at the node layer. The fix was to split the dashboard into a USE-only "platform health" dashboard (the floor for platform-team on-call) and a RED-only "service overview" dashboard (the floor for service-team on-call), with cross-links between them. A single dashboard mixing both methods at the floor is rarely the right answer; the dashboard becomes hard to read because the two methods compete for the eye's attention rather than complementing each other.

A theme worth pulling out across these failure modes is team origin. The USE-only fintech came from a server-administration tradition where infrastructure metrics were the language; the RED-only OTT came from a microservices-and-Prometheus tradition where request metrics were the language; the database-RED-only team came from an application-engineering tradition where database internals were considered "ops territory" and not their concern; the FastBite Kafka team had read the RED writeup literally without adapting it to their service shape; the platform team mixed layers because the platform team's purview spans both layers and they did not separate the dashboards. In every case the origin of the team — the engineering culture they came from, the dashboards they had used at previous companies, the methods they had read about — became the default house style, and the default house style was wrong for at least one service shape in their estate. The fix in every case was to disentangle "this is how our team has always built dashboards" from "this is what this service needs". The disentangling is conscious work; it does not happen automatically as a team grows. The team that audits its dashboards every six months and asks, per service, "is the floor still right for this service's current shape?" is operating one floor up from the team that does not.

The common thread across all five failure modes: the wrong-floor dashboard is technically populated correctly — the panels exist, the metrics flow, the alerts are wired up — and yet the dashboard goes green during the outage it needed to catch. The fix is not to add more panels; the fix is to recognise the floor as a separate design decision from the content and to choose the floor by service shape rather than by team tradition. A team that has standardised on one method as the house style across all services has saved itself the cost of thinking about the choice per service; it has paid for that saving with a slow accumulation of wrong-floor dashboards across the estate, each one waiting to fail silently.

What "right floor" looks like in practice — three sketched dashboards

To make the decision tree concrete, here are the floor-row sketches for three real-shape services from the Indian tech estates discussed above. The sketches are layout descriptions rather than literal Grafana JSON; the idea is to ground the decision rule in actual panels rather than leaving it as theory.

Sketch one — Razorpay-pattern payments-router (RED at floor, USE drill-down). Top row: 3 large panels, left to right — request rate stacked by gateway (NPCI/ICICI/HDFC/SBI), error rate as fraction with a 0.5% SLO line, p99 latency by endpoint with the 200 ms SLO line. Second row: USE drill-down — per-gateway connection-pool utilisation as 4 small panels, each with its own 80% threshold. Third row: per-host CPU/memory USE for the router pods themselves. The floor is row 1; rows 2 and 3 are clicked into when row 1 goes red. A leadership-facing summary dashboard would show only row 1, larger; an SRE on-call dashboard would show all three rows; a platform-team dashboard would have row 3 expanded across all 12 router pods.
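
The floor row of sketch one, expressed as queries with hypothetical router metric names (payments_router_requests_total with a gateway label, and a per-endpoint duration histogram); the SLO lines are drawn as panel thresholds rather than computed in PromQL.

# panel 1: request rate stacked by gateway
sum(rate(payments_router_requests_total[1m])) by (gateway)
# panel 2: error fraction against the 0.5% SLO threshold
sum(rate(payments_router_requests_total{status=~"5.."}[1m]))
  / sum(rate(payments_router_requests_total[1m]))
# panel 3: p99 latency by endpoint against the 200 ms SLO threshold
histogram_quantile(0.99,
  sum(rate(payments_router_request_duration_seconds_bucket[5m])) by (le, endpoint))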

Sketch two — Hotstar-pattern transcoding-pipeline (USE at floor, RED-with-throughput drill-down). Top row: 3 large panels — GPU utilisation per card across the 8-GPU cluster as a heatmap (8 rows, 1 column per minute), GPU memory saturation as a stacked bar, NVLink bandwidth utilisation as a line chart. Second row: throughput in jobs-per-second and ECC error rate as small panels. Third row: video-codec-specific drill-downs (H.264, H.265, AV1 jobs separately). The floor is row 1 because the constrained resource is GPUs; the user does not feel per-job latency directly because transcoding is asynchronous. A team that put RED at the floor here would have walked straight into the FastBite-shaped failure mode, where lag rises without per-job duration moving.

Sketch three — payments-postgres-primary database (both methods at floor). Two top rows, both at the floor. Row 1 (RED): query rate by query class (read/write/aggregation), error rate by error class (constraint-violation/timeout/connection-refused), p99 query latency by query class. Row 2 (USE): buffer-pool hit ratio with 95% threshold, WAL-write saturation, replication-lag in seconds, IOPS errors. The two rows together are seven panels, all at the floor — neither row alone catches everything. Drill-downs go below into per-tenant query breakdowns and per-table buffer-pool occupancy. The dual-floor structure is what catches both broken-application-deploy errors (RED row) and tenant-induced buffer-pool eviction (USE row).
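
One of the sketch-three USE panels expressed as a query, assuming the prometheus-community postgres_exporter's default pg_stat_database counters; the remaining USE panels (WAL saturation, replication lag, IOPS errors) come from exporter- and node-level metrics whose names vary by version, so they are omitted here.

# buffer-pool hit ratio per database: 1.0 means every block was served from shared buffers
sum(rate(pg_stat_database_blks_hit[5m])) by (datname)
  / (sum(rate(pg_stat_database_blks_hit[5m])) by (datname)
     + sum(rate(pg_stat_database_blks_read[5m])) by (datname))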

The pattern across all three: the floor row has 3-8 panels, each panel large enough to read at a glance, and the drill-down rows live below — visually smaller, denser, and intentionally not the first thing the eye lands on. The floor's panel count caps at around 8 because beyond that the 15-second leadership-reading budget breaks down (each panel gets less than 2 seconds of attention). A floor row with 12 panels is functionally a drill-down row that has been mislabelled as a floor; the discipline is to demote the secondary panels and keep the floor row tight.

Common confusions

  • "USE and RED are competing methods; you must pick one." They are complementary. RED is per-service, USE is per-resource. A user-facing service needs RED at the floor with USE drill-downs underneath; a database needs both at the floor; a hardware-bound service needs USE at the floor. The "competing" framing is wrong — they are different views of the same system, optimised for different audiences and different failure modes. The choice is not "which method"; the choice is "which method is the floor for this service".
  • "The 'errors' in USE and the 'errors' in RED are the same." They are not. USE errors are resource-level — disk read errors, NIC carrier errors, ECC corrections, dropped packets in the IP stack. RED errors are request-level — HTTP 5xx, gRPC INTERNAL, business-logic rejections, timeouts. A service can have zero USE errors and 30% RED errors simultaneously, because the failure mode is in the application code, not in the hardware. Always label which method's errors you are showing — "request errors" for RED, "resource errors" for USE — to avoid the on-call confusion at 03:00.
  • "USE saturation is the same as RED duration." They are causally linked but not identical. USE saturation is resource queueing (runqueue depth, in-flight I/O exceeding queue depth, SYN backlog) measured at the kernel layer. RED duration is request latency measured at the application layer. A service can have fully-saturated thread pools but fast request durations (because the work is cheap and queueing is brief) or low saturation with slow durations (because the slowness is in a downstream call, not local resources). The two are different numbers from different layers.
  • "USE works for microservices." USE was designed for resource-exhaustion-bound single-machine workloads — kernels, JVMs, databases, hypervisors. On a stateless microservice whose primary failure modes are downstream slowness and broken deploys, USE is at best a drill-down because the local resources almost never matter — the bottleneck has moved to network calls. Forcing USE to be the floor for microservices produces the USE-only-fintech failure mode from the previous section. RED was specifically invented for the microservices regime that USE could not cover.
  • "RED works for databases." RED is necessary but not sufficient for databases. A Postgres primary's RED panels (query rate, query error rate, query p99 latency) miss buffer-pool saturation, WAL-write saturation, replication-lag saturation, and IOPS-error counts — all of which are USE signals and all of which are leading indicators of query latency degradation. Databases need both methods at the floor in a two-row dashboard; the RED-only failure mode is the DeshDataco scenario from the previous section.
  • "The four golden signals are just RED with one extra." The four signals (latency, traffic, errors, saturation) are the union of RED and USE on the dimensions that matter for user-facing services — latency = duration from RED, traffic = rate from RED, errors take both methods' errors with appropriate scoping, and saturation comes from USE (typically the most-constrained resource rather than every resource). The four-signals frame is what you get when you take the diagnostic value of each method and project it onto the user's experience. RED on its own misses the saturation leading-indicator; USE on its own misses the request-shape signals. The four signals are the practical synthesis.

Going deeper

Why USE has 31 resources and RED has 3 numbers — the asymmetry of complexity

Gregg's USE checklist enumerates 31 distinct resources for a Linux server (CPU, memory, network interfaces, disks, controllers, kernel data structures, fork-and-exec rates, and more), each producing 3 numbers (U, S, E), for 93 cells per host. The asymmetry with RED's 3-numbers-per-service is not an accident — it reflects the difference in complexity between the resource layer and the request layer. The resource layer is genuinely high-dimensional: a modern Linux server has many independent resources, each with its own failure mode, and a comprehensive USE check has to enumerate them because there is no single "the resource is fine" summary that captures all 31. The request layer is genuinely low-dimensional from the user's perspective: a request either succeeded or failed, and it took some amount of time. RED's three numbers are the irreducible summary at the request layer; USE's 93 cells are the irreducible summary at the resource layer. A team that wants RED-style brevity at the resource layer is asking for something the resource layer cannot provide; a team that wants USE-style comprehensiveness at the request layer is asking for something the request layer doesn't need. The two methods' shapes reflect the structural complexity of the layer each one operates on.

USE in the Indian fintech regime — when the constrained resource is not on your machine

A subtlety worth pausing on: USE was designed for a regime where the constrained resource is on the machine — CPU, memory, disk, NIC. In modern fintech, the constrained resource is often not on your machine: it is the connection pool to NPCI's UPI endpoint, the rate-limit budget on the Flipkart partner API, the gRPC concurrency cap on the downstream ledger-write service, the database connection slot on the shared payments-postgres cluster. These constrained resources are accessed over the network and are saturated remotely, not locally. USE's framing — "enumerate the resources on the machine and read their three numbers" — does not naturally cover them. The adaptation is to extend USE's resource enumeration to external resources too: a row for "NPCI UPI connection pool" with U/S/E columns measured by "fraction of the local 64-connection-pool currently in use", "requests waiting for a free pool slot", "5xx responses from NPCI". A row for the "ledger-write gRPC concurrency cap" with U/S/E measured similarly. The discipline of USE generalises across the network if you let it; the team that does not generalise it ends up with USE dashboards that are honest about the local machine and silent about the much-more-likely failure points beyond it. Modern Indian-fintech USE dashboards that I have seen work well typically have 3-5 "external resource" rows alongside the local-resource rows, and the external rows are where the actual saturation events tend to happen.
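
A minimal prometheus-client sketch of one such external-resource row, in the same style as the demo script; the metric names, the 64-connection cap, and the call_npci wrapper are illustrative, not a real NPCI client.

# external_use_row.py: U/S/E columns for a remote resource (an outbound connection pool)
from prometheus_client import Counter, Gauge

NPCI_POOL_SIZE = 64                                   # illustrative local cap on the pool
POOL_IN_USE = Gauge("npci_pool_connections_in_use",
                    "connections currently checked out of the NPCI pool")
POOL_WAITERS = Gauge("npci_pool_waiters",
                     "requests waiting for a free NPCI pool slot")
UPSTREAM_ERRORS = Counter("npci_upstream_errors_total",
                          "timeouts and 5xx responses from NPCI, by kind",
                          ["kind"])

# panels: U = npci_pool_connections_in_use / 64, S = npci_pool_waiters,
#         E = rate(npci_upstream_errors_total[5m])

def call_npci(pool, request):
    POOL_WAITERS.inc()
    conn = pool.acquire()             # blocks while all 64 connections are busy
    POOL_WAITERS.dec()
    POOL_IN_USE.inc()
    try:
        return conn.send(request)     # hypothetical client call
    except TimeoutError:
        UPSTREAM_ERRORS.labels("timeout").inc()
        raise
    finally:
        POOL_IN_USE.dec()
        pool.release(conn)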

The actual USE checklist on Linux, and what "utilisation" really means

Gregg's published USE checklist for a Linux server is worth knowing in detail. Resources include: CPU (utilisation = 100 - %idle from mpstat, saturation = runqueue depth from vmstat's r column, errors = ECC error count from mcelog), memory (utilisation = used/total, saturation = swap-in rate from vmstat's si/so, errors = ECC corrections), disk-per-device (utilisation = %util from iostat, saturation = await time and aqu-sz, errors = SMART error count), NIC-per-device (utilisation = bandwidth-in-use over interface-capacity, saturation = dropped packets from ifconfig, errors = errors from ifconfig), file-descriptor-table (utilisation = used/limit from /proc/sys/fs/file-nr, saturation rare, errors = EMFILE from application logs), kernel slab (utilisation from /proc/slabinfo, saturation harder to get, errors rare). The checklist also includes resources that surprise people: kernel mutex contention via lock-stats, scheduler runqueue migration rate via sar -w, kernel stack-depth saturation via /proc/<pid>/stack. The full checklist is at brendangregg.com/usemethod.html. Most teams that say they "use USE" implement maybe 8 of the 31 rows — a reasonable trade-off, but worth being honest about.

A subtle point that even experienced engineers get wrong: utilisation in USE is fraction of time the resource was performing work over the measurement window, not fraction of capacity used. For a single-threaded resource (a single CPU core, a single disk's queue head), the two coincide — at 100% utilisation, the resource was busy 100% of the time and is 100% loaded. For a multi-threaded resource (an 8-core CPU, a multi-channel SSD), the two diverge: the 8-core CPU at "50% utilisation" might be either "all 8 cores busy 50% of the time" or "4 cores fully busy and 4 idle" or some intermediate distribution. The aggregated number obscures the per-thread story. Modern USE practice on multi-threaded resources tracks utilisation per thread (per-core CPU utilisation, per-channel SSD utilisation) rather than aggregated, because the aggregated number can be 50% while one core is at 100% and the others are idle — a load-imbalance scenario that the aggregated utilisation hides. The same applies to thread pools, connection pools, and any pooled resource: utilisation tracked per-pool-member is honest; utilisation tracked as pool-busy-time / (capacity * elapsed) is an average that can hide imbalance.
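
The per-core version is a one-liner against node_exporter's default node_cpu_seconds_total; the second expression is the aggregated average that can read 50% while one core sits at 100%.

# per-core busy fraction: one series per core, so imbalance stays visible
1 - rate(node_cpu_seconds_total{mode="idle"}[1m])
# aggregated busy fraction: the average that hides a single hot core
1 - avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1m]))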

The Razorpay UPI pattern — when both methods converge on the same constrained resource

A pattern observed at multiple Indian fintech SRE teams: a single underlying constrained resource produces signals on both the USE side and the RED side, and the two signals lead to the same diagnosis through different paths. Consider the Razorpay-pattern payments-router from the four-golden-signals chapter, where the constrained resource is the per-gateway connection pool to NPCI/ICICI/HDFC/SBI. When the SBI gateway slows down, two things happen simultaneously: (1) the USE saturation panel for the SBI connection pool rises from 30% used to 100% used as in-flight requests pile up waiting for free pool slots, and (2) the RED duration panel for the /v1/charge endpoint rises from p99=180ms to p99=2400ms because requests are blocking on connection-pool acquisition. Both panels go red within 30 seconds of each other. A team with USE-only sees the connection-pool saturation but not the user-impact magnitude; a team with RED-only sees the user-impact but does not know which gateway is the cause. A team with both at the floor reads the two panels side-by-side, draws the obvious causal arrow from "SBI pool saturated" to "p99 elevated", and routes load away from SBI within 90 seconds. The redundancy is not waste — the two methods are checking the same root cause through different observation positions, and the convergence of the two panels is itself a diagnostic signal that increases confidence in the hypothesis. A single-method dashboard would still catch the outage; a both-methods dashboard catches it and provides the cause-effect linkage in one glance. Why convergent dashboards reduce mean-time-to-mitigate even when single-method dashboards catch the outage: the on-call engineer's bottleneck is rarely detection — modern alerting catches outages within 60-90 seconds — but attribution. The 14 minutes between page-fired and load-rerouted at most teams is spent localising which gateway, which downstream, which code path is the cause. A both-methods dashboard collapses that attribution time because the resource panel and the request panel together form a near-direct pointer at the cause, where either panel alone leaves the engineer needing to query a second tool. Reducing attribution time is where the operational ROI of the dual-floor discipline shows up in MTTR numbers.

The Tail at Scale — why RED's duration is genuinely hard to do well

Jeff Dean and Luiz Barroso's 2013 CACM paper "The Tail at Scale" is the foundational text for understanding why RED's "D" — duration — is the hardest of the three letters to instrument honestly. The paper's core claim is that tail latency at the leaf service compounds catastrophically when a request fans out to many leaf services in parallel: if a request hits 100 leaves and waits for the slowest, each leaf has a 1% chance of running at or beyond its own p99, so the probability that no leaf does is 0.99^100 ≈ 37%. Roughly 63% of requests are therefore held up by at least one leaf's worst 1%, and the leaf-level p99 effectively becomes the end-to-end median. The implication for RED dashboards is that per-leaf duration panels look fine while end-to-end duration looks broken; reading the per-leaf panels and concluding "we are healthy" is the wrong-floor failure mode applied within RED itself. The paper's recommended mitigation is hedged requests: send the same request to a second leaf if the first has not responded within, say, the 95th-percentile expected latency, and take whichever returns first. Done this way the tail tightens dramatically for a few percent of extra leaf load (the naive always-send-to-both variant pays the full 2x). RED dashboards in fan-out-heavy services therefore typically have two duration panels — per-leaf p99 and end-to-end p99 — because the two are different numbers and the gap between them is the diagnostic signal. A companion hazard, coordinated omission (Gil Tene's term for load generators that pause when the system is slow and so under-sample exactly the slow requests the dashboard most needs to see), corrupts duration measurements at the same scale where the tail starts to matter.
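
The fan-out arithmetic is worth checking by hand; the loop below just evaluates 1 - 0.99^n for a few fan-out widths.

# tail_at_scale.py: chance that a fan-out request waits on at least one leaf at or beyond its p99
for n in (1, 10, 100, 1000):
    print(f"fan-out {n:>4}: {1 - 0.99 ** n:.0%} of requests hit at least one leaf's worst 1%")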

Where this leads next

The decision rule from this chapter — pick the floor by service shape — flows directly into the next chapter on dashboard hierarchy (/wiki/wall-dashboards-are-where-observability-touches-leadership), where the question is who reads the dashboard and what do they need to see in 15 seconds. USE-vs-RED is a per-service decision; dashboard hierarchy is a per-audience decision; the two compose into the full dashboarding discipline. Chapter 78 covers panel arithmetic — the discipline of pre-computing comparisons (burn rate, headroom-remaining, ratio-vs-baseline) so the floor dashboard does not require the reader to do mental math.

Part 13 (/wiki/wall-numbers-mean-nothing-without-targets) onwards converts the chosen floor signals — RED's duration becomes a latency SLO, RED's errors become an availability SLO, USE's saturation becomes a capacity-planning SLO — into formal contracts with error budgets and burn-rate alerts. The choice of floor (USE, RED, or both) is also the choice of which signals get SLO-fied, which is what makes this chapter a foundation for the SLO-and-alerting parts that follow.

Part 14 covers alerting hygiene; the four-class alert taxonomy from the four-golden-signals article composes naturally with the per-service-shape USE-vs-RED choice — services with RED at the floor get alerts on the RED four (rate-anomaly, error-burn, p99-SLO, traffic-drop), services with USE at the floor get alerts on the USE three (utilisation-headroom, saturation-threshold, error-rate-on-resource), and dual-floor services like databases get both alert sets. The alert hygiene chapter assumes you have already made the floor decision; without it, the alert taxonomy has no anchor.
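
A hedged sketch of how a few of those alert conditions map onto the demo service's metrics; the thresholds, windows, and 16-worker capacity are illustrative, and the rate-anomaly class is omitted because it needs a baseline model rather than a single expression.

# RED-floor conditions
sum(rate(charge_requests_total{status="500"}[5m]))
  / sum(rate(charge_requests_total[5m])) > 0.02                            # error-burn: >2% failing
histogram_quantile(0.99,
  sum(rate(charge_request_duration_seconds_bucket[5m])) by (le)) > 0.25    # p99-SLO breach at 250 ms
sum(rate(charge_requests_total[5m]))
  < 0.5 * sum(rate(charge_requests_total[5m] offset 1w))                   # traffic-drop vs last week

# USE-floor conditions
rate(worker_busy_seconds_total[5m]) / 16 > 0.85                            # utilisation-headroom
worker_queue_depth > 4                                                     # saturation-threshold
sum(rate(worker_exceptions_total[5m])) > 1                                 # error-rate-on-resource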

Finally, the cross-curriculum thread: the four-golden-signals chapter (/wiki/the-four-golden-signals) is the synthesis of USE and RED for user-facing services; this chapter is the taxonomy that lets you choose between them when the four-golden-signals frame does not naturally apply (kernel-bound services, batch jobs, streaming consumers). Reading the two chapters in either order is fine; reading both is necessary because the four-signals frame inherits its structure from this one.

A worth-noting counterpoint that the curriculum will return to in Part 13: Charity Majors and the Honeycomb team have argued that both USE and RED are prefab dashboards optimised for the world before high-cardinality observability was tractable. The argument is that when you can ask arbitrary questions of high-cardinality event data — group by user-ID, group by request-attribute, slice by feature-flag — you no longer need a fixed three-or-five-number summary because you can compose the right summary on the fly. The critique has weight in the cases where high-cardinality observability is in place (Honeycomb, ClickHouse, OpenObserve) and where the team is fluent in writing ad-hoc queries during incidents. It has less weight in the cases where the dashboard's job is to be readable in 10 seconds by a sleep-deprived on-call engineer at 03:00 — that engineer is not going to compose a high-cardinality query under pressure; they need pre-built panels they have already learned to read. The right synthesis: USE/RED dashboards are the floor for fast on-call diagnosis, and high-cardinality query consoles are the drill-down when the floor is insufficient. Both are useful; neither replaces the other. Sleep-deprived humans at 03:00 still need a fixed-shape dashboard at the top of the stack, regardless of how rich the underlying event data is — and the choice between USE-floor, RED-floor, and dual-floor is exactly the per-service-shape decision this chapter has been about.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install flask prometheus-client requests
python3 use_vs_red_demo.py &
# In another terminal install wrk2 (https://github.com/giltene/wrk2, usually built from source)
wrk2 -t4 -c64 -R250 -d60s -s post.lua http://localhost:8000/charge
curl -s http://localhost:8001/metrics | grep -E '^(charge_|worker_)'

References