The USE method: utilization, saturation, errors
Aditi at Razorpay was paged at 02:14 IST. The payments API p99 had crossed 800 ms — four times the SLO. The CPU dashboard read 38% across the fleet. The memory dashboard read 51%. Disk and network looked normal. Two teammates had already spent forty minutes scrolling through Grafana, opening Jaeger traces, and arguing about whether to scale out. Aditi opened a single terminal, typed seven commands in two minutes, and announced the cause: the NVMe disk on the primary write replica was at saturation, not utilization — its queue depth was 38, its errors counter was incrementing, and the median request was waiting 600 ms in the device queue behind earlier requests for the same disk. The fix was a two-line config change. The dashboard had been telling the truth and pointing at the wrong place.
The USE method is a three-question audit you run on every resource: what fraction of time is it busy (utilization), how much work is queued (saturation), and how many errors is it returning (errors). High utilization with low saturation is fine. High saturation at any utilization is the bottleneck. Errors at any level invalidate the other two. Most production performance bugs are a saturation problem reported as a utilization mystery.
Why utilization alone is misleading
Utilization is the metric every dashboard shows because it is cheap to compute and easy to draw: count the time a resource was busy, divide by elapsed time, render a 0–100% bar. The trouble is that "100% utilization" means very different things on different resources, and "40% utilization" can hide a queue 200 deep behind it.
A CPU at 100% utilization for a single thread that runs a tight loop is healthy — it is doing exactly what you asked and there is nothing waiting. A disk at 100% utilization with iostat's %util column is not the same: on a multi-queue NVMe device, %util saturates at any non-trivial parallel load because the kernel reports "device had at least one outstanding I/O" as 100%, while the actual device throughput might be 12% of its rated maximum. The same number means "fully busy with no queue" on one resource and "any queue at all" on another. A methodology that only looks at utilization will over-trigger on the second case and miss the first.
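The gap between %util and real throughput on a multi-queue device is visible in back-of-envelope arithmetic. The figures below are illustrative, assuming a hypothetical NVMe with 8 independent channels, not measurements from any real device:

```python
# Hypothetical NVMe: 8 independent channels, each able to complete
# 1000 IOPS, so the device's rated maximum is 8000 IOPS.
channels = 8
rated_iops = channels * 1000

# A single-threaded writer keeps exactly one I/O outstanding at all
# times. The kernel sees "at least one I/O in flight" for the whole
# sampling interval, so iostat reports %util = 100.
reported_util_pct = 100.0

# But only one of the 8 channels is ever busy:
achieved_iops = 1000
throughput_fraction = achieved_iops / rated_iops  # 0.125, i.e. 12.5% of rated max

print(reported_util_pct, round(100 * throughput_fraction, 1))
```

The 12.5% here is where the "actual device throughput might be 12% of its rated maximum" figure in the text comes from: the %util metric cannot distinguish "one channel busy" from "all channels busy".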
The deeper failure of utilization-as-only-signal is that it tells you nothing about the waiting you actually feel as latency. A web server that runs at 40% CPU with a 12-deep run queue is the source of every slow response in the system; its CPU dashboard is green and its p99 is on fire. Brendan Gregg formalised this gap in 2012 with a checklist short enough to memorise and complete enough to find every common bottleneck — three questions per resource, applied uniformly across CPU, memory, disk, network, and any other queueable resource in the box. The checklist is Utilization, Saturation, Errors — the USE method.
Why three questions and not one: utilization measures how much of the resource you are spending; saturation measures how much demand exceeds the resource's capacity to serve; errors measure correctness independent of either. A resource can be 100% utilized with zero saturation (perfectly busy, nothing waiting) — that is healthy. A resource can be 30% utilized with high saturation — that is a bottleneck where the resource is artificially limited (kernel locks, single-queue device, head-of-line blocking). Errors signal that the previous two readings were taken on a system that is silently failing — a 70% utilization on a NIC that is dropping 4% of packets is a cancelled measurement, not a healthy state.
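The three rules in this paragraph can be written as a tiny decision function. This is a sketch; the queue threshold is chosen for illustration, not taken from any standard:

```python
def use_verdict(util_pct, queued, errors):
    """Classify one resource's U/S/E reading per the rules above.

    Priority order:
      1. Any errors invalidate U and S; investigate the errors first.
      2. Any sustained queue is a bottleneck, regardless of utilization.
      3. High utilization with nothing waiting is healthy busyness.
    Note util_pct never decides the verdict on its own.
    """
    if errors > 0:
        return "errors: U and S readings are untrustworthy"
    if queued > 1:  # more than the one item currently being served
        return "bottleneck: demand exceeds capacity to serve"
    return "healthy"

# 100% utilized, nothing waiting: a tight loop doing its job.
assert use_verdict(100.0, queued=1, errors=0) == "healthy"
# 30% utilized but a deep queue: the artificially limited resource.
assert use_verdict(30.0, queued=38, errors=0).startswith("bottleneck")
# 70% utilized NIC dropping packets: a cancelled measurement.
assert use_verdict(70.0, queued=0, errors=4).startswith("errors")
```

The deliberate oddity, that `util_pct` is accepted but never consulted, is the point of the paragraph: utilization alone never separates the healthy case from the bottleneck case.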
A working USE audit on a Linux box
The method becomes useful when you have the exact commands committed to muscle memory. The list below is the audit Aditi runs at 02:14 — seven commands, one per resource, each yielding U / S / E directly.
```python
# use_audit.py — minimal USE-method audit for a Linux box.
# Reads /proc and runs short subprocess samples; prints a USE table
# the operator can read in 30 seconds at 02:14 IST.
# Run: python3 use_audit.py
import json
import subprocess
import time
from pathlib import Path

def run(cmd, t=5.0):
    return subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=t).stdout

def cpu_use():
    # Utilization: 100 - idle fraction across all CPUs (mpstat 1 1).
    out = run("mpstat 1 1 | tail -n 1")
    idle = float(out.split()[-1])
    util = round(100.0 - idle, 1)
    # Saturation: run-queue length (r) and processes blocked on I/O (b), from vmstat.
    vm = run("vmstat 1 2 | tail -n 1").split()
    runq, blocked = int(vm[0]), int(vm[1])
    # Errors: recent CPU throttling messages (thermal or power capping).
    thr = run("dmesg --since '5 min ago' | grep -ci 'cpu.*throttl'").strip() or "0"
    return {"U_pct": util, "S_runq": runq, "S_blocked": blocked, "E": int(thr)}

def mem_use():
    mi = {l.split(":")[0]: int(l.split()[1]) for l in
          Path("/proc/meminfo").read_text().splitlines()[:20]}
    util = round(100.0 * (mi["MemTotal"] - mi["MemAvailable"]) / mi["MemTotal"], 1)
    vm = run("vmstat 1 2 | tail -n 1").split()
    si, so = int(vm[6]), int(vm[7])  # swap-in, swap-out KB/s
    pgmajfault = int(run("grep '^pgmajfault' /proc/vmstat").split()[1])
    oom = int(run("dmesg --since '1 hour ago' | grep -ci 'killed process'").strip() or "0")
    return {"U_pct": util, "S_swap_io_kbps": si + so,
            "S_pgmajfault_total": pgmajfault, "E_oom_kills": oom}

def disk_use(dev="nvme0n1"):
    out = run(f"iostat -xy 1 1 {dev} | tail -n 2 | head -n 1").split()
    if len(out) < 14:
        return {"U_pct": None, "S_aqsz": None, "E_media_errors": None}
    util = float(out[-1])   # %util column (last)
    aqsz = float(out[-3])   # aqu-sz column (position varies by sysstat version)
    err = int(run(f"smartctl -A /dev/{dev} 2>/dev/null | "
                  f"awk '/Media_and_Data_Integrity_Errors/{{print $NF}}'") or "0")
    return {"U_pct": util, "S_aqsz": aqsz, "E_media_errors": err}

def net_use(iface="eth0"):
    line = run(f"sar -n DEV 1 1 | awk '/{iface}/' | tail -n 1").split()
    rx_kbps = float(line[4]) if len(line) > 4 else 0.0
    tx_kbps = float(line[5]) if len(line) > 5 else 0.0
    drops_line = run(f"sar -n EDEV 1 1 | awk '/{iface}/' | tail -n 1").split()
    rx_drop = float(drops_line[3]) if len(drops_line) > 3 else 0.0
    # Saturation: connections queued on listen sockets (Recv-Q is column 2 with -H).
    backlog = int(run("ss -lntH | awk '{s+=$2} END{print s+0}'") or "0")
    return {"U_kbps_rx": rx_kbps, "U_kbps_tx": tx_kbps,
            "S_listen_backlog": backlog, "E_rx_drops_per_s": rx_drop}

def fd_use():
    # /proc/sys/fs/file-nr: allocated, free (allocated but unused), max.
    alloc, free, max_fd = (int(x) for x in
                           Path("/proc/sys/fs/file-nr").read_text().split())
    used = alloc - free
    return {"U_pct": round(100.0 * used / max_fd, 1),
            "S_alloc_failures": int(run("dmesg --since '1 hour ago' | "
                                        "grep -ci 'too many open files'") or "0"),
            "E": 0}

def main():
    audit = {"ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
             "cpu": cpu_use(),
             "mem": mem_use(),
             "disk": disk_use(),
             "net": net_use(),
             "fd": fd_use()}
    print(json.dumps(audit, indent=2))

if __name__ == "__main__":
    main()
```
Sample run during Aditi's incident on the Razorpay payments-write replica (r6i.4xlarge, NVMe-backed Postgres primary):
```console
$ python3 use_audit.py
{
  "ts": "2026-04-23T02:16:42",
  "cpu": {"U_pct": 38.4, "S_runq": 1, "S_blocked": 12, "E": 0},
  "mem": {"U_pct": 51.2, "S_swap_io_kbps": 0, "S_pgmajfault_total": 41, "E_oom_kills": 0},
  "disk": {"U_pct": 99.8, "S_aqsz": 38.4, "E_media_errors": 7},
  "net": {"U_kbps_rx": 18420.1, "U_kbps_tx": 9210.3,
          "S_listen_backlog": 0, "E_rx_drops_per_s": 0.0},
  "fd": {"U_pct": 6.1, "S_alloc_failures": 0, "E": 0}
}
```
Reading the audit:

- CPU at 38% is the green dashboard everyone was looking at. But `S_blocked = 12` means twelve processes are waiting on I/O, which is the first hint that the bottleneck is not on the CPU. Why `S_blocked` matters more than `U_pct` here: a process in the `D` state (uninterruptible sleep, waiting on I/O) does not count toward CPU utilization but is unable to make progress; twelve of them sitting blocked is a strong signal of a downstream resource that is saturated. The CPU dashboard cannot see this because the CPU is genuinely idle — it is waiting too.
- Disk `U_pct = 99.8` and `S_aqsz = 38.4` is the smoking gun. The disk is busy essentially all the time, and the average queue depth is 38 — every new request joins a queue with 38 already in it. Why a queue of 38 implies ~600 ms of waiting: at the NVMe device's offered service rate of ~60 IOPS for this random-write workload (after accounting for the device's small-block-write performance cliff), 38 queued operations represent 38 / 60 ≈ 633 ms of pure queueing delay before a request even starts to be served. That number matches the p99 of 800 ms the API was reporting almost exactly, with the remaining ~170 ms being CPU + network round-trip.
- Disk `E_media_errors = 7` is the second smoking gun. The NVMe is reporting media-integrity errors, which means the device is degrading and the firmware is performing internal retries — each retry stalling the queue further. Errors invalidate the utilization reading entirely; even at 30% utilization, a device with media errors is an immediate replace-and-investigate.
- Memory `S_pgmajfault_total = 41` in the last sample is benign — major page faults at this rate are noise. Network and FDs are healthy — no backlog, no drops, no allocation failures.
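The ~600 ms figure in the disk row is plain arithmetic: queued operations divided by the service rate. A sketch using the incident's numbers (queue 38 deep, ~60 IOPS under the degraded random-write workload):

```python
queue_depth = 38          # aqu-sz from the audit sample
service_rate_iops = 60    # degraded random-write rate during the incident

# Each new arrival waits for everything ahead of it to be served first,
# so the queueing delay is depth / rate (in seconds), before the
# request's own service time even begins.
queue_delay_ms = 1000.0 * queue_depth / service_rate_iops
print(round(queue_delay_ms))  # 633 ms of waiting before service begins
```

The same one-liner works for any queued resource: run-queue length over per-core scheduling rate, listen backlog over accept rate, threadpool queue over task completion rate.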
The fix Aditi applied was to fail over the Postgres write replica to its standby (a one-line `pg_ctl promote`), drain the original primary, and replace the NVMe device. p99 dropped from 800 ms to 110 ms within four minutes of the failover. The whole audit took 90 seconds to interpret because it had pre-computed answers to the three questions on every resource — there was no scrolling through dashboards looking for the smoking gun.
What this script does not do, and what production deployments must add: it samples a single second, which is too short for resources where utilization fluctuates rapidly (sample for 60 s if you can afford the latency); it does not break out per-CPU and per-disk-device numbers (essential when the box has 64 cores or 8 NVMes — one bad disk can hide in the average); it assumes Linux + NVMe (translate iostat, mpstat flags for your distribution). The skeleton is the contract: U, S, E, per resource, in JSON, in under three seconds.
The USE checklist for a typical Indian fintech box
Auditing CPU, memory, disk, network, and FDs is the start. The full USE checklist on a production Linux box covers more resources than most engineers have audited in their entire career, because most monitoring stacks ship with U-only dashboards for a tiny subset of them. The table below is the exhaustive list for a typical Razorpay / Zerodha / Flipkart x86 server, with the U / S / E source for each resource and the most common bottleneck mode it hides.
| Resource | U source | S source | E source | Common hidden bottleneck |
|---|---|---|---|---|
| CPU (per-core) | `mpstat -P ALL 1` | `vmstat` `r` column, `runqlat` (BPF) | dmesg thermal throttle | One core 100%, fleet avg 40% — single-thread tail |
| Memory capacity | `/proc/meminfo` MemAvailable | `vmstat` swap I/O, pgmajfault | OOM killer in dmesg | Slow OOM via swap thrash before oom-killer triggers |
| Memory bandwidth | `pcm-memory` or `perf stat -e LLC-loads,LLC-load-misses` | sustained > 75% of peak DDR | ECC errors in `edac-util` | NumPy / BLAS workload starved by DRAM bandwidth ceiling |
| NVMe / disk | `iostat -xy` %util | `iostat` aqu-sz, await ms | `smartctl` Media_Errors, dmesg block I/O errors | Saturation at 40% utilization on multi-queue device |
| Network bandwidth | `sar -n DEV` | `sar -n EDEV` rxnocp, `ss -tin` cwnd | `ifconfig` errors, `ethtool -S` rx_drops | NIC ring-buffer overrun — bursts dropped at line rate |
| TCP listen backlog | n/a (capacity) | `ss -lnt` recv-q | `nstat` ListenOverflows | Accept-queue overflow silently drops connections under burst |
| TCP conntrack table | `conntrack -C` | n/a (binary fill) | `nstat` conntrack drop counters | Table fills at peak; every new SYN silently dropped |
| File descriptor table | `/proc/sys/fs/file-nr` | per-process RLIMIT_NOFILE headroom | dmesg EMFILE / too many open files | Service throttles silently when RLIMIT_NOFILE is hit before host max |
| Interrupts / softirqs | `mpstat -I SUM` | `cat /proc/softirqs` per-CPU max | `ethtool -S` rx_no_buffer | One CPU pinned at 100% in softirq, other cores idle |
| Threadpool (app) | active threads | queued tasks | rejected_executions count | Java ExecutorService with bounded queue silently rejecting at peak |
| GC heap (JVM/Go) | heap_used / heap_max | GC pause duration | OOMError, runtime panic | p99 latency ≡ GC pause; CPU and disk dashboards look fine |
The pattern: almost every column has a saturation source that is harder to find than the utilization source, and almost every "performance mystery" lives in the saturation column of one of these resources. Aditi's incident lived in the disk row's saturation cell. A typical Hotstar IPL incident lives in the network bandwidth row's saturation cell (NIC ring-buffer drops at peak). A Zerodha market-open incident lives in the conntrack row's saturation cell (table fills at 09:14:55 IST, every new SYN is silently dropped at 09:15:00).
The reason saturation spikes near 100% utilization while utilization itself stays linear is queueing theory: response time at offered load ρ on an M/M/1 server scales as 1 / (1 − ρ), which is finite at ρ = 0.5 (response time = 2× service time), painful at ρ = 0.85 (≈ 6.7×), and unbounded as ρ → 1. The USE method's saturation column is precisely the queue-depth signal that lets you see the knee approaching before utilization tells you anything is wrong. This is why "CPU at 80%" can be a fire and "CPU at 100%" with a single-process tight loop can be perfectly fine — utilization alone does not separate the two cases; saturation does.
In practice this is why the SRE rule of thumb across Razorpay, Hotstar, and Zerodha is to alert on saturation crossing a threshold, not on utilization. A typical alert: "CPU run-queue length > 4× core count for 60 seconds" or "disk aqu-sz > 16 for 30 seconds" or "TCP listen recv-q > 0 for any sample". These thresholds map onto the queueing-theory knee, not onto a percentage. They fire before users notice and they do not false-positive on healthy single-thread CPU pegs. Teams that alert on utilization instead spend half their on-call budget chasing 95% CPU spikes that were a healthy batch job, while sleeping through 40% CPU incidents that were the real fire — the two failure modes that USE-with-saturation prevents at once.
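Those rules of thumb translate directly into a check over the audit JSON. A sketch, with the thresholds taken from the alerts quoted above and field names matching `use_audit.py`'s output; the sustained-duration condition ("for 60 seconds") is omitted here and would be handled by the alerting system's evaluation window:

```python
def saturation_alerts(audit, core_count):
    """Return the saturation alerts that would fire for one audit sample."""
    alerts = []
    # "CPU run-queue length > 4x core count"
    if audit["cpu"]["S_runq"] > 4 * core_count:
        alerts.append("cpu: run-queue > 4x core count")
    # "disk aqu-sz > 16"
    aqsz = audit["disk"]["S_aqsz"]
    if aqsz is not None and aqsz > 16:
        alerts.append("disk: aqu-sz > 16")
    # "TCP listen recv-q > 0 for any sample"
    if audit["net"]["S_listen_backlog"] > 0:
        alerts.append("net: listen recv-q > 0")
    return alerts

# Aditi's 02:16 sample on the 16-vCPU r6i.4xlarge: CPU looks idle,
# only the disk saturation rule fires.
sample = {"cpu": {"S_runq": 1}, "disk": {"S_aqsz": 38.4},
          "net": {"S_listen_backlog": 0}}
assert saturation_alerts(sample, core_count=16) == ["disk: aqu-sz > 16"]
```

Note what does not appear anywhere in the function: a utilization threshold. The 95%-CPU batch job never fires; the 40%-CPU disk incident does.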
A worked walk through the audit at the Zerodha 09:14 IST market open
The framework reads abstractly until you walk a second incident through it end-to-end, on a different stack, with a different culprit. Zerodha's order-matcher box at 09:14:55 IST on a typical equity-market open shows the audit moving from healthy to bottlenecked in fifteen seconds of real time, and the USE method is the only audit short enough to capture it before the operator has to make a decision.
At 09:14:00, sixty seconds before market open, the audit looks like the bottom of a hill: CPU U=8%, S_runq=0, mem U=42%, disk U=2%, net U_kbps_rx=2400, net E_drops=0, conntrack=18% full, FD=4%. Every column is green; the box looks idle. At 09:14:55 — five seconds before bell — the audit changes shape: CPU U=64%, S_runq=11; net U_kbps_rx=890000 (≈7.1 Gbps, climbing toward the 10 Gbps link's ~9.4 Gbps usable throughput); conntrack=72% full and rising at 4000/sec; FD=18%. Utilization is climbing on every resource at once, which is the expected pre-open shape. At 09:15:00.000 — the bell — the audit changes character: net E_rx_drops_per_s = 240; conntrack=89% full; conntrack insertion failures = 1240 in the last second. The first signal is in the E column for net and the S column for conntrack. By 09:15:01, retail orders that arrived in the last 800 ms are silently dropped, the user-facing app reports "broker connection failed", and the USE-aware operator has the answer: bump nf_conntrack_max and add a per-CPU conntrack hash bucket sweep to the kernel boot params before the next session.
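The operator's decision window is readable directly from the S column with one division: remaining capacity over fill rate. A sketch using the walk-through's figures and the `nf_conntrack_max = 262144` value quoted later in this chapter:

```python
def seconds_until_full(current_entries, max_entries, fill_rate_per_s):
    """Headroom before a capacity resource (conntrack, FD table) saturates."""
    if fill_rate_per_s <= 0:
        return float("inf")
    return (max_entries - current_entries) / fill_rate_per_s

# 09:14:55 — conntrack at 72% of nf_conntrack_max = 262144,
# filling at 4000 entries/sec:
headroom_s = seconds_until_full(int(262144 * 0.72), 262144, 4000)
print(round(headroom_s, 1))  # ~18 seconds before every new SYN is silently dropped
```

The same function applied to the FD table or an ephemeral-port range gives the same kind of countdown; any capacity resource with a fill rate has one, and it is the number the utilization percentage alone does not show.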
The audit is what made the answer reachable in 18 seconds instead of forty minutes. The default Grafana dashboard for that box rendered CPU, memory, network throughput, and disk IOPS — the four utilization columns most teams set up first. None of those four moved into red during the incident. Conntrack utilization, conntrack insertion-failure count, and net rx-drop count — the three USE columns the dashboard did not have — were the only signals that fired. The operator who knew the audit had answers; the operator who only had the dashboard would have rolled the previous day's release at 09:16, found nothing, escalated to the network team at 09:21, and the matcher would have been silently dropping orders for the entire morning session.
The audit took 18 seconds to interpret because every column was either fine, climbing-but-still-fine, or visibly broken — the framework forced the operator to look at conntrack at all, which the default Grafana dashboard for the matcher box did not show. The lesson the Zerodha SRE team took from this: the audit is run as a one-page paper checklist taped to the wall above the on-call desk, with one column per resource, U / S / E in three rows, and a yellow / red threshold against each cell. The page is updated when the team's understanding of the box changes, which happens roughly twice a year. Cheap, durable, and the antidote to the dashboard sprawl that makes a 02:14 IST incident take 40 minutes instead of 90 seconds.
When USE finds nothing — the limits of the method
USE is a checklist for resource-bound bottlenecks. It is excellent at finding "this disk / NIC / CPU is the limit". It is silent on three classes of problem that look like resource issues but are not:
1. Lock contention. Every CPU is at 38%, every disk is idle, but the application is glacial because two threads are fighting for the same mutex. USE will report all resources healthy. The signal you need is from perf lock, futex contention via bpftrace, or thread-state visualisation (off-CPU profiling — chapter 30 of this curriculum). Lock contention manifests as off-CPU time, which USE does not measure.
2. Synchronisation patterns (e.g. coordinated batching, head-of-line blocking). A web service whose worker pool is handling 10,000 fast requests fine but is occasionally blocked by one slow query that holds a worker for 4 seconds will show a healthy USE on every resource and a p99.9 of 4 seconds. The bottleneck is distributional, not resource-bound. The right tool is a flamegraph + per-request tracing (Zipkin / Jaeger), not USE.
3. Distributed / cross-host saturation. A microservice mesh where service A's p99 is bounded by service B's p99, which is bounded by service C's threadpool — every box passes USE, the system is sick. The right method is the inverse of USE — start at the user-facing latency and walk the dependency graph back, the RED method (Rate / Errors / Duration) per service, or distributed tracing.
USE composes with these other methods rather than replacing them. Brendan Gregg's pairing recommendation, which the rest of this curriculum follows: USE for resources, RED for services, off-CPU profiling for lock and sync issues, distributed tracing for cross-host. The mistake is to use USE as the only method; the other mistake is to drop it once you have flamegraphs.
The pragmatic order during a live incident: USE first because it is fastest (90 seconds, deterministic, narrows the search), then RED if USE is clean (per-service rate / errors / duration to find the slow endpoint), then off-CPU profiling if RED is clean (the lock or sync source), then distributed tracing if all three are clean (the cross-service waterfall). Skipping USE because "I already know it's a code issue" is the most common mistake — the operator who skips it finds out four hours later that conntrack was full all along, and the dashboards never said so. Why USE goes first even when the bug feels like code: USE is the only method that catches silent-saturation bugs (conntrack fill, NIC ring overrun, FD-table-full) where the application sees no error and no slowdown — the dropped requests never reach the application's instrumentation. RED, off-CPU, and tracing all measure things the application observes; USE measures the resources the application cannot see beyond the kernel boundary.
Common confusions
- "Utilization is the only number I need to look at." Utilization tells you whether the resource is doing work, not whether work is waiting. A disk at 40% with queue depth 30 is the bottleneck; a CPU at 100% with run-queue length 1 is healthy. USE's whole point is that utilization is one of three signals, not the signal.
- "Saturation is the same as utilization." Saturation measures queueing — how much demand exceeds the resource's ability to keep up. Utilization measures activity. A single-threaded tight loop on one core gives utilization 100% and saturation 0 (run-queue 1, the loop itself). A web server with 30 active requests on 8 cores gives utilization 80% and saturation 22 (run-queue 30, only 8 can run). The two numbers move independently and require different fixes: utilization-bound resources need horizontal or vertical scaling; saturation-bound resources need queue-management or concurrency-limit changes.
- "Errors are a separate concern from performance, and `%util` from `iostat` means the disk is full." Two failures of the same intuition. Errors invalidate every other reading on the same resource — a NIC reporting 4% rx_drops gives a utilization number that excludes the dropped packets, so the measurement understates the true offered load. And on multi-queue NVMe, `%util` saturates at any non-trivial concurrent load (it only reports "at least one I/O outstanding"); the device can serve many more I/Os in parallel. Use `aqu-sz`, or `iodepth` from `fio`, for the saturation signal on NVMe, and always check the errors counter before trusting U or S.
- "USE is for hardware; software resources don't have U/S/E." Software resources — file descriptors, thread pools, GC heap, conntrack table, ephemeral port range — have all three. FD utilization = `file-nr` / max; FD saturation = per-process `RLIMIT_NOFILE` headroom; FD errors = EMFILE in dmesg. The framework is more powerful when applied to software resources because those are the bottlenecks dashboards usually miss.
- "If USE says everything is healthy, the system is fine." USE finds resource-bound bottlenecks; it does not find lock contention, request-distribution problems, cross-service issues, or capacity-planning regressions on long horizons. A clean USE audit narrows the search space; it does not close it. The next move when USE is clean is off-CPU profiling and distributed tracing, not a celebration. The same audit, sampled over weeks rather than during a fire, is also the input to a Universal Scalability Law fit and the throughput-vs-latency curve that prevents the next incident.
- "My APM tool already does USE." Most APM tools (New Relic, Datadog, Dynatrace) ship CPU / memory / disk / network utilization out of the box, occasionally a saturation column, almost never an errors column for kernel-level resources. They read `/proc` not `/sys/fs/cgroup`, they aggregate per-host not per-CPU or per-disk, they sample at 60 s not 1 s, and they do not include conntrack, FD table, threadpool queue depth, or GC pause. The default APM dashboard is a partial USE on a subset of resources at a coarse interval. That is fine for a 90th-percentile-of-incidents view; it is not enough for the long-tail incidents where the bottleneck lives in the column the APM doesn't render. The discipline is to know which subset of USE your APM covers, and to fill the gap with a 100-line script you maintain yourself.
Going deeper
Why the M/M/1 response curve makes saturation the leading indicator
The reason saturation explodes at ρ ≈ 0.85 while utilization is still linear comes straight from M/M/1 queueing theory. For an M/M/1 server (Poisson arrivals, exponential service times, single server), expected response time R = 1 / (μ − λ) = (1/μ) / (1 − ρ), where ρ = λ/μ is utilization. At ρ = 0.5, R = 2× service time; at ρ = 0.85, R = 6.7×; at ρ = 0.95, R = 20×. The expected queue length L_q = ρ² / (1 − ρ) follows the same curve — flat at low load, vertical near 1. This is why "CPU 50%" is a different operating point from "CPU 90%" by an order of magnitude in queueing delay, even though linear-eyeball intuition says they are similar. The full derivation is in chapter 65 of this curriculum; the punchline for USE-method purposes is: saturation is the metric that captures the non-linearity, and USE's discipline of always reporting it is what makes the framework predictive rather than reactive.
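The curve takes two lines of code to reproduce. A sketch of the formulas above, normalising service time to 1:

```python
def mm1_response(rho):
    """Expected response time in units of service time: R = 1 / (1 - rho)."""
    assert 0.0 <= rho < 1.0, "M/M/1 is only stable for rho < 1"
    return 1.0 / (1.0 - rho)

def mm1_queue_length(rho):
    """Expected number waiting (excluding the job in service): Lq = rho^2 / (1 - rho)."""
    return rho * rho / (1.0 - rho)

for rho in (0.5, 0.85, 0.95):
    print(rho, round(mm1_response(rho), 1))
# rho = 0.5  -> R = 2.0x service time
# rho = 0.85 -> R ~ 6.7x
# rho = 0.95 -> R ~ 20x
```

The flat-then-vertical shape is why a saturation alert threshold can sit well below 100%: by the time the queue-length signal moves, the response-time knee is already in view, while the utilization axis still looks linear.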
Per-CPU and per-device USE — why aggregates lie
The mpstat -P ALL and iostat -x flags exist for a reason: aggregating across CPUs or disks can produce a healthy summary while one component is on fire. A 32-core box with one core pinned at 100% in softirq processing (a typical NIC misconfiguration where all RSS queues map to CPU 0) shows fleet-average CPU at 3.1%, even though one of the 32 cores is the bottleneck for the entire NIC's traffic. The same pattern applies to multi-NVMe boxes: average disk utilization at 25%, one disk at 99% with queue 50, the other three idle — the workload is bottlenecked on a single device and the average hides it. Always run USE per-component for any resource the box has more than one of: per-CPU, per-disk, per-NIC queue, per-NUMA-node, per-database-shard. The Razorpay incident in this article was on a single-disk replica, so per-device wasn't necessary; on a multi-NVMe Aurora box it would have been.
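The arithmetic of why the average hides the fire, for the 32-core softirq case described above:

```python
# One core pinned at 100% in softirq; the other 31 are idle.
per_core_util = [100.0] + [0.0] * 31

fleet_avg = sum(per_core_util) / len(per_core_util)
hottest = max(per_core_util)

print(fleet_avg, hottest)  # 3.125 vs 100.0
# The aggregate says "idle box"; the per-component view says "one core
# is the bottleneck for every packet the NIC receives". Keep the
# per-component max (or a high percentile across components) next to
# the average for any resource the box has more than one of.
```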
USE for software resources — the conntrack disaster pattern
The most under-audited resource in any Linux box is the conntrack table — the kernel's NAT/firewall connection-tracking table. Its U/S/E look like: U = conntrack -C / nf_conntrack_max; S = no direct metric (binary fill — once it's full, every new connection is silently dropped); E = nstat | grep conntrack (insertions failed). The disaster pattern: a Hotstar / Zerodha box configured with nf_conntrack_max = 262144 running fine at 100K connections, hits 250K during a traffic burst, the table reaches 95% full at 09:15:01, and every SYN packet from 09:15:02 onward is dropped with no visible error in any default dashboard. The CPU dashboard says 12%, the NIC says 60%, everything is "healthy" — and 40% of users cannot connect for the next four minutes until the table drains. The fix is nf_conntrack_max = 1048576 plus a hashsize bump. The audit habit: include conntrack -C in your USE script, and alert on > 75% utilization. Most production outages of the silent-drop class would never have happened if the team had a USE row for conntrack.
USE in containers, Kubernetes, and the silent-error column
Inside a container, /proc/cpuinfo shows the host's CPUs (not the cgroup limit), /proc/meminfo shows host memory (not the cgroup limit), and iostat shows host disks (not the per-volume IOPS limit imposed by the cloud). USE inside a container produces the wrong utilization denominators on every line. The fix is to read from the cgroup interface directly: /sys/fs/cgroup/cpu.stat for CPU time, /sys/fs/cgroup/memory.current and memory.max for memory, the cAdvisor-exposed throttle metrics for cgroup-imposed saturation. Kubernetes adds its own resource-bounds (CPU request, memory limit) on top of the cgroup view; the right denominator depends on whether the question is "is this container hitting its limit?" (use cgroup) or "is the host overloaded?" (use host metrics). Most production USE scripts running in containers report wrong utilization for at least one resource until they are rewritten against cgroup v2.
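A sketch of the denominator fix, assuming the cgroup v2 unified hierarchy paths named above; the parsing is kept pure so it can be exercised outside a container (`memory.max` contains the literal string `max` when the cgroup is unlimited):

```python
from pathlib import Path

def cgroup_mem_util_pct(current_raw: str, max_raw: str):
    """Memory utilization against the cgroup limit, not the host total.

    memory.max holds 'max' when no limit is set; in that case there is
    no cgroup denominator and the caller should fall back to the host's
    /proc/meminfo numbers instead.
    """
    max_raw = max_raw.strip()
    if max_raw == "max":
        return None
    return round(100.0 * int(current_raw) / int(max_raw), 1)

def read_cgroup_mem_util(base="/sys/fs/cgroup"):
    # Inside a v2 container this reads the container's own limit,
    # which is the denominator /proc/meminfo cannot give you.
    current = Path(base, "memory.current").read_text()
    maximum = Path(base, "memory.max").read_text()
    return cgroup_mem_util_pct(current, maximum)

# 512 MiB used of a 1 GiB cgroup limit -> 50% against the cgroup,
# regardless of how much memory the host has.
assert cgroup_mem_util_pct("536870912", "1073741824\n") == 50.0
assert cgroup_mem_util_pct("1", "max\n") is None
```

The same split applies to CPU: `/sys/fs/cgroup/cpu.stat` exposes `throttled_usec`, which is a direct cgroup-level saturation counter with no host-metric equivalent.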
The E column has its own variant of the same problem: the gap between loud and silent error sources. Loud errors — EIO from a write, OOM kills in dmesg, ECC uncorrectable errors — are easy to count and almost always alerted on. Silent errors are the dangerous ones: a NIC ring-buffer overrun increments ethtool -S rx_no_buffer_count without producing a log line; a conntrack-table-full drop increments nstat ListenDrops with no syslog message; a TCP retransmit caused by a flaky middleware NAT increments nstat TcpRetransSegs and is invisible to the application. The right E source for each resource is the kernel counter that increments when the failure happens, not the application log line that may or may not exist. Brendan Gregg's USE-method wiki page lists these counters per-OS; the discipline is to check them in the audit script, not wait for the alert. The Zerodha Kite outage in October 2023 was a conntrack-fill-and-silently-drop incident; the USE audit would have flagged it; the application logs and Grafana dashboards did not because there was no log line to alert on.
The RED-USE-tracing rotation in a real incident-response loop
Aditi's incident took 90 seconds to diagnose because the audit was already a tool, not a method. The same incident at a less mature team takes an hour because the team rediscovers the audit each time. The discipline that compresses the timeline is to wire USE, RED, and tracing into a single muscle-memory rotation: at the page, RED on the user-facing service answers "is the bug uniform across endpoints, or one endpoint?"; USE on the host answers "is a resource saturated?"; if both are clean, off-CPU profiling answers "is a lock or sync the bottleneck?". The Razorpay SRE handbook formalises the rotation: RED → USE → off-CPU → distributed-trace, two minutes per step, thirty-second hard cutoffs that force the engineer to either find the signal or move to the next method. The handbook is one page. The result is an organisation where a P1 page goes from "let's open Grafana" to "USE shows disk saturation on replica 2" in the time it takes to load the dashboard. The methodology is the moat; the dashboards are interchangeable.
Reproduce this on your laptop
```shell
# Linux only; macOS lacks iostat/mpstat/sar in these forms — on WSL2,
# install the packages below first.
sudo apt install sysstat smartmontools conntrack fio
python3 -m venv .venv && source .venv/bin/activate
# (use_audit.py uses only stdlib — no pip install needed)
# Generate a synthetic load to see the audit move:
sudo fio --name=writeload --filename=/tmp/io.bin --rw=randwrite \
    --bs=4k --size=2G --iodepth=32 --numjobs=4 --time_based --runtime=120 &
python3 use_audit.py   # observe disk U_pct climb, S_aqsz climb
killall fio
```
Where this leads next
USE is the first methodology in Part 4 and the foundation for every audit method that follows. The chapters that build on it:
- /wiki/red-method-rate-errors-duration-for-services — the service-level counterpart to USE; what RED catches that USE cannot.
- /wiki/off-cpu-profiling-and-thread-state-analysis — the tool for the bottlenecks USE is silent on (lock contention, sync waits).
- /wiki/queueing-theory-mm1-mmc-and-the-knee-at-rho-0-85 — the math behind why saturation is the leading indicator.
- /wiki/the-tail-at-scale-and-coordinated-omission — why mean-utilization-based dashboards systematically under-report tail-latency incidents.
- /wiki/cgroup-v2-and-container-resource-accounting — running USE correctly inside Kubernetes pods.
- /wiki/conntrack-table-saturation-and-the-silent-syn-drop — the Zerodha-class incident pattern in detail.
- /wiki/iostat-versus-bpf-biolatency-for-disk-saturation — when `aqu-sz` lies and what to use instead.
- /wiki/per-cpu-softirq-saturation-on-multi-queue-nics — the one-CPU-pinned bottleneck the box-average USE hides.
The arc across Part 4: USE finds the resource bottleneck, RED finds the service bottleneck, off-CPU profiling finds the lock bottleneck, and distributed tracing finds the cross-host bottleneck. A senior SRE rotates between these four methods like a doctor rotating between stethoscope, blood-pressure cuff, ECG, and X-ray — each is the right tool for one class of question, and the discipline is knowing which question you are asking. Aditi's incident was a stethoscope question, which is why USE answered it in 90 seconds; an off-CPU lock-contention incident would have taken her another hour with USE alone.
The deeper habit Brendan Gregg's framework instils, and that this curriculum returns to in Parts 5, 6, 7, 8, and 14: the metric you don't have is the metric the bottleneck lives in. If your dashboard does not include saturation columns, your bottlenecks live there. If it does not include error columns, your degraded-mode incidents live there. The fix is not more dashboards — it is a methodology that asks the right questions even when the dashboard is silent.
The single sentence to take from this chapter, and to write down on the on-call notepad before the next page: for every resource on every box, ask three questions, and never stop at the first one.
References
- Brendan Gregg, "The USE Method" — the canonical description, with per-OS checklists for Linux, Solaris, macOS, and FreeBSD; mandatory reading.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2.5 — Methodologies — USE in the context of every other systems-performance methodology; the trade-offs section is essential.
- Tom Wilkie, "The RED Method" (Grafana Labs, 2018) — the service-level companion methodology; understanding both is what separates a junior from a senior SRE.
- Linux `iostat`, `mpstat`, `sar` man pages — the official source for column meanings; the `aqu-sz` definition in particular is worth reading carefully.
- Jeffrey Mogul, "Network Subsystem Design" (1995) — the original paper on multi-queue NIC saturation; explains why softirq-CPU pinning is the bottleneck dashboards miss.
- `bpftrace` reference for `runqlat`, `biolatency` — the modern way to read saturation directly from the kernel's run-queue and block-I/O queue.
- /wiki/the-methodology-problem-most-benchmarks-are-wrong — the previous chapter; methodology under controlled benchmark conditions is the dual of methodology under production firefighting.
- /wiki/wall-measuring-is-harder-than-optimizing — the wall this Part is built to climb; USE is the first concrete tool for climbing it.