The RED method: rate, errors, duration for services

Karan at Hotstar is paged at 21:48 IST during the IPL playoff between RCB and CSK. The catalogue API's p99 just crossed 1.6 s — the SLO is 800 ms. Every USE column on every host comes back green: CPU 34%, memory 41%, NICs at 22% utilisation, no NVMe saturation, conntrack 12% full. Twenty-three microservices sit between the user's tap on "watch live" and the manifest that the player downloads. Karan has eight minutes before the toss and twenty-five million viewers will hit the same path. He opens a single Grafana panel that is not a host dashboard at all — it is three numbers per service: the request rate, the error rate, and the duration percentile bands. Within ninety seconds he has narrowed the fire to one service: the recommendations sidecar is at 4× normal request rate, 2.1% error rate (up from 0.04%), and its p99 has gone from 40 ms to 1100 ms. The catalogue API is healthy; it is being held hostage by a downstream service that USE could never have found.

RED is a three-question audit you run on every service rather than every resource: how many requests per second is it serving (rate), what fraction are failing (errors), and how long do they take (duration distribution, not mean). USE finds the host bottleneck; RED finds the service bottleneck. The trap is that rate and errors lie under load, and duration's mean lies always — RED is only useful when you measure all three the way the next four sections will spell out.

Why USE is silent on the bug Karan is fighting

USE answers the question "is any resource on this host saturated, errored, or fully utilised?". When the bug lives in a single service that is talking slowly to a single downstream service, every box involved can be 30% utilised, 40% memory-used, with nothing queued and no errors — and the user-facing latency is on fire because of where the time is being spent inside the service mesh, not because any resource is the bottleneck.

The Hotstar fire is the canonical case. The catalogue API receives a request, fans out to seven services in parallel, waits for the slowest one before responding. If the recommendations sidecar's p99 jumps from 40 ms to 1100 ms because its backing Redis cluster is doing a key-eviction sweep, the catalogue API's p99 jumps to 1100 ms too — even though the catalogue API's own host is barely using its CPU. The recommendations service's host is also barely using its CPU, because the Redis cluster is the actual bottleneck and the recommendations service is just sitting in epoll_wait waiting for a reply. Three USE audits — catalogue, recommendations, Redis — each return "all green", because none of them is resource-saturated; the time is being spent in waiting, which USE does not measure.

A useful framing: USE audits the machine; RED audits the interface. A machine is healthy when its components are not exhausted; an interface is healthy when the work crossing it (requests in, responses out) is the right shape. When a request takes 1100 ms because of network round-trips and backpressure rather than because of CPU or disk burning, every machine in the path is healthy and every interface is degraded — the only way to see the failure is to instrument the interfaces. Service meshes (Linkerd, Istio, Consul Connect) emit RED metrics at every interface they sit on for exactly this reason: the mesh is the interface layer, and RED is the audit shape that interfaces require. Teams that adopt a mesh and then continue alerting only on host metrics have paid the cost of the mesh and not collected the benefit; the mesh is a RED instrument first and a routing layer second.

[Figure: Service mesh fan-out where USE is silent and RED fires. The catalogue API (CPU 34%) fans out to seven downstream services — user-svc 28 ms, geo-svc 31 ms, drm-svc 37 ms, reco-svc 1100 ms, ads-svc 24 ms, subs-svc 33 ms, cdn-svc 29 ms — with reco-svc backed by a Redis cluster mid key-evict sweep. Every host CPU sits near 30%. USE on every host: green. RED on reco-svc: rate 4×, errors 2.1%, p99 1100 ms.]
The catalogue API's p99 is the maximum of seven downstream p99s. One slow service drags the whole user-facing path. USE on every host returns green; RED on the recommendations service is the only signal that fires.

Why USE cannot see this: utilisation measures CPU-busy and disk-busy time, not request-wait time. A thread sitting in epoll_wait for 1100 ms waiting for a TCP reply contributes nothing to USE's utilisation column on any host. Saturation columns measure run-queue length, not request-queue length; the host's run-queue is empty because the thread is parked in the kernel. Errors columns count packet drops and OOM kills, not application-layer 500-responses. The bug is invisible to USE by construction — the bug lives in time spent waiting for a service, and USE measures resources.

The shape of the system that produces this gap — a fan-out request graph where each service has many downstream dependencies, every dependency has many of its own dependencies, and any one slow tail-call drags the entire user-facing latency — is the rule, not the exception, in 2026 Indian production. Razorpay's payment-init path touches eleven services. Zerodha's order-place path touches fourteen. Hotstar's manifest fetch touches twenty-three. Swiggy's order-place touches thirty-one. The probability that some service is degraded at any moment in a fleet of thirty services is high enough that catching it is a per-second monitoring problem, not a per-incident debugging problem. RED is the per-second monitoring layer; USE is the per-incident debugging layer; both are required.

The mathematics of fan-out makes this worse than it sounds. If a request fans out to 23 services in parallel and each independently has a p99 of 40 ms (so 1% of its requests take 40+ ms), the parent request's p99 is not 40 ms — the parent has to wait for the slowest of 23 children, so the probability that every child is faster than 40 ms is 0.99^23 ≈ 0.794. Roughly 21% of parent requests will hit at least one slow child. The parent's p99 is therefore set by the children's p99.95 or worse, depending on the fan-out width. A child whose p99 doubles barely moves the child's user-visible behaviour but ruins the parent's tail. This is exactly the dynamic Jeff Dean and Luiz Barroso documented in "The Tail at Scale" — and the reason RED's duration column must measure the tail, not the body, is that the tail is what propagates upward through fan-out. A monitoring stack that tracks mean(duration) per service in a fan-out architecture is structurally blind to the failure mode that produces 80% of user-visible latency incidents.
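The fan-out arithmetic is quick to verify. The sketch below uses the section's 23-way example and assumes the children's latencies are independent:

```python
def p_any_child_slow(fanout: int, p_child_slow: float) -> float:
    """Probability that at least one of `fanout` parallel children is in
    its own slowest tail (e.g. p_child_slow=0.01 means beyond its p99)."""
    return 1.0 - (1.0 - p_child_slow) ** fanout

# 23-way fan-out, each child independently slow 1% of the time:
p = p_any_child_slow(23, 0.01)   # ~0.206: roughly 21% of parents hit a slow child

# Which child percentile sets the parent's p99: the parent is fast only
# when every child is fast, so the per-child quantile q satisfies
# q**23 = 0.99, i.e. q = 0.99**(1/23), roughly the children's p99.96.
q = 0.99 ** (1 / 23)
```

The parent's tail is therefore governed by a child percentile the child's own dashboard probably does not even chart, which is the practical reason to record p99.9 per service.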

Rate, errors, duration — and the trap inside each one

Tom Wilkie at Weaveworks (later Grafana Labs) coined RED in 2015 as the service-level analogue of USE. The framing is the same — three questions per service — but each of the three has a measurement subtlety that the textbook formulation does not warn you about. Get any of the three wrong and the dashboard lies in a way that looks healthy.

Rate is requests per second to the service, broken out by endpoint. The trap is that "requests per second" without further qualification is ambiguous: requests attempted by the client, requests received by the server, requests completed, or requests successfully completed. The four numbers diverge under load. During the Karan incident, the recommendations service was receiving 4× normal rate and completing 1.4× normal rate — the gap was timeouts the client had given up on, requests the server was still chewing on, and 408-responses the server had cut short. A dashboard that showed "completed rate" looked almost healthy at +40%; a dashboard that showed "attempted rate" showed +300% and the answer was visible immediately. Always plot received rate at minimum, and ideally attempted rate from the client side too — the gap between them is the first sign of overload.

Errors is the fraction of requests that failed, where "failed" must be defined explicitly. HTTP 5xx is the obvious bucket; HTTP 4xx is the trap. A 4xx is a client error from the server's point of view (bad request, not found, unauthorised), but to RED's purpose what matters is which 4xx and whether the rate jumped. A baseline 0.5% rate of 404s is normal background noise. A jump from 0.04% to 2.1% in the 502-bucket — Karan's actual signal — is the recommendations service's HTTP/2 connections to Redis being killed by a connection-pool overflow on the client side. A jump in 401s during the same window would be an auth-service degradation, not a recommendations-service degradation. Bucket the error rate by status code, treat any class whose rate moves by more than 5× as a separate signal, and never average them all into one number — the average hides the bucket that fires.

A second rate-trap worth naming: outbound rate vs inbound rate for any service that fans out. The catalogue API receives 1500 RPS from its callers and emits 1500 × 23 = 34,500 RPS in downstream calls. A monitoring view that reports only the inbound 1500 understates the cost of one extra dependency in a fan-out path by 23×; teams reasoning about capacity from inbound rate alone routinely under-provision the downstream services that the fan-out actually loads. The fix is to instrument the outbound side of every service-to-service call (the gRPC client interceptor, the HTTP client middleware, the database query executor) with the same RED counters as the inbound side — so the rate column is inbound_rate and outbound_rate per (caller, callee) pair, not a single number. Most service mesh sidecars (Linkerd, Istio, Envoy) collect this for free; the discipline is to actually look at the outbound side when reasoning about capacity, not just the inbound side.

Duration is the distribution of response latency, not the mean. The single biggest mistake in service-level monitoring is to chart avg(response_time) and call it duration. The mean of a long-tailed distribution is dominated by the body, not the tail; a service whose p50 is 30 ms and p99 is 2.5 s has a mean of maybe 90 ms, and that 90 ms barely moves when the p99 climbs from 2.5 s to 5 s — because the tail is a small fraction of requests. Users feel the p99 (and the p99.9 for chatty clients that hit the service many times per page render); they do not feel the mean. The mean is the metric you chart when you want your dashboard to look healthy during an outage. Always plot the duration histogram — typically as a stack of percentile bands (p50, p90, p99, p99.9) — and alert on the high percentile, not the mean.

Why the mean is structurally misleading for duration: a long-tailed latency distribution has the property that the contribution of the slowest 1% of requests to the mean is roughly p99 / 100 — about 25 ms in the example above. A doubling of p99 (from 2.5 s to 5 s) shifts the mean by about 25 ms (from 90 ms to 115 ms), a 28% change that is well within day-to-day variation. The same doubling shifts p99 by 100% and is the exact event the user sees as "the app is broken right now". The mean and the p99 are different physical quantities; charting the wrong one is a category error, not a granularity choice.
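The arithmetic can be checked with a two-point model: 99% of requests in the body, 1% in the tail. A point-mass body gives a mean near 55 ms rather than the 90 ms a broader body would, but the 25 ms shift from a p99 doubling is the same either way:

```python
def mean_ms(body_ms: float, tail_ms: float, tail_frac: float = 0.01) -> float:
    """Mean of a two-point latency mixture: (1 - tail_frac) of requests at
    body_ms, tail_frac of requests at tail_ms."""
    return (1 - tail_frac) * body_ms + tail_frac * tail_ms

before = mean_ms(30, 2500)   # p99 = 2500 ms -> mean 54.7 ms
after  = mean_ms(30, 5000)   # p99 doubled to 5000 ms -> mean 79.7 ms
# The p99 moved 2500 ms; the mean moved 25 ms (= delta_p99 * tail_frac).
```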

A working RED collector for a Python web service

The framework becomes useful when the collection code is short enough to memorise and the output shape is what the dashboard expects. The script below is the RED collector Karan's team runs as a sidecar on every Python service at Hotstar. It uses Prometheus-style counters and a HdrHistogram for the duration distribution — the only correct primitive for measuring percentiles under high request rates.

# red_collector.py — RED-method exporter for a Python web service.
# Adds three signals to any FastAPI/Starlette (ASGI) app:
#   - rate:     completed requests per (endpoint, status_class); RPS comes
#               from PromQL rate() over the counter
#   - errors:   the 4xx_*/5xx_* status buckets of the same counter
#   - duration: HdrHistogram of latency in microseconds per endpoint
# Exposes /metrics in Prometheus exposition format.

import time, threading, collections
from hdrh.histogram import HdrHistogram
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse

# Per-endpoint state: counters + HdrHistogram (1 us .. 60 s, 3 sig figs).
_lock = threading.Lock()
_in_flight = 0                            # received but not yet completed (saturation)
_completed = collections.Counter()        # (endpoint, status_class) -> count
_hist      = collections.defaultdict(     # endpoint -> HdrHistogram
    lambda: HdrHistogram(1, 60_000_000, 3))

def _status_class(code: int) -> str:
    if 200 <= code < 300: return "2xx"
    if 300 <= code < 400: return "3xx"
    if 400 <= code < 500: return f"4xx_{code}"   # bucket each 4xx separately
    if 500 <= code < 600: return f"5xx_{code}"   # bucket each 5xx separately
    return "other"

def make_app() -> FastAPI:
    app = FastAPI()

    @app.middleware("http")
    async def red(req: Request, call_next):
        global _in_flight
        start_us = time.perf_counter_ns() // 1000
        with _lock:
            _in_flight += 1
        cls = "5xx_500"                  # default when call_next raises
        try:
            resp = await call_next(req)
            cls = _status_class(resp.status_code)
            return resp
        finally:
            # Resolve the endpoint *after* routing has run: scope["route"]
            # now holds the matched route, whose .path is the template
            # (/items/{item_id}), not the raw path. Raw paths like
            # /items/47 would explode label cardinality.
            route = req.scope.get("route")
            endpoint = route.path if route else req.url.path
            elapsed_us = (time.perf_counter_ns() // 1000) - start_us
            with _lock:
                _in_flight -= 1
                _completed[(endpoint, cls)] += 1
                _hist[endpoint].record_value(min(elapsed_us, 60_000_000))

    @app.get("/metrics", response_class=PlainTextResponse)
    def metrics():
        out = []
        with _lock:
            out.append(f"red_requests_in_flight {_in_flight}")
            for (ep, cls), n in _completed.items():
                out.append(f'red_requests_completed_total{{endpoint="{ep}",status="{cls}"}} {n}')
            for ep, h in _hist.items():
                for p, label in [(50, "p50"), (90, "p90"),
                                 (99, "p99"), (99.9, "p999")]:
                    v = h.get_value_at_percentile(p)
                    out.append(f'red_duration_us{{endpoint="{ep}",pctl="{label}"}} {v}')
        return "\n".join(out) + "\n"

    @app.get("/health")
    def health(): return {"ok": True}

    @app.get("/items/{item_id}")
    async def item(item_id: int):
        # Simulate variable backend latency: median 30ms, occasional 1.1s tail.
        import random, asyncio
        delay = 0.030 + (1.07 if random.random() < 0.005 else 0.0)
        await asyncio.sleep(delay)
        if random.random() < 0.002: raise Exception("backend timeout")
        return {"item_id": item_id, "title": "IPL playoff RCB vs CSK"}

    return app

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(make_app(), host="0.0.0.0", port=8080, log_level="warning")

A real run during a synthetic load test against this service (wrk2 -R 2000 -t 4 -c 200 -d 30s --latency http://127.0.0.1:8080/items/42) produces the following /metrics output ten seconds in:

$ curl -s http://127.0.0.1:8080/metrics
red_requests_completed_total{endpoint="/items/{item_id}",status="2xx"} 19847
red_requests_completed_total{endpoint="/items/{item_id}",status="5xx_500"} 41
red_requests_completed_total{endpoint="/health",status="2xx"} 213
red_duration_us{endpoint="/items/{item_id}",pctl="p50"} 30215
red_duration_us{endpoint="/items/{item_id}",pctl="p90"} 31104
red_duration_us{endpoint="/items/{item_id}",pctl="p99"} 1098239
red_duration_us{endpoint="/items/{item_id}",pctl="p999"} 1112447
red_duration_us{endpoint="/health",pctl="p50"} 187

Reading the output: the status split gives the error rate (41 of 19,888 requests to /items, roughly 0.2%, matching the simulated 0.002 failure probability); p50 ≈ 30 ms is the simulated median; p99 ≈ 1.1 s is the 0.5% slow path, widened past p99.5 because at a fixed 2000 RPS offered rate, requests queue behind stalled connections. The mean of this distribution would sit near 35 ms and would flag nothing.

The middleware is 25 lines and adds roughly 1.2 µs of overhead per request on a c6i.xlarge (measured by running with and without the middleware under a 50k-RPS load). That cost is two orders of magnitude smaller than any reasonable HTTP request, so the RED collector is cheap enough to run in production permanently — which is the point. Sampling RED data only during incidents misses the baseline, and without the baseline the alert thresholds are guessed rather than derived. Run RED at every percentile, on every endpoint, all the time, and the cost is a rounding error on the request budget.

The RED + USE rotation in a real fire

The Karan fire walks through the methodology end-to-end. The first action when paged is not to open a host dashboard; it is to open the per-service RED dashboard, which has a row per service with three columns (rate, error rate, p99 duration) and a sparkline for each. Services whose rate, errors, or p99 are statistically anomalous compared to their last 7-day baseline are flagged. The rotation is:

  1. Open the user-facing service first. The catalogue API's RED panel: rate is +12% (IPL traffic ramp, expected), error rate is +0.3% (slight increase, possibly cascade), p99 is at 1.6 s (red, the SLO is 800 ms). The fire is here. But this service has 23 dependencies; the cause is somewhere downstream.
  2. Walk the dependency graph backwards. Hotstar's RED dashboard groups services by call-graph depth from the catalogue. At depth 1, six services look normal. The seventh — recommendations — has rate +300%, error rate jumped from 0.04% to 2.1%, p99 at 1100 ms. Strong signal: this is the bottleneck, not just a victim.
  3. Stop and confirm before acting. The recommendations service is at +300% rate. Either traffic shifted to it, or its retry budget exploded, or its caller is timing out and re-issuing. The dashboard shows 4× attempts and 1.4× completions — the gap is timeouts being retried, which is the second signature of overload (the first being the rate-vs-completion gap itself). The cause is downstream of recommendations.
  4. Drop one level deeper. Recommendations talks to Redis and one ML feature-store service. Redis's RED panel: rate is normal, error rate is 4.7% in the 5xx_504 bucket — Redis is timing out recommendations' requests. The Redis cluster's USE panel: CPU 12%, memory 78%, but the redis_keys_evicted_total counter is climbing at 800k/sec. This is the actual bug — a misconfigured TTL on a new feature is making the cluster do an emergency LRU sweep, which blocks the main thread for 200 ms at a time, which times out recommendations, which retries, which 4×s the offered load, which makes the sweep worse. The fire is a feedback loop, and it started 11 minutes before Karan was paged.
  5. Apply the bounded fix. Karan disables the new feature flag (the one that broke the TTL config) at 21:51. The recommendations rate drops back to normal in 15 seconds. The catalogue API's p99 returns to 60 ms. The toss happens at 21:55. Root-cause analysis the next day produces a code change to the TTL setter and a Grafana alert on Redis evictions/sec.

The audit took six minutes. Without RED — only USE — the same audit would have looked at 23 host dashboards in sequence, found nothing remarkable on any of them (every host was at 30% utilisation, including the Redis hosts whose CPU was idle while the eviction sweep happened), and would have concluded "the network must be slow" and started rolling back releases. The toss would have started while half the user base had a broken player. RED's contribution is not that it gave more information than USE; it is that it organised the information by service, which is the unit of failure in a microservice system, rather than by host, which is the unit of resource provisioning.

[Figure: RED-then-USE rotation during the Hotstar fire, a five-step timeline over six minutes. 21:48 page; 21:48:30 RED on catalogue; 21:49 walk the dependency graph to reco-svc; 21:50 RED on reco-svc points at Redis; 21:51 USE on Redis shows evictions; 21:51:30 flag off. Below the timeline the user-facing p99 drops from 1600 ms to 60 ms. RED finds the service; USE confirms the resource; the fix is one flag.]
The RED-then-USE rotation. Each step takes roughly a minute. The same incident without RED would have spent an hour on host dashboards and missed the toss.

The lesson the Hotstar SRE team encodes into onboarding: RED before USE for distributed systems. USE is the right tool when one host is on fire (out of memory, NIC saturated, disk dying); RED is the right tool when the user is on fire and the cause is somewhere in a thirty-service mesh. Indian production microservice estates make the second case at least as common as the first, which is why every team at Razorpay, Zerodha, Hotstar, and Swiggy is migrating from USE-only dashboards to RED-first-then-USE dashboards. The migration is not technical (the data is the same Prometheus counters); it is organisational — RED panels are organised by service-and-endpoint instead of by host, which forces the team to know its own service graph cold, which is a prerequisite for being on call in a microservice estate at all.

What makes the rotation work in practice — and what most teams underestimate — is that the RED dashboard must be service-graph-aware, not just service-list-aware. The catalogue API has 23 dependencies; the recommendations service has 6 dependencies; Redis has 0. Walking the dependency graph backwards manually during a fire takes minutes the on-call does not have. The mature version of the dashboard renders the dependency graph as a node-link diagram, colours each node by its current RED status (green / yellow / red against its baseline), and lets the on-call click a red node to drop into its RED detail panel. The graph is generated nightly from distributed-tracing data — every parent → child call observed in the last 24 hours becomes an edge — so it stays current without anyone maintaining a config file. Razorpay calls this the "service map", Hotstar calls it the "dependency dashboard", Zerodha calls it the "topology view"; the substance is the same. Teams that have it can do the Karan rotation in 90 seconds; teams without it are clicking through a flat alphabetised list of 200 service names trying to remember which six are downstream of the catalogue.

Common confusions

Going deeper

The four golden signals — Google's superset of RED

Google's SRE book extends RED with a fourth signal — saturation — to produce the four golden signals (rate, errors, duration, saturation). Saturation here is service-level saturation: thread-pool depth, queue length, in-flight request count. The Karan fire showed why it matters: the recommendations service had 4× normal received rate but only 1.4× completed rate; the gap was in-flight requests stacking up in the service's request queue. Adding an in-flight request count (requests received but not yet completed, the quantity the collector above tracks) and alerting on it crossing a threshold catches the same overload one step earlier — before the duration distribution starts widening, you can already see the queue filling. Most production teams converge on RED + saturation = four signals, with saturation as a thread-pool / connection-pool / queue-depth gauge specific to the service's runtime. The choice of "RED" vs "four signals" is mostly a vocabulary preference; the substance is the same.
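One rough way to pick the in-flight threshold is Little's law: steady-state in-flight count is approximately arrival rate times mean time in system. A sketch using the chapter's numbers:

```python
def expected_in_flight(rate_rps: float, mean_s: float) -> float:
    """Little's law: L = lambda * W. Sustained in-flight counts far above
    this steady-state value indicate queue build-up, not throughput."""
    return rate_rps * mean_s

baseline = expected_in_flight(1500, 0.030)   # ~45 requests in flight, normal day
overload = expected_in_flight(6000, 1.1)     # ~6600 during the fire: the queue IS the signal
```

An alert at, say, 5× the baseline steady-state value fires while the duration percentiles are still only beginning to widen.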

Per-endpoint RED, not per-service RED

A service typically exposes 5–50 endpoints, each with its own latency profile. The catalogue API has a /manifest endpoint that is the IPL hot path (1500 RPS, p99 60 ms target) and an /admin/reindex endpoint that runs once an hour (0.0003 RPS, p99 30 s acceptable). Aggregating those into one "catalogue API duration" panel shows a bimodal histogram dominated by whichever endpoint had more requests in the window — almost always the hot one — and hides the slow one entirely. The discipline is to track RED per endpoint, alert per endpoint, and only aggregate when computing service-level objectives across endpoints with shared SLOs. Most observability tools support this trivially via the Prometheus label model (endpoint="..."); the cost is a slight increase in cardinality and a discipline of normalising endpoint paths (e.g. /items/{id} not /items/47) before they hit the metric. The Razorpay rule of thumb: any endpoint with >0.1 RPS sustained gets its own RED panel.
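Path normalisation before label emission is a small regex pass. A sketch; the patterns are illustrative, and a real service would generate them from its route table rather than maintain them by hand:

```python
import re

# Order matters: most specific pattern first. Illustrative rules only.
_RULES = [
    (re.compile(r"/items/\d+"), "/items/{id}"),
    (re.compile(r"/users/[0-9a-f-]{36}"), "/users/{uuid}"),
]

def normalise(path: str) -> str:
    """Collapse high-cardinality path segments into route templates
    before the path becomes a metric label."""
    for pat, repl in _RULES:
        path = pat.sub(repl, path)
    return path

# normalise("/items/47") -> "/items/{id}"; unmatched paths pass through.
```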

RED in non-HTTP contexts: gRPC, Kafka, database pools

The same three questions apply at every request-response boundary. For gRPC, the grpc.method and grpc.status_code labels replace the HTTP equivalents; OpenTelemetry's gRPC instrumentation gives RED out of the box. For Kafka request-response patterns (request topic → response topic), the rate is the request-topic produce rate, errors are dead-letter-queue messages plus timeouts, duration is the round-trip from request publish to response consume — measurable via correlation IDs in headers. For database connection pools (PgBouncer, HikariCP, Go's database/sql), rate is queries per second, errors are query failures (broken connections, query syntax errors, lock timeouts), duration is query execution time — the connection pool's own metrics expose all three. The reason RED generalises is that any boundary where work crosses from one component to another has the same three questions: how often does work cross? how often does it fail? how long does it take? Whether the boundary is HTTP, gRPC, Kafka, SQL, or even an in-process actor system, the audit shape is identical.
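The same wrapper shape works at the database boundary. A sketch against the standard library's sqlite3 so it runs anywhere; the helper name `red_query` is an assumption for illustration, not a library API:

```python
import sqlite3, time, collections

db_red = {"count": collections.Counter(),            # (verb, outcome) -> count
          "dur_us": collections.defaultdict(list)}   # verb -> [latency_us, ...]

def red_query(conn, sql, params=()):
    """The three RED questions at the query boundary: how often queries
    cross (rate), how often they fail (errors), how long they take
    (duration). Works with any DB-API connection."""
    verb = sql.split()[0].upper()
    start = time.perf_counter_ns()
    outcome = "ok"
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.Error:
        outcome = "error"
        raise
    finally:
        db_red["count"][(verb, outcome)] += 1
        db_red["dur_us"][verb].append((time.perf_counter_ns() - start) // 1000)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, title TEXT)")
conn.execute("INSERT INTO items VALUES (42, 'IPL playoff')")
rows = red_query(conn, "SELECT title FROM items WHERE id = ?", (42,))
```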

SLOs, burn-rate alerts, and synthetic monitoring as the layer above RED

A typical SLO at Hotstar reads: "99.5% of catalogue-API GET requests completed in < 800 ms over a rolling 28-day window". This is RED's three numbers compressed into one: the rate of measured requests, the errors budget (0.5% of requests are allowed to fail or be slow), and the duration threshold (800 ms p99.5). The SLO is a contract between the service team and its consumers; the RED dashboard is the moment-to-moment telemetry that reveals whether the SLO is being met. Burn-rate alerting — "the error budget for this 28-day window is being consumed 14× faster than sustainable" — is the standard way to alert on RED data with a long horizon, instead of the hair-trigger threshold alerts that produce alert fatigue. Most teams that have RED for two years end up adopting SLO-based alerting on top of it, because the per-second thresholds are too noisy and the per-week aggregates are too slow. The SLO-with-burn-rate is the equilibrium that production-grade SRE arrives at; RED is its raw substrate.
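The burn-rate arithmetic is one division, and the 14× figure falls straight out of a 99.5% SLO:

```python
def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being
    consumed. slo_target = 0.995 leaves a 0.5% budget of bad events per
    window; a burn rate of 1.0 exactly exhausts the budget at window end."""
    budget = 1.0 - slo_target
    return bad_fraction / budget

# 7% of requests currently breaching the 800 ms / error criterion against
# a 99.5% SLO burns the 28-day budget 14x too fast.
rate = burn_rate(0.07, 0.995)   # 14.0
```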

The flip side: RED measures what the service sees; it does not measure what the user sees. A CDN failure that drops 30% of edge requests before they reach the origin produces a healthy RED on the origin and a fire in user experience. The complement is synthetic monitoring — a test runner in a different network (datacentre, cellular network, third-party probe service) that drives the same end-to-end transactions a real user would and records the rate, errors, and duration of those tests. Hotstar runs synthetic probes from cellular networks in twelve Indian cities every 30 seconds, against the same endpoints the player calls. When the city-level synthetic p99 jumps for users in Hyderabad while the origin's RED is healthy, the fire is between the user and the origin — typically CDN, DNS, or ISP routing — and no amount of RED on the origin will find it. The pairing rule: RED on every service for "is the service healthy?", synthetic probes from the user's network for "is the user healthy?", SLOs with burn-rate alerts on top for "are we honouring our contract?", and the gap between any two of those layers is exactly the failure mode the third one cannot see.

Cardinality, retention, and the cost of RED at Indian scale

The naive RED implementation explodes cardinality on the first day a junior engineer adds an unbounded label. A single endpoint label that is not normalised — /items/47 instead of /items/{id} — turns into 10 million distinct time series in a fortnight. A user_id label turns into 800 million. Prometheus and most metrics backends bill by active series; the cardinality blowup turns a ₹40k/month observability bill into ₹4L/month overnight, and that is before the queries start timing out. The discipline is to cap label cardinality at code-review time: every metric label must have a known, bounded set of values (status codes, normalised paths, service names, region codes), and high-cardinality dimensions (user IDs, transaction IDs, raw URLs) belong in tracing or logs, not metrics. The Razorpay rule: any new metric label must list its expected cardinality in the PR description, and if the answer is "unbounded" the PR is rejected. Retention is the second cost lever — RED data at 1-second resolution for 28 days is rarely needed; most teams keep 1-second for 6 hours, 1-minute for 7 days, 5-minute for 90 days, and use Thanos or VictoriaMetrics for the long-term aggregate. Get cardinality and retention right on day one and RED stays affordable; get either wrong and the dashboard either lies (because the backend dropped series) or costs more than the engineering team it is monitoring.

A related implementation pitfall is choosing the wrong Prometheus metric type. Rate and errors must be Counter (monotonically increasing); the rate(...) function in PromQL only works on counters. Duration must be a Histogram (a set of bucketed counters), not a Summary and not a Gauge. Histograms aggregate across instances correctly — histogram_quantile(0.99, sum(rate(red_duration_us_bucket[5m])) by (le)) gives the fleet-wide p99. Summaries do not aggregate; their per-instance percentiles cannot be combined into a fleet-wide percentile (you cannot compute a p99 from twelve other p99s without the underlying samples). The HdrHistogram primitive shown in the collector above is the correct underlying structure; the Prometheus exposition layer just needs to render it as a Histogram with the right bucket boundaries — typically powers of two from 1 µs to 60 s. Get the type right on day one and the dashboard composes; get it wrong and you discover three months in that your fleet p99 has been a lie.
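A minimal sketch of that exposition shape, using a hand-rolled power-of-two bucket histogram (rather than the hdrh object) so the cumulative `le` semantics that make histograms aggregable are visible:

```python
import bisect

# Power-of-two bucket bounds from 1 us to ~67 s, as suggested above.
BOUNDS_US = [2 ** i for i in range(27)]   # 1 us .. 67,108,864 us

class PromHistogram:
    """Cumulative-bucket histogram in Prometheus Histogram shape: each
    `le` bucket counts observations <= its bound, so per-instance buckets
    can be summed and fed to histogram_quantile()."""
    def __init__(self):
        self.counts = [0] * (len(BOUNDS_US) + 1)  # last slot is +Inf
        self.total = 0

    def observe(self, us: int):
        self.counts[bisect.bisect_left(BOUNDS_US, us)] += 1
        self.total += 1

    def expose(self, name: str = "red_duration_us") -> str:
        out, cum = [], 0
        for bound, n in zip(BOUNDS_US + ["+Inf"], self.counts):
            cum += n   # cumulative: every bucket includes all smaller ones
            out.append(f'{name}_bucket{{le="{bound}"}} {cum}')
        out.append(f"{name}_count {self.total}")
        return "\n".join(out)

h = PromHistogram()
for v in (30_000, 31_000, 1_100_000):   # two body samples, one tail sample
    h.observe(v)
```

Because the buckets are cumulative, summing the same `le` series across twelve instances yields a valid fleet-wide histogram, which is exactly what a Summary's pre-computed percentiles cannot do.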

Reproduce this on your laptop

# Linux/macOS; FastAPI + Uvicorn + HdrHistogram + wrk2.
python3 -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn hdrhistogram   # PyPI package "hdrhistogram" provides the hdrh module
# In one terminal:
python3 red_collector.py
# In another terminal: install wrk2 (https://github.com/giltene/wrk2) then:
wrk2 -R 2000 -t 4 -c 200 -d 60s --latency http://127.0.0.1:8080/items/42
# Observe the p99 in wrk2's output match the p99 from /metrics within ~5%.
curl -s http://127.0.0.1:8080/metrics | grep duration

When RED finds nothing — the limits of the method

RED is excellent at finding "this service is the slow link" or "this service is dropping more errors than yesterday". It is silent on three classes of problem that look like service issues but are not. The on-call who knows where RED ends saves the most time during a fire — they stop banging on the RED dashboard and switch tools at the right moment, instead of staring at a green panel for fifteen minutes wondering why it is green when the user is unhappy.

1. Per-tenant or per-user degradation that averages out. If 0.5% of users at Razorpay belong to a single high-volume merchant whose webhook URL is timing out, the merchant sees a 100% failure rate while the service-wide error rate barely moves from 0.04% to 0.54%. RED at the service level is silent; RED bucketed by tenant_id would catch it, but tenant_id is high-cardinality and most teams cannot afford it. The complement is logs-based alerting on per-tenant error rates, sampled rather than continuous, and traces filtered by tenant for the diagnostic step. This is the failure mode where RED's aggregation, which is its strength elsewhere, becomes its weakness — and it is also why most "tenant X is broken" complaints arrive via the support channel, not via paging.

2. Slow degradation under the threshold. RED with absolute thresholds ("alert when p99 > 800 ms") catches sudden jumps but misses gradual drift. A service whose p99 has been climbing 5 ms per week for six months will not page on any single day, but at the end of the half-year is at 1.5 s with no warning shot. The fix is rate-of-change alerts in addition to threshold alerts, and weekly RED-trend reviews where the team plots each service's RED metrics over a 90-day rolling window. Most teams add this only after one drift-style incident teaches them the lesson; the discipline is to add it before.

3. Bugs whose latency is fine but whose semantics are broken. A recommendations service that returns the same recommendations to every user (because a cache key is hashed wrong) has perfect RED — rate normal, errors zero, duration p99 30 ms — and is shipping a useless product. RED is a performance method, not a correctness method; it cannot tell you that the answers are wrong, only that they were returned quickly and without HTTP errors. Correctness needs end-to-end synthetic tests that assert on the response content (not just the status code), feature flags with shadow-traffic comparison, and product analytics on user behaviour. RED is necessary but not sufficient; pretending it is sufficient is how teams ship semantically broken services that look healthy on every dashboard.
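The per-tenant blind spot (limit 1) maps directly to the logs-based complement the text describes. The sketch below is one way to do it, under stated assumptions: the log field names (tenant_id, status), the thresholds, and the traffic shape are invented; a real implementation would run this over a sampled log stream, not an in-memory list.

```python
from collections import defaultdict

def broken_tenants(log_sample, min_requests=50, threshold=0.10):
    """Tenants whose sampled error rate exceeds the threshold, skipping
    tenants with too few sampled requests to judge."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in log_sample:
        totals[rec["tenant_id"]] += 1
        if rec["status"] >= 500:
            errors[rec["tenant_id"]] += 1
    return sorted(t for t, n in totals.items()
                  if n >= min_requests and errors[t] / n > threshold)

# 199 healthy tenants plus one merchant failing 100%: the service-wide
# error rate is ~0.5%, far under any paging threshold, but the per-tenant
# view flags the merchant immediately.
sample = [{"tenant_id": f"t{i}", "status": 200}
          for i in range(199) for _ in range(100)]
sample += [{"tenant_id": "merchant_x", "status": 502} for _ in range(100)]
print(broken_tenants(sample))   # ['merchant_x']
```

The min_requests guard is what makes this affordable where high-cardinality RED is not: it runs on a sample, periodically, rather than as a continuously scraped per-tenant metric.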
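The slow-drift case (limit 2) comes down to alerting on the trend rather than the level. A minimal sketch, assuming weekly p99 samples and a 13-week projection horizon (both invented parameters): fit a straight line to the history and page when the projection crosses the SLO, even though no single week does.

```python
def slope_ms_per_week(weekly_p99_ms):
    """Least-squares slope of p99 over week index."""
    n = len(weekly_p99_ms)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(weekly_p99_ms) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, weekly_p99_ms))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def drift_alert(weekly_p99_ms, slo_ms=800, horizon_weeks=13):
    """Fire if the current trend would breach the SLO within a quarter,
    even though no single week's value has."""
    s = slope_ms_per_week(weekly_p99_ms)
    projected = weekly_p99_ms[-1] + s * horizon_weeks
    return projected > slo_ms

# Thirteen weeks of p99 climbing 5 ms/week from 700 ms: every week is
# under the 800 ms SLO, yet the projection pages.
history = [700 + 5 * w for w in range(13)]
print(max(history), drift_alert(history))
```

The same idea expressed in PromQL is a threshold on the p99 now versus `offset 7d`; the Python version just makes the least-squares arithmetic explicit.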
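The semantic-correctness gap (limit 3) is exactly what a content-asserting synthetic check closes. In the sketch below, fetch_recommendations is a stand-in for the real HTTP call (in production it would be something like a requests.get(...).json() against the service); the probe users and field names are assumptions for illustration.

```python
def check_recommendations(fetch, probe_users):
    """Return a list of failures; an empty list means the check passed."""
    failures = []
    responses = {u: fetch(u) for u in probe_users}
    for user, resp in responses.items():
        if not resp.get("items"):
            failures.append(f"{user}: empty recommendations")
    # The semantic assertion RED cannot make: distinct users should not all
    # receive an identical payload.
    bodies = {tuple(r["items"]) for r in responses.values()}
    if len(bodies) == 1 and len(probe_users) > 1:
        failures.append("identical payload for every probe user")
    return failures

# Simulate the broken cache key: every user gets the same list back, with
# status 200 and a fast p99 — perfect RED, useless product.
broken = lambda user_id: {"items": ["a", "b", "c"]}
print(check_recommendations(broken, ["u1", "u2", "u3"]))
```

The check asserts on response content, not status code, which is precisely the distinction the paragraph above draws between a performance method and a correctness method.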

A fourth, more subtle limit worth knowing: RED is silent on client-perceived latency vs server-measured latency. The duration histogram is measured from the moment the server's middleware starts the timer to the moment it returns the response — which excludes TCP setup, TLS handshake, request body transmission, and response body transmission to the user. For a user on a 4G connection in Patna with 180 ms RTT to the Mumbai datacentre, the server-measured p99 of 60 ms is an honest measurement of a small fraction of the total user-visible latency. The complement is real-user-monitoring (RUM) — JavaScript in the browser or instrumentation in the mobile app — that measures the full timeline from tap to render. RUM and RED disagree by 200–500 ms on average for Indian users, and the gap is the connection setup the server cannot see. Both numbers are correct, they just answer different questions; the RED p99 is the server SLO, the RUM p99 is the user experience.
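The RED-vs-RUM gap is mostly handshake arithmetic, and a back-of-the-envelope sketch makes it concrete. The RTT, handshake counts (fresh connection, TLS 1.3 at one round trip), and server time below are assumptions for the Patna-to-Mumbai example, not measurements.

```python
def user_visible_ms(rtt_ms, server_ms, tls13=True, resumed=False):
    """Rough user-visible latency for a single small request/response."""
    rtts = 0
    if not resumed:
        rtts += 1                   # TCP three-way handshake
        rtts += 1 if tls13 else 2   # TLS handshake (1-RTT for TLS 1.3)
    rtts += 1                       # request up + response down
    return rtts * rtt_ms + server_ms

# 180 ms RTT, server-measured p99 of 60 ms: the server's histogram reports
# 60 ms while the user waits roughly ten times that on a cold connection.
print(user_visible_ms(rtt_ms=180, server_ms=60))
print(user_visible_ms(rtt_ms=180, server_ms=60, resumed=True))
```

On a reused connection the gap collapses to a single round trip, which is why the RED-vs-RUM disagreement is largest for first requests and short-lived mobile connections.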

The pragmatic order during a live incident: RED first because it is fastest at narrowing to a service (90 seconds), USE second on the suspected service's host to confirm the resource (90 seconds), distributed tracing third to find the code path inside the service (5 minutes), logs fourth to find the per-tenant or per-input pattern (15 minutes). Each step takes longer than the previous one and yields finer-grained information; the discipline is not to skip steps. The on-call who jumps straight from RED to logs spends fifteen minutes grepping while the actual answer was visible in the host's USE panel.

Where this leads next

RED is the second methodology in Part 4 and the first one that thinks about distributed systems rather than single hosts; the rest of the part builds on it as follows.

The arc across Part 4 — methodology — is: USE for resources (chapter 24), RED for services (this chapter), off-CPU for locks and waits (chapter 30), distributed tracing for cross-service waterfalls (chapter 31). A senior SRE rotates between these four lenses in roughly that order during a fire, and the discipline is knowing which lens to point at the problem at each step rather than reaching for the same lens every time.

The same arc explains why a methodology-focused part of the curriculum sits before the parts on profiling, eBPF, and capacity planning rather than after. Tools without a methodology are noise — perf record produces a 200 MB file that the on-call cannot read at 02:14 IST without first knowing which service to record. RED is the layer that points the tools at the right place; once it has narrowed the fire to one service, every later chapter (flamegraphs, off-CPU, queueing, capacity) is the answer to a question that RED has already framed.

The deeper habit RED instils: organise telemetry by the unit of failure, not by the unit of provisioning. Hosts are the unit of provisioning — that is the unit your cloud bill is denominated in, the unit your infrastructure team thinks about, the unit dashboards historically defaulted to. Services and endpoints are the units of failure — that is what users care about, what SLOs are written against, what RED measures. The migration from host-organised dashboards to service-organised dashboards is the same migration as the one from "the box is the system" mental model to "the request graph is the system" mental model. Once that switch flips, RED is obviously primary and host metrics are obviously the supporting cast — and a 21:48 IST page during the IPL is solvable in six minutes instead of an hour.

References

  1. Tom Wilkie, "The RED Method" (Grafana Labs, 2018) — the canonical talk that introduced RED; the slides include the exact dashboard layout most teams have since adopted.
  2. Google SRE Book, "Monitoring Distributed Systems" (Beyer et al., 2016) — the four golden signals (rate, errors, duration, saturation) with case studies from Google's production.
  3. Gil Tene, "How NOT to Measure Latency" (2015) — the foundational talk on coordinated omission and why mean-based duration metrics are wrong; mandatory before you trust any p99 number.
  4. HdrHistogram by Gil Tene — the data structure that makes percentile-based RED feasible at high request rates.
  5. Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2.5 — Methodologies — RED in the context of every other systems-performance methodology, with explicit comparison to USE.
  6. Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — RED as one of three pillars (metrics, logs, traces); the connective tissue between RED and tracing.
  7. /wiki/use-method-utilization-saturation-errors — the host-level companion methodology; using RED without USE leaves resource bottlenecks invisible.
  8. /wiki/the-methodology-problem-most-benchmarks-are-wrong — the controlled-benchmarking dual to RED's production firefighting role.