Rate limiting: token bucket, leaky bucket
PaySetu's public payments API publishes a quota of "100 requests per second per merchant". A merchant integration team writes a batch settlement script that sleeps 10 ms between calls. At 02:14 IST the script wakes up after a 90-second pause, fires 380 calls in the first 50 ms because the merchant's local clock had drifted, and gets 280 of them rejected with 429 Too Many Requests. The merchant escalates: "your docs say 100 RPS — we were silent for 90 seconds before those 380 calls, so averaged over the window we are nowhere near 100 RPS, why are you rejecting?" The answer is that PaySetu's gateway runs a leaky-bucket limiter that shapes traffic to a smooth 100 RPS regardless of how requests arrive, while the merchant assumed a token bucket that lets you burst against accumulated allowance. Same number on the contract, two different mechanisms, and the difference is the gap between an integration that works on the second try and one that pages a developer at 02:14. The two algorithms agree on the long-run average and disagree on every other property: burst tolerance, queueing delay, fairness across clients, and what the rejection pattern looks like under sustained overload.
Token bucket and leaky bucket are dual algorithms with the same long-run rate but opposite philosophies. Token bucket accumulates allowance up to a cap and lets a client spend the cap as a burst — bursty traffic is allowed, idle time is rewarded. Leaky bucket meters every request through a fixed-rate output queue — bursty arrivals are smoothed (or dropped), idle time is forfeited. Production limiters are almost always token-bucket variants (GCRA, sliding-window-counter) because real APIs need to absorb bursts; leaky-bucket survives mostly in router queues and ingress shaping.
Two buckets, two stories
A bucket is a simple object: a capacity, a current level, and a rule for adding and removing. The two algorithms differ only in what flows in and what flows out.
Token bucket. Tokens are added at a steady rate r (e.g. 100 per second). The bucket has a cap C (e.g. 200). Each request consumes one token; if the bucket is empty, the request is rejected (or queued, depending on the variant). Idle time accumulates tokens up to the cap. A client that has been quiet for 2 seconds with r=100, C=200 arrives with the bucket full at 200 tokens — it can fire 200 requests instantly, then is throttled to the steady-state r once the bucket drains.
Leaky bucket. Requests are added to the bucket as they arrive. The bucket has a cap C. Requests drain at a steady rate r — exactly one request leaves the bucket every 1/r seconds, no matter how full the bucket is. If a request arrives and the bucket is full, it is rejected (or, in the queueing leaky-bucket, the bucket grows but the output rate stays fixed — so the request is delayed, not rejected). The bucket enforces a smooth output regardless of arrival pattern.
Why these are duals: the token bucket fixes the rate at which allowance flows in (tokens at r) and lets requests drain it as fast as they arrive; the leaky bucket fixes the rate at which requests flow out (drain at r) and lets arrivals fill it as fast as they like. The same r appears in both, and each has a capacity C — but C means opposite things: burst allowance in one, queue depth or overflow threshold in the other. That reversed polarity flips every secondary property: burst handling, queueing, fairness, latency tail.
What the two algorithms actually do under a burst
The clearest way to see the difference is to draw the response to a step input: a client is silent for 5 seconds (allowing the token bucket to refill), then issues 200 requests in 100 ms.
- Token bucket (r=100/s, C=200): the bucket holds 200 tokens at the start of the burst. All 200 requests succeed instantly, draining the bucket to 0. After the burst the bucket refills at 100/s, so another 100 requests issued over the following second would succeed at the steady rate. The 5 idle seconds were banked and spent in 100 ms.
- Leaky bucket (r=100/s, meter form with minimal burst tolerance): the bucket opens one drain slot every 10 ms, no matter how long the client has been idle. During the 100 ms burst only ~10 slots open, so ~10 requests are accepted and ~190 are rejected; from then on exactly 100 requests/s get through. In the queueing form nothing is rejected — the 190 excess requests sit in the bucket and drain at 100/s over the next 1.9 seconds, arriving late instead of never. Either way, the 5 idle seconds were forgotten.
Token bucket emits all 200 requests in the first 100 ms and nothing after; the leaky-bucket meter emits ~10 and rejects the rest; the leaky-bucket queue emits 100 in the first second and the remaining 100 in the next. From the upstream service's point of view, token bucket is bursty and leaky bucket is smooth. From the client's point of view, token bucket rewards being well-behaved (idle time turns into burst credit), while leaky bucket punishes the burst regardless of prior behaviour.
Why this matters for the user-visible experience: a user who taps a button after sitting idle on a screen for 5 seconds has spent none of their allowance, and should be able to spend some of it as a burst rather than be throttled as if the idle time never happened. Token bucket matches this intuition. A backend service that genuinely cannot absorb bursts (e.g. a downstream with a fixed-size connection pool) needs leaky bucket to protect itself. The right choice depends on which side of the API you are protecting.
A working implementation — both algorithms in one file
# rate_limiters.py — token bucket and leaky bucket on the same interface
import time, asyncio
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float       # tokens added per second
    capacity: float   # max tokens
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self):
        self.tokens = self.capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

@dataclass
class LeakyBucket:
    rate: float       # requests drained per second
    capacity: float   # max queue depth
    level: float = 0.0
    last: float = 0.0

    def __post_init__(self):
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.rate)
        self.last = now
        if self.level + n <= self.capacity:
            self.level += n
            return True
        return False

async def burst_test(name, limiter, n_requests=200, burst_ms=100):
    accepted, rejected, t0 = 0, 0, time.monotonic()
    for i in range(n_requests):
        if limiter.try_acquire():
            accepted += 1
        else:
            rejected += 1
        await asyncio.sleep(burst_ms / 1000.0 / n_requests)
    elapsed = time.monotonic() - t0
    print(f"{name:14s} burst {n_requests} in {elapsed*1000:.0f}ms -> "
          f"accepted={accepted} rejected={rejected}")
async def main():
    print("--- after 5s idle, 200 requests in 100ms ---")
    tb = TokenBucket(rate=100, capacity=200)
    lb = LeakyBucket(rate=100, capacity=1)   # shallow bucket: pure shaper at 100 RPS
    await asyncio.sleep(5.0)                 # idle time accumulates token-bucket credit
    await burst_test("token-bucket", tb)
    await burst_test("leaky-bucket", lb)

    print("--- sustained 300 RPS for 2s ---")
    tb = TokenBucket(rate=100, capacity=200)
    lb = LeakyBucket(rate=100, capacity=1)
    a_tb = a_lb = r_tb = r_lb = 0
    t0 = time.monotonic()
    while time.monotonic() - t0 < 2.0:
        if tb.try_acquire(): a_tb += 1
        else: r_tb += 1
        if lb.try_acquire(): a_lb += 1
        else: r_lb += 1
        await asyncio.sleep(1.0/300.0)
    print(f"token-bucket: accepted={a_tb} rejected={r_tb}")
    print(f"leaky-bucket: accepted={a_lb} rejected={r_lb}")

asyncio.run(main())
Sample run on Python 3.11 (counts are approximately reproducible; the exact figures shift by a handful of requests with the event loop's sleep granularity):
--- after 5s idle, 200 requests in 100ms ---
token-bucket burst 200 in 100ms -> accepted=200 rejected=0
leaky-bucket burst 200 in 100ms -> accepted=10 rejected=190
--- sustained 300 RPS for 2s ---
token-bucket: accepted=400 rejected=200
leaky-bucket: accepted=200 rejected=400
Per-line walkthrough. TokenBucket.try_acquire is a few lines of arithmetic: compute the elapsed wall time since the last call, add elapsed * rate tokens (capped at capacity), then check whether a token is available. The state is two floats — tokens and last — so the limiter needs no locks inside a single-threaded event loop and is trivially memory-cheap. min(self.capacity, ...) is the burst-cap line: without it, a client idle for an hour would bank 360 000 tokens and could dump them all at once, defeating the point. LeakyBucket.try_acquire has the dual structure: drain at elapsed * rate (floored at 0), then check whether adding n would overflow. max(0.0, ...) is the floor — once the bucket is empty, further idle time does not give the next request extra credit.

One subtlety worth making explicit: a meter leaky bucket with a deep bucket (capacity 200) that starts empty makes exactly the same accept/reject decisions as a token bucket with capacity 200 that starts full — the level is the mirror image of the token count. The smoothing behaviour the leaky bucket is known for comes from keeping the bucket shallow, or from the queueing form that delays instead of rejecting. That is why main() constructs LeakyBucket(rate=100, capacity=1): one slot, reopened every 10 ms.

The output shows the two key behaviours. Idle-then-burst accepts everything in token-bucket and almost nothing in leaky-bucket. Sustained overload at 3× the rate accepts roughly the rate from each, but token-bucket front-loads: its first 200 requests consume the cap, after which steady-state at 100/s gives 100 × 2 = 200 more accepts, total 400. The leaky bucket opens one slot every 10 ms throughout, so it accepts 100/s for the full 2 seconds (200 total) and rejects everything else. Same long-run rate, different transient.
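The meter form above answers overflow with a rejection; the queueing form answers it with a delay. A minimal asyncio sketch of the queueing variant, assuming the same 100 RPS rate as the demo (the class name and interface are illustrative and not part of rate_limiters.py):

import asyncio, time

class LeakyBucketQueue:
    """Queueing leaky bucket: every caller is eventually admitted, but no two
    admissions are spaced closer than 1/rate seconds."""
    def __init__(self, rate: float):
        self.interval = 1.0 / rate           # one drain slot every 1/rate seconds
        self.next_slot = time.monotonic()    # earliest time the next request may leave
        self._lock = asyncio.Lock()          # serialise slot assignment

    async def acquire(self) -> float:
        async with self._lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)  # take the next free slot
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            await asyncio.sleep(delay)       # the smoothing shows up as queueing delay
        return delay

async def queue_demo():
    lb = LeakyBucketQueue(rate=100)          # 100 requests/s output, never a 429
    delays = [await lb.acquire() for _ in range(20)]
    print(f"20 back-to-back requests, max queueing delay {max(delays)*1000:.0f} ms")

asyncio.run(queue_demo())

The cost is visible even at this scale: the 20th back-to-back caller waits about 190 ms, and a burst of 2 000 would push the tail to 20 seconds — which is why API gateways prefer the meter form and reserve the queueing form for places where a delay is cheaper than a drop.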
GCRA — the algorithm production limiters actually run
The naive token-bucket above stores two floats and updates them on every request. At 1M RPS across a Redis-backed limiter, those two updates are a hot spot — and they require a read-modify-write against a shared store, which means a Lua script or a WATCH/MULTI/EXEC round-trip. The Generic Cell Rate Algorithm (GCRA) — invented by the ATM Forum in 1996 for traffic shaping in ATM networks — stores one number: the Theoretical Arrival Time (TAT) of the next allowed request. A request is allowed if now() ≥ TAT - τ where τ is the burst tolerance; on accept, TAT := max(TAT, now()) + 1/r. One float, one comparison, one write. Stripe's rate limiter blog post (2019) walks through the same derivation: GCRA is mathematically equivalent to token-bucket but cheaper to store in Redis.
@dataclass
class GCRA:
    rate: float       # requests per second
    burst: int        # burst tolerance (equivalent to token-bucket capacity)
    tat: float = 0.0  # theoretical arrival time of the next conforming request

    def try_acquire(self) -> bool:
        now = time.monotonic()
        increment = 1.0 / self.rate
        tau = self.burst * increment          # burst tolerance window
        new_tat = max(self.tat, now) + increment
        if new_tat - now <= tau:
            self.tat = new_tat
            return True
        return False
The Redis-backed implementation is one Lua script, atomic, and stores a single key per limiter. Why GCRA is the practical choice: it gives the same accept/reject decision as token-bucket but keeps 8 bytes of state per (client, route) pair instead of 16, and the conformance check plus the update touch a single value, so the whole decision fits in one short Lua script. At 10 billion live (client, route) pairs that is 80 GB vs 160 GB of raw state (before Redis's per-key overhead) — and the single-value write maps cleanly onto Redis's atomicity model. Cloudflare's rate limiter, Stripe's API limiter and Envoy's rate-limit filters all run GCRA, token-bucket, or sliding-window variants of the same idea.
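A sketch of that single-key Lua form, assuming a local Redis and the redis-py client — the key naming, TTL policy and the use of the gateway's wall clock are illustrative choices, not the only ones (limiters that cannot trust gateway clocks call Redis's TIME command inside the script instead):

import time
import redis  # assumption: redis-py client, Redis running locally on the default port

# GCRA in one Lua script: one key (the TAT), one comparison, one write.
GCRA_LUA = """
local tat       = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
local now       = tonumber(ARGV[1])
local increment = tonumber(ARGV[2])   -- 1/rate
local tau       = tonumber(ARGV[3])   -- burst tolerance in seconds
local new_tat   = math.max(tat, now) + increment
if new_tat - now <= tau then
    redis.call('SET', KEYS[1], new_tat, 'EX', ARGV[4])
    return 1
end
return 0
"""

r = redis.Redis()
gcra_script = r.register_script(GCRA_LUA)

def allow(client_id: str, rate: float = 100.0, burst: int = 200) -> bool:
    increment = 1.0 / rate
    tau = burst * increment
    ttl = int(tau) + 2  # the key may expire once the allowance has fully recovered
    return gcra_script(keys=[f"gcra:{client_id}"],
                       args=[time.time(), increment, tau, ttl]) == 1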
Common confusions
- "Token bucket and leaky bucket give the same results." They give the same long-run average rate. They give wildly different transients: token-bucket lets you burst against accumulated allowance; leaky-bucket smooths every burst into a stream. A merchant hitting "100 RPS" against a leaky-bucket limiter will get 429s on bursts that a token-bucket limiter accepts. The contract documentation must say which algorithm — naming the rate alone is ambiguous.
- "Leaky bucket is the same as a queue with a fixed-rate consumer." Almost — but the non-queueing leaky-bucket variant rejects on overflow rather than queueing. The two flavours are the queue leaky-bucket (delay overflow to the next available drain slot, increasing latency — the form sketched after the walkthrough above) and the meter leaky-bucket (reject overflow with 429). API gateways generally use the meter form because queueing introduces unbounded latency tails; router egress shapers use the queue form because delaying a packet is kinder to the flow than dropping it.
- "Sliding-window-counter is just leaky bucket with a different name." It is closer to token-bucket — it counts requests in a sliding time window of width capacity × 1/r. The classic implementation (Cloudflare's blog, 2017) interpolates between a current and a previous fixed window to approximate a true sliding window; a sketch follows this list. Mathematically it is equivalent to token-bucket with capacity = rate × window; the storage trade-off is different (two integers vs one float) and the rounding error at window boundaries is different.
- "Rate limiting is the same as load shedding." They overlap but answer different questions. Rate limiting answers "is this client over their quota right now?" — the client-fairness question. Load shedding answers "is the backend about to fall over?" — the global-health question. A request can be within its rate limit and still be load-shed (because the backend is at 95% CPU); it can be over its rate limit and still served (because the backend is idle and the limiter is in shadow mode). Production systems run both: rate limit by client, load shed by backend health.
- "The cap C is just a number you tune." The cap encodes the burst tolerance — the maximum number of requests you allow back-to-back regardless of long-run rate. Set it near 1 and you have a pure shaper with no burst at all; set it to 100·r and one client can dump 100 seconds' worth of traffic in an instant, monopolising a downstream. The right cap is set by the downstream's burst-absorption capacity (connection pool depth, DB query parallelism), not by aesthetics.
- "Distributed rate limiting is just sharded local rate limiting." Sharding by client ID gives you per-client limits cheaply but breaks if a client's traffic lands on multiple gateway nodes (round-robin LB, load-aware LB). The choices are: (a) sticky routing per client, (b) a centralised counter store (Redis with GCRA), (c) a token-bucket-with-eventual-consistency scheme where each gateway has a local sub-budget that is periodically reconciled. Most production limiters at >1M RPS run option (c), because option (b)'s Redis becomes the bottleneck.
Going deeper
The ATM Forum's TM 4.0 specification — where GCRA came from
The Generic Cell Rate Algorithm was standardised in 1996 in the ATM Forum's Traffic Management Specification 4.0 (TM 4.0) for shaping cell streams at ATM switches. ATM cells are 53 bytes fixed-size, transmitted at 155 Mbps or 622 Mbps line rates, and the switch had to enforce a peak cell rate and a sustained cell rate with very low memory per virtual circuit — there could be 64 000 active VCs on a single port. The two-float token-bucket was too expensive; GCRA's single-TAT representation fit. The same algorithm survived the death of ATM and became the reference for any limiter that needs to be cheap per-key. The specification also defines the dual leaky bucket as the formal counterpart to GCRA — peak-rate leaky-bucket plus sustained-rate leaky-bucket, with both having to admit the request.
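The dual-bucket admission rule — a tight peak-rate limiter and a looser sustained-rate limiter, both of which must accept — can be sketched with the GCRA class above. The rollback is the part naive compositions get wrong: neither bucket should be charged unless both conform. (DualGCRA is an illustrative name, not terminology from the spec.)

class DualGCRA:
    """TM 4.0-style dual admission: admit only if both the peak-rate and the
    sustained-rate limiter conform, charging neither on a rejection."""
    def __init__(self, peak_rate: float, sustained_rate: float, sustained_burst: int):
        self.peak = GCRA(rate=peak_rate, burst=1)                    # tight spacing at peak rate
        self.sustained = GCRA(rate=sustained_rate, burst=sustained_burst)

    def try_acquire(self) -> bool:
        peak_tat, sustained_tat = self.peak.tat, self.sustained.tat  # snapshot for rollback
        if self.peak.try_acquire() and self.sustained.try_acquire():
            return True
        # One of the two rejected: restore whichever already charged itself.
        self.peak.tat, self.sustained.tat = peak_tat, sustained_tat
        return False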
Distributed limiters at 1M+ RPS — the Redis-backed pattern and its drift
The naive distributed rate limiter is "every gateway calls Redis on every request, atomically INCR a key with TTL=1s, reject if > N". This works to ~100k RPS per Redis instance and falls over at higher rates because the network round-trip dominates. The production fix is the token-bucket-with-eventual-consistency pattern: each gateway holds a local sub-budget (e.g. 1/N of the global rate, where N is the gateway count), and a background process every 100 ms reconciles consumption against Redis. The trade-off is temporary over-consumption: if traffic suddenly skews 5:1 across gateways, the over-loaded gateway can serve more than its sub-budget for up to 100 ms before reconciliation pulls budget from the under-loaded gateways. Stripe's 2019 limiter post and Lyft's Envoy-based limiter (2020) both describe this pattern in detail. Discord's Trillium gateway (the post-Go-to-Rust rewrite, 2024) uses a similar pattern but with a SWIM gossip layer between gateways instead of a centralised Redis, eliminating the round-trip entirely.
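A sketch of the sub-budget pattern, reusing the TokenBucket class from the implementation section and redis-py's asyncio client. The key name, the even split across gateways and the fixed 100 ms interval are illustrative; the rebalancing policy is deliberately left as a comment because that is where real designs differ.

import asyncio
import redis.asyncio as aioredis  # assumption: redis-py >= 4.2 with asyncio support

class ReconcilingLimiter:
    """Serve every request from a local bucket sized to this gateway's share of
    the global rate; report consumption to Redis off the hot path."""
    def __init__(self, key: str, global_rate: float, global_burst: float, gateways: int):
        self.key = key
        self.local = TokenBucket(rate=global_rate / gateways,
                                 capacity=global_burst / gateways)
        self.consumed_since_sync = 0
        self.redis = aioredis.Redis()

    def try_acquire(self) -> bool:
        # Hot path: purely local, no network round-trip.
        if self.local.try_acquire():
            self.consumed_since_sync += 1
            return True
        return False

    async def reconcile_forever(self, interval: float = 0.1):
        # Background loop: publish local consumption, observe the global total.
        while True:
            await asyncio.sleep(interval)
            n, self.consumed_since_sync = self.consumed_since_sync, 0
            total = await self.redis.incrby(f"{self.key}:consumed", n)
            # A real implementation would adjust self.local.rate here when `total`
            # shows the global budget is being over- or under-consumed.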
KapitalKite's market-open thundering herd — when the limiter became the outage
At 09:14:59 IST on 2025-08-12, KapitalKite's order-placement API was serving 18 000 RPS steady-state. At 09:15:00 the markets opened. Within 200 ms the rate jumped to 285 000 RPS as 1.4 million retail clients all sent the order they had pre-staged in the app. The gateway was running a Redis-backed GCRA limiter with rate=200, burst=400 per client. The per-client check passed for almost every client (each individual was sending 1–2 orders). But the global Redis was now serving 285 000 GET-then-SET pairs per second; the 99th-percentile Redis latency went from 0.4 ms to 38 ms; the gateway's 8 ms request budget for the limiter call was exhausted; the gateway fell back to fail-open (the default for safety in case Redis is down) and stopped rate limiting entirely; the order-service behind it received 285 000 RPS against a backend sized for 80 000 RPS, and the cascade reached the matching engine in 4 seconds. The limiter itself was the bottleneck, and its fail-open mode masked the problem. The post-mortem fix was three-layered: (1) move per-client GCRA to local in-memory state on each gateway (no Redis call on the hot path), (2) reconcile to Redis every 200 ms in a background loop, (3) add a global circuit breaker that fails closed (drop traffic) if Redis is unreachable, on the principle that under known overload, rejecting is safer than admitting. p99 of the order API went from 38 ms back to 0.6 ms; the next market-open at 09:15 the following day held steady at 285 000 RPS with no fallback.
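The fail-open-versus-fail-closed choice from the post-mortem, as a sketch. The 8 ms budget matches the narrative above; remote_check stands in for whatever coroutine performs the Redis-backed decision — both names are illustrative.

import asyncio

LIMITER_BUDGET = 0.008  # seconds of the request deadline reserved for the limiter call

async def check_limit_fail_open(remote_check, client_id: str) -> bool:
    """Pre-incident policy: if the limiter store is slow or down, admit."""
    try:
        return await asyncio.wait_for(remote_check(client_id), timeout=LIMITER_BUDGET)
    except (asyncio.TimeoutError, ConnectionError):
        return True   # fail open: unprotected exactly when the store is overloaded

async def check_limit_fail_closed(remote_check, client_id: str) -> bool:
    """Post-incident policy: under known overload, rejecting is safer than admitting."""
    try:
        return await asyncio.wait_for(remote_check(client_id), timeout=LIMITER_BUDGET)
    except (asyncio.TimeoutError, ConnectionError):
        return False  # fail closed: shed load rather than cascade into the backend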
Fairness across clients — why min-max fairness needs more than a per-client limiter
Per-client rate limits give you a floor (no client can starve another) but not fairness (no client gets more than its fair share when the system is under-utilised). If client A has a limit of 100 RPS, client B has a limit of 100 RPS, and the system can serve 1000 RPS total but is currently serving 50 RPS for A and nothing else, you would like A to be allowed to burst towards 1000 — nobody else wants the capacity, so holding A to 100 just wastes it. A pure per-client limiter cannot do this; it always caps A at 100 even when capacity is idle. The fix is the hierarchical token bucket (HTB, the Linux tc qdisc htb algorithm, 2002): a global bucket with a high rate, and per-client sub-buckets that are guaranteed their own rate and may borrow spare capacity from the global pool once their own allowance is exhausted, giving it back simply by ceasing to borrow when contention appears. HTB is the machinery that lets a single TCP flow saturate a gigabit link when nobody else is using it, but caps it at its configured 100 Mbps share when the link is contested. The same idea applies to API rate limiting: per-client plus global, with borrowing — a minimal sketch follows.
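A minimal sketch of the borrow-from-the-parent idea, reusing the TokenBucket class from the implementation section. Real HTB also enforces per-class ceilings, quanta and multi-level trees; none of that is modelled here, and the sketch assumes the parent's rate is at least the sum of the children's guaranteed rates.

class BorrowingBucket:
    """Per-client bucket that is guaranteed its own rate and may borrow from a
    shared parent bucket once its own allowance is exhausted."""
    def __init__(self, parent: TokenBucket, rate: float, capacity: float):
        self.parent = parent
        self.own = TokenBucket(rate=rate, capacity=capacity)

    def try_acquire(self) -> bool:
        if self.own.try_acquire():
            self.parent.try_acquire()     # guaranteed traffic still counts against the shared pool
            return True
        return self.parent.try_acquire()  # own allowance spent: borrow spare global capacity

# A 1000 RPS pool shared by two clients, each guaranteed 100 RPS:
# pool = TokenBucket(rate=1000, capacity=1000)
# client_a = BorrowingBucket(pool, rate=100, capacity=100)
# client_b = BorrowingBucket(pool, rate=100, capacity=100)
# With client_b idle, client_a can run close to 1000 RPS by borrowing; once
# client_b wakes up, both compete for the spare pool on top of their guarantees.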
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
# save rate_limiters.py from the article body
python3 rate_limiters.py
# Expected: token-bucket accepts the 200-burst after idle; leaky-bucket rejects 190 of 200.
# At 300 RPS sustained, both throttle to ~100 RPS but token-bucket has front-loaded 200 tokens.
Where this leads next
Rate limiting sits at the same layer as the rest of the reliability stack. It composes with:
- Retries with exponential backoff and jitter — chapter 42; a client that retries blindly against a token-bucket limiter will empty the bucket and keep it empty for the full retry storm, so retries must respect Retry-After headers from the limiter (a sketch follows this list).
- Circuit breakers (Hystrix, Sentinel) — chapter 43; circuit breakers protect from sick backends, rate limiters protect from greedy clients. Both can return the same 503 — the difference is who the protection is for.
- Bulkheads — chapter 44; bulkheads cap concurrency, rate limiters cap arrival rate. A correctly-sized bulkhead is concurrency = rate × p99_latency (Little's Law): at 100 RPS and a 250 ms p99, roughly 25 concurrent slots.
- Timeouts and deadline propagation — chapter 45; the limiter check itself must run inside the request's deadline — a Redis-backed limiter that takes 38 ms when the deadline has 8 ms left needs to fail fast.
- Hedged requests for the long tail — chapter 47; hedging can as much as double the offered load against the limiter, so the limiter must account for hedge fan-out.
- Load balancing — chapter 38; if the LB does not pin a client to a single gateway, the local-limiter pattern needs reconciliation.
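The retry interaction from the first bullet, as a sketch. It assumes an httpx-style async client and handles only the delta-seconds form of the Retry-After header; the attempt count and base backoff are illustrative.

import asyncio, random
import httpx  # assumption: any async HTTP client with the same shape works

async def get_with_limit_aware_retries(url: str, max_attempts: int = 4) -> httpx.Response:
    """Retry 429s by honouring Retry-After instead of hammering an empty bucket."""
    async with httpx.AsyncClient() as client:
        backoff = 0.1
        for _ in range(max_attempts):
            resp = await client.get(url)
            if resp.status_code != 429:
                return resp
            retry_after = resp.headers.get("Retry-After")
            delay = float(retry_after) if retry_after else backoff
            await asyncio.sleep(delay + random.uniform(0, delay))  # jitter on top of the hint
            backoff *= 2
        return resp  # last 429 after exhausting attempts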
The composition is load-balancer → rate-limiter → bulkhead → breaker → retry → timeout → RPC. The rate-limiter sits early because it is the cheapest rejection — every layer downstream pays more per rejected request. A senior PaySetu SRE put it this way: the cheapest 429 is the one returned before the TLS handshake completes; the most expensive 429 is the one returned after the database lookup. Move the limiter as far upstream as the data allows.
References
- ATM Forum, "Traffic Management Specification Version 4.0" (1996) — the canonical specification of GCRA and the dual leaky bucket; the source for every modern API rate limiter.
- Stripe Engineering, "Scaling your API with rate limiters" (2019) — production write-up of GCRA in Redis with Lua; covers the four limiter types Stripe runs (request rate, concurrent request, fleet, charge).
- Cloudflare Blog, "How we built rate limiting capable of scaling to millions of domains" (2017) — sliding-window-counter algorithm, with the interpolation trick.
- Lyft Engineering, "Announcing Ratelimit" (2020) — Envoy's Go-based distributed limiter, used in production at Lyft.
- Demers, Keshav & Shenker, "Analysis and Simulation of a Fair Queueing Algorithm" (SIGCOMM 1989) — the original fair-queueing paper that motivated HTB and weighted fair queueing.
- Linux tc-htb(8) man page — the Hierarchical Token Bucket implementation in the Linux kernel; reference for borrowing-and-returning semantics.
- Dean & Barroso, "The Tail at Scale" (CACM 2013) — argues that any throttling layer must respect deadline propagation, or it becomes the source of the tail it is meant to protect.
- Timeouts and deadline propagation — chapter 45; the partner pattern that ensures rate-limit checks themselves do not eat the request budget.