Rate limiting: token bucket, leaky bucket

PaySetu's public payments API publishes a quota of "100 requests per second per merchant". A merchant integration team writes a batch settlement script that sleeps 10 ms between calls. At 02:14 IST the script wakes up after a 90-second pause, fires 380 calls in the first 50 ms because the merchant's local clock had drifted, and gets 280 of them rejected with 429 Too Many Requests. The merchant escalates: "your docs say 100 RPS — we were idle for 90 seconds, so averaged over any recent window we are far under 100 RPS. Why are you rejecting?" The answer is that PaySetu's gateway runs a leaky-bucket limiter that shapes traffic to a smooth 100 RPS regardless of how requests arrive, while the merchant assumed a token bucket that lets you burst against accumulated allowance. Same number on the contract, two different mechanisms, and the difference is the gap between an integration that works on the second try and one that pages a developer at 02:14. The two algorithms agree on the long-run average and disagree on every other property: burst tolerance, queueing delay, fairness across clients, and what the rejection pattern looks like under sustained overload.

Token bucket and leaky bucket are dual algorithms with the same long-run rate but opposite philosophies. Token bucket accumulates allowance up to a cap and lets a client spend the cap as a burst — bursty traffic is allowed, idle time is rewarded. Leaky bucket meters every request through a fixed-rate output queue — bursty arrivals are smoothed (or dropped), idle time is forfeited. Production limiters are almost always token-bucket variants (GCRA, sliding-window-counter) because real APIs need to absorb bursts; leaky-bucket survives mostly in router queues and ingress shaping.

Two buckets, two stories

A bucket is a simple object: a capacity, a current level, and a rule for adding and removing. The two algorithms differ only in what flows in and what flows out.

Token bucket. Tokens are added at a steady rate r (e.g. 100 per second). The bucket has a cap C (e.g. 200). Each request consumes one token; if the bucket is empty, the request is rejected (or queued, depending on the variant). Idle time accumulates tokens up to the cap. A client that has been quiet for 2 seconds with r=100, C=200 arrives with the bucket full at 200 tokens — it can fire 200 requests instantly, then is throttled to the steady-state r once the bucket drains.

Leaky bucket. Requests are added to the bucket as they arrive. The bucket has a cap C. Requests drain at a steady rate r — exactly one request leaves the bucket every 1/r seconds, no matter how full the bucket is. If a request arrives and the bucket is full, it is rejected (or, in the queueing leaky-bucket, the bucket grows but the output rate stays fixed — so the request is delayed, not rejected). The bucket enforces a smooth output regardless of arrival pattern.

Why these are duals: in a token bucket the steady flow runs into the bucket (tokens added at r) and requests drain it; in a leaky bucket the steady flow runs out of the bucket (requests emitted at r) and arrivals fill it. Same r parameter, same C parameter, but the polarity of the flow is opposite — and that flips every secondary property: burst handling, queueing, fairness, latency tail.

[Figure: Token bucket vs leaky bucket — same r, same C, opposite polarity. Two side-by-side bucket diagrams. Left (token bucket): a faucet drips tokens in at steady rate r, filling up to cap C during idle time; each arriving request pulls one token out, and an empty bucket means 429. Right (leaky bucket): requests pour in from above; the bucket drains through a hole at the bottom at steady rate r, and overflow past cap C means 429.]
Illustrative. Left: tokens drip in at rate r, requests pull tokens out — bursts are allowed up to the cap. Right: requests pour in, bucket drains at rate r — output is smooth, bursts overflow.

What the two algorithms actually do under a burst

The clearest way to see the difference is to draw the response to a step input: a client is silent for 5 seconds (allowing the token bucket to refill), then issues 200 requests in 100 ms.

Over the long run the two emit the same total; the shapes differ completely. The token bucket admits all 200 in the first 100 ms from banked credit, then sits idle while it refills. The queueing leaky bucket emits 10 in the first 100 ms, roughly 100 by the one-second mark, and the last of the burst at t = 2 s. From the upstream service's point of view, token bucket is bursty and leaky bucket is smooth. From the client's point of view, token bucket rewards being well-behaved (idle time turns into burst credit), while leaky bucket punishes the burst regardless of prior behaviour.
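These numbers can be computed exactly with simulated time rather than wall-clock sleeps. A minimal sketch, assuming the queueing leaky-bucket variant as the shaper, with integer microseconds so the arithmetic is exact:

```python
# Step response: 200 requests arrive over 100 ms after a long idle period,
# with r = 100/s and C = 200. Token bucket: the banked 200-token cap covers
# the whole burst, so every request is emitted at its arrival time.
# Queueing leaky bucket: each request departs at max(arrival, prev + 1/r).
US = 1_000_000                                  # microseconds per second
RATE, N = 100, 200
arrivals = [i * 500 for i in range(N)]          # 200 arrivals spread over 100 ms
period = US // RATE                             # 10_000 us between departures

token_emit = list(arrivals)                     # banked cap: emit on arrival
leaky_emit, last = [], None
for t in arrivals:
    last = t if last is None else max(t, last + period)
    leaky_emit.append(last)

def emitted_by(times, t_us):
    return sum(1 for x in times if x <= t_us)

for t_s in (0.1, 0.5, 1.0, 2.0):
    t_us = int(t_s * US)
    print(f"t={t_s:4.1f}s  token={emitted_by(token_emit, t_us):3d}  "
          f"leaky={emitted_by(leaky_emit, t_us):3d}")
```

The leaky column reads 11/51/101/200 rather than the round 10/100 of the prose because the first departure happens at t = 0 itself; the shape is the point: a straight 100/s ramp that finishes draining the burst at t = 2 s.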

Why this matters for the user-visible experience: a user who taps a button after sitting on the screen for 5 seconds should not be punished for the next 5 seconds because they "saved up" requests they did not make. Token bucket matches this intuition. A backend service that genuinely cannot absorb bursts (e.g. a downstream that has a fixed-size connection pool) needs leaky bucket to protect itself. The right choice depends on which side of the API you are protecting.

[Figure: Response to a 5 s-idle, 200-request, 100 ms burst — token vs leaky. Cumulative accepted requests (0 to 200) over wall time (-1 s to 2 s), burst arriving at t = 0. The token-bucket curve jumps from 0 to 200 at t = 0, spending the banked 200-token cap instantly. The queueing leaky-bucket curve rises at slope 100/s: 10 at t = 100 ms, about 110 at t = 1.1 s, 200 at t = 2 s. The 5 seconds of idle before t = 0 are annotated as the credit the token bucket has banked.]
Illustrative — not measured. Token bucket spends its 200-token cap instantly on the burst (vertical jump at t = 0). Leaky bucket drains at the steady 100/s rate: the queueing variant delays the burst, emitting the last request near t = 2 s, while a metering variant with a shallow buffer rejects most of it outright.

A working implementation — both algorithms in one file

# rate_limiters.py — token bucket and leaky bucket on the same interface
import time, asyncio, random
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float       # tokens added per second
    capacity: float   # max tokens
    tokens: float = 0.0
    last: float = 0.0
    def __post_init__(self):
        self.tokens = self.capacity
        self.last = time.monotonic()
    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

@dataclass
class LeakyBucket:
    # Note: as a pure meter, a leaky bucket with the same rate and capacity
    # makes exactly the same decisions as the token bucket (level is the
    # mirror image: level == capacity - tokens). The smoothing that makes
    # leaky buckets useful comes from a small capacity (queue depth), or from
    # delaying requests instead of rejecting them.
    rate: float       # requests drained per second
    capacity: float   # max queue depth
    level: float = 0.0
    last: float = 0.0
    def __post_init__(self):
        self.last = time.monotonic()
    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.rate)
        self.last = now
        if self.level + n <= self.capacity:
            self.level += n
            return True
        return False

async def burst_test(name, limiter, n_requests=200, burst_ms=100):
    accepted, rejected, t0 = 0, 0, time.monotonic()
    for i in range(n_requests):
        if limiter.try_acquire():
            accepted += 1
        else:
            rejected += 1
        await asyncio.sleep(burst_ms / 1000.0 / n_requests)
    elapsed = time.monotonic() - t0
    print(f"{name:14s} burst {n_requests} in {elapsed*1000:.0f}ms -> "
          f"accepted={accepted} rejected={rejected}")

async def main():
    print("--- after 5s idle, 200 requests in 100ms ---")
    tb = TokenBucket(rate=100, capacity=200)
    lb = LeakyBucket(rate=100, capacity=10)   # shallow queue: depth, not rate, is what smooths
    await asyncio.sleep(5.0)  # idle time accumulates token-bucket credit
    await burst_test("token-bucket", tb)
    await burst_test("leaky-bucket", lb)
    print("--- sustained 300 RPS for 2s ---")
    tb = TokenBucket(rate=100, capacity=200)
    lb = LeakyBucket(rate=100, capacity=10)
    a_tb = a_lb = r_tb = r_lb = 0
    t0 = time.monotonic()
    while time.monotonic() - t0 < 2.0:
        if tb.try_acquire(): a_tb += 1
        else: r_tb += 1
        if lb.try_acquire(): a_lb += 1
        else: r_lb += 1
        await asyncio.sleep(1.0/300.0)
    print(f"token-bucket:  accepted={a_tb} rejected={r_tb}")
    print(f"leaky-bucket:  accepted={a_lb} rejected={r_lb}")

asyncio.run(main())

Representative output on Python 3.11 (exact counts shift by a few per run: asyncio sleep overhead stretches the nominal 100 ms burst window and lowers the achieved 300 RPS):

--- after 5s idle, 200 requests in 100ms ---
token-bucket   burst 200 in ~100ms -> accepted=200 rejected=0
leaky-bucket   burst 200 in ~100ms -> accepted≈20  rejected≈180
--- sustained 300 RPS for 2s ---
token-bucket:  accepted≈400 rejected≈200
leaky-bucket:  accepted≈210 rejected≈390

Per-line walkthrough. TokenBucket.try_acquire is three lines of arithmetic: compute the elapsed wall time since the last call, add elapsed * rate tokens (capped at capacity), then check whether a token is available. The state is two floats — tokens and last — so the limiter is cheap per request and trivially small in memory. min(self.capacity, ...) is the burst-cap line: without it, a long idle period would fill the bucket without bound, and one quiet hour could be cashed in as an arbitrarily large burst, defeating the point. LeakyBucket.try_acquire has the dual structure: drain elapsed * rate (floored at 0), then check whether adding n would overflow. max(0.0, ...) is the floor — once the bucket is empty, further idle time earns no credit.

One deliberate choice needs flagging: the demo gives the leaky bucket a capacity of 10, not 200. As pure meters with equal rate and capacity the two algorithms make identical decisions (level == capacity - tokens at every step), so the contrast only appears when the leaky bucket's queue depth is shallow relative to the burst, or when it queues requests instead of rejecting them.

The output shows the two key behaviours. Idle-then-burst: the token bucket accepts the whole burst from banked credit; the leaky bucket accepts roughly its 10-slot depth plus whatever drains during the burst window (about 10 more per 100 ms) and rejects the rest. Sustained overload at 3× the rate: both converge on ~100 accepts per second, but the token bucket front-loads: its first 200 accepts spend the banked cap, and steady state adds 100/s × 2 s = 200 more, for ~400 total. The leaky bucket never banks: ~10 to fill the queue plus 100/s × 2 s gives ~210. Same long-run rate, different transient.
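The duality is worth pinning down in code. As pure meters (immediate accept or reject, same rate and capacity) the two algorithms are not merely similar but identical: substituting level = capacity − tokens turns one update rule into the other. A sketch with integer-microsecond timestamps, so the dual updates stay exactly equal:

```python
import random

US = 1_000_000  # integer microseconds; token amounts scaled by US for exact math

def simulate(arrivals_us, rate, cap):
    """Run the token-bucket meter and the rejecting leaky-bucket meter on the
    same arrival sequence; return both decision lists."""
    cap_s = cap * US
    tokens, level = cap_s, 0            # duals: level == cap_s - tokens throughout
    prev, tb, lb = arrivals_us[0], [], []
    for now in arrivals_us:
        added = (now - prev) * rate     # exact integer arithmetic
        prev = now
        tokens = min(cap_s, tokens + added)
        if tokens >= US:
            tokens -= US; tb.append(True)
        else:
            tb.append(False)
        level = max(0, level - added)
        if level + US <= cap_s:
            level += US; lb.append(True)
        else:
            lb.append(False)
    return tb, lb

random.seed(7)
t, arrivals = 0, []
for _ in range(5000):
    t += random.randrange(1, 7000)      # bursty gaps, ~285 RPS on average
    arrivals.append(t)
tb, lb = simulate(arrivals, rate=100, cap=200)
print("identical decisions:", tb == lb, "| accepted:", sum(tb), "of", len(tb))
```

Run it with any seed, rate, or capacity: the decision lists never diverge. That is why a leaky bucket only smooths traffic when its capacity is kept small relative to the burst, or when it delays requests instead of rejecting them.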

GCRA — the algorithm production limiters actually run

The naive token-bucket above stores two floats and updates them on every request. At 1M RPS across a Redis-backed limiter, those two updates are a hot spot — and they require a read-modify-write against a shared store, which means a Lua script or a WATCH/MULTI/EXEC round-trip. The Generic Cell Rate Algorithm (GCRA) — invented by the ATM Forum in 1996 for traffic shaping in ATM networks — stores one number: the Theoretical Arrival Time (TAT) of the next allowed request. A request is allowed if now() ≥ TAT - τ where τ is the burst tolerance; on accept, TAT := max(TAT, now()) + 1/r. One float, one comparison, one write. Stripe's rate limiter blog post (2019) walks through the same derivation: GCRA is mathematically equivalent to token-bucket but cheaper to store in Redis.

@dataclass
class GCRA:
    rate: float       # requests per second
    burst: int        # tolerance (= equivalent to token-bucket capacity)
    tat: float = 0.0  # theoretical arrival time of next request
    def try_acquire(self) -> bool:
        now = time.monotonic()
        increment = 1.0 / self.rate
        tau = self.burst * increment            # burst tolerance window
        new_tat = max(self.tat, now) + increment
        if new_tat - now <= tau:
            self.tat = new_tat
            return True
        return False

The Redis-backed implementation is one Lua script, atomic, and stores a single key per limiter. Why GCRA is the practical choice: it gives the same accept/reject decision as token-bucket but stores 8 bytes per (client, route) pair instead of 16, and the update touches one value instead of two. At 10 million live (client, route) pairs, that is 80 MB vs 160 MB of hot limiter state, and GCRA's single-value update maps cleanly to Redis's atomicity model. Cloudflare's rate-limiter, Stripe's API limiter, Envoy's local rate-limit filter all use GCRA or a sliding-window variant.
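The equivalence claim is cheap to spot-check. A sketch (integer-microsecond clock, hypothetical helper names) that runs the GCRA update from above and a token bucket with capacity equal to GCRA's burst over the same bursty arrival sequence:

```python
import random

US = 1_000_000  # integer microseconds keeps both state updates exact

def gcra_decisions(arrivals_us, rate, burst):
    inc = US // rate                 # emission interval, in us (rate must divide US)
    tau = burst * inc                # burst tolerance window
    tat, out = 0, []
    for now in arrivals_us:
        new_tat = max(tat, now) + inc
        if new_tat - now <= tau:
            tat = new_tat; out.append(True)
        else:
            out.append(False)
    return out

def token_decisions(arrivals_us, rate, cap):
    cap_s = cap * US                 # token counts scaled by US
    tokens, prev, out = cap_s, arrivals_us[0], []
    for now in arrivals_us:
        tokens = min(cap_s, tokens + (now - prev) * rate)
        prev = now
        if tokens >= US:
            tokens -= US; out.append(True)
        else:
            out.append(False)
    return out

random.seed(42)
t, arrivals = 0, []
for _ in range(10_000):
    t += random.randrange(0, 8000)   # bursty: gaps of 0-8 ms, ~250 RPS on average
    arrivals.append(t)
g = gcra_decisions(arrivals, rate=100, burst=200)
b = token_decisions(arrivals, rate=100, cap=200)
print("GCRA == token bucket on", len(arrivals), "arrivals:", g == b)
```

Same decisions, half the state: the GCRA side carries one integer (tat) where the token bucket carries two (tokens and the last-update timestamp).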

Common confusions

A published rate is not a behaviour. "100 RPS" can be enforced by either algorithm; they agree on the average and disagree on every individual request. An API quota is underspecified until the burst semantics are stated alongside it.

A rejecting leaky bucket is not automatically "smoother" than a token bucket. As pure meters with the same rate and capacity the two are exact duals (level = C − tokens) and admit exactly the same requests. The smoothing comes from a shallow queue depth, or from queueing requests instead of rejecting them.

"Leaky bucket" names two different devices. The meter rejects on overflow and bounds the admitted rate; the shaper queues and delays, producing a smooth output at the cost of added latency. Networking texts usually mean the shaper; API limiters usually mean the meter.

Going deeper

The ATM Forum's TM 4.0 specification — where GCRA came from

The Generic Cell Rate Algorithm was standardised in 1996 in the ATM Forum's Traffic Management Specification 4.0 (TM 4.0) for shaping cell streams at ATM switches. ATM cells are 53 bytes fixed-size, transmitted at 155 Mbps or 622 Mbps line rates, and the switch had to enforce a peak cell rate and a sustained cell rate with very low memory per virtual circuit — there could be 64 000 active VCs on a single port. The two-float token-bucket was too expensive; GCRA's single-TAT representation fit. The same algorithm survived the death of ATM and became the reference for any limiter that needs to be cheap per-key. The specification also defines the dual leaky bucket as the formal counterpart to GCRA — peak-rate leaky-bucket plus sustained-rate leaky-bucket, with both having to admit the request.
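The dual-bucket conformance test is a few lines to sketch (hypothetical class names, not the TM 4.0 pseudocode itself). The one structural requirement is that a non-conforming cell must leave neither bucket updated, which forces a check-then-commit split:

```python
US = 1_000_000  # integer-microsecond clock

class Gcra:
    def __init__(self, rate, burst):
        self.inc = US // rate            # emission interval
        self.tau = burst * self.inc      # tolerance window
        self.tat = 0                     # theoretical arrival time
    def conforms(self, now):             # pure check: no state change
        return max(self.tat, now) + self.inc - now <= self.tau
    def commit(self, now):
        self.tat = max(self.tat, now) + self.inc

class DualBucket:
    """Admit only cells that conform to BOTH the peak-rate and the
    sustained-rate meter; a rejected cell updates neither."""
    def __init__(self, peak_rate, sustained_rate, burst):
        self.peak = Gcra(peak_rate, 1)           # burst=1: strict peak spacing
        self.sustained = Gcra(sustained_rate, burst)
    def try_acquire(self, now):
        if self.peak.conforms(now) and self.sustained.conforms(now):
            self.peak.commit(now)
            self.sustained.commit(now)
            return True
        return False

# 100 cells back-to-back at the peak rate: the sustained bucket's burst
# allowance admits the head of the run, then throttles to the sustained rate.
db = DualBucket(peak_rate=1000, sustained_rate=100, burst=50)
accepted = sum(db.try_acquire(i * (US // 1000)) for i in range(100))
print("accepted from a 100-cell peak-rate run:", accepted)
```

With a 1000/s peak, a 100/s sustained rate, and a 50-cell burst allowance, the head of the run is admitted at the peak rate until the sustained bucket's tolerance is spent, after which roughly one cell per 10 ms gets through.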

Distributed limiters at 1M+ RPS — the Redis-backed pattern and its drift

The naive distributed rate limiter is "every gateway calls Redis on every request, atomically INCR a key with TTL=1s, reject if > N". This works to ~100k RPS per Redis instance and falls over at higher rates because the network round-trip dominates. The production fix is the token-bucket-with-eventual-consistency pattern: each gateway holds a local sub-budget (e.g. 1/N of the global rate, where N is the gateway count), and a background process every 100 ms reconciles consumption against Redis. The trade-off is temporary over-consumption: if traffic suddenly skews 5:1 across gateways, the over-loaded gateway can serve more than its sub-budget for up to 100 ms before reconciliation pulls budget from the under-loaded gateways. Stripe's 2019 limiter post and Lyft's Envoy-based limiter (2020) both describe this pattern in detail. Discord's Trillium gateway (the post-Go-to-Rust rewrite, 2024) uses a similar pattern but with a SWIM gossip layer between gateways instead of a centralised Redis, eliminating the round-trip entirely.
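The local-sub-budget pattern fits in a few lines without any real Redis (everything here is hypothetical: class names, claim sizes, the window; a plain in-process object stands in for the shared store). The hot path touches only gateway-local state; reconcile is what would run in the background loop every 100 ms:

```python
class SharedStore:
    """Stand-in for Redis: a global token budget for the current window."""
    def __init__(self, global_budget):
        self.budget = global_budget
    def take(self, want):
        # one atomic round-trip in the real thing (e.g. a small Lua script)
        granted = min(self.budget, want)
        self.budget -= granted
        return granted

class Gateway:
    def __init__(self, store, claim):
        self.store, self.claim, self.local = store, claim, 0
    def try_acquire(self):               # hot path: no network call
        if self.local > 0:
            self.local -= 1
            return True
        return False
    def reconcile(self):                 # background loop, e.g. every 100 ms
        self.local += self.store.take(self.claim - self.local)

store = SharedStore(global_budget=200)               # global window budget
gws = [Gateway(store, claim=25) for _ in range(4)]   # 4 gateways, 25-token claims
for gw in gws:
    gw.reconcile()                                   # initial sub-budgets

# Traffic skews hard onto gateway 0: it exhausts its local claim...
hits = [sum(gws[0].try_acquire() for _ in range(80)),
        sum(gws[1].try_acquire() for _ in range(10))]
print("accepted before reconcile:", hits)
# ...and the next reconcile pulls more of the unclaimed global budget.
gws[0].reconcile()
more = sum(gws[0].try_acquire() for _ in range(80))
print("gateway 0 accepts after reconcile:", more)
```

Between reconciles a skewed gateway is capped at its stale local claim; that window of over- or under-admission is the price paid for keeping the shared store off the hot path.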

KapitalKite's market-open thundering herd — when the limiter became the outage

At 09:14:59 IST on 2025-08-12, KapitalKite's order-placement API was serving 18 000 RPS steady-state. At 09:15:00 the markets opened. Within 200 ms the rate jumped to 285 000 RPS as 1.4 million retail clients all sent the order they had pre-staged in the app. The gateway was running a Redis-backed GCRA limiter with rate=200, burst=400 per client. The per-client check passed for almost every client (each individual was sending 1–2 orders). But the global Redis was now serving 285 000 GET-then-SET pairs per second; the 99th-percentile Redis latency went from 0.4 ms to 38 ms; the gateway's 8 ms request budget for the limiter call was exhausted; the gateway fell back to fail-open (the default for safety in case Redis is down) and stopped rate limiting entirely; the order-service behind it received 285 000 RPS against a backend sized for 80 000 RPS, and the cascade reached the matching engine in 4 seconds. The limiter itself was the bottleneck, and its fail-open mode masked the problem. The post-mortem fix was three-layered: (1) move per-client GCRA to local in-memory state on each gateway (no Redis call on the hot path), (2) reconcile to Redis every 200 ms in a background loop, (3) add a global circuit breaker that fails closed (drop traffic) if Redis is unreachable, on the principle that under known overload, rejecting is safer than admitting. p99 of the order API went from 38 ms back to 0.6 ms; the next market-open at 09:15 the following day held steady at 285 000 RPS with no fallback.

Fairness across clients — why min-max fairness needs more than a per-client limiter

Per-client rate limits give you a floor (no client can starve another) but not fairness (no client gets more than its fair share when the system is under-utilised). If client A has a limit of 100 RPS and client B has a limit of 100 RPS and the system can serve 1000 RPS total but is currently serving 50 RPS for A and nothing else, you would like A to be allowed to burst to 1000 — both clients are within their per-client cap, and there is spare capacity. A pure per-client limiter cannot do this; it always caps A at 100 even when capacity is idle. The fix is hierarchical token-bucket (HTB, the Linux tc qdisc htb algorithm, 2002): a global bucket with a high rate, with per-client sub-buckets that borrow from the global pool when they are below their cap and return when they are above. HTB is the algorithm that lets a single TCP flow saturate a gigabit link when nobody else is using it, but caps it at 100 Mbps when the link is contested. The same machinery applies to API rate limiting: per-client + global, with borrowing.
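A borrowing hierarchy in miniature (a sketch of the idea only; real tc-htb tracks quanta and charges the parent for all traffic, and the names here are hypothetical): each client spends its own bucket first, then borrows from the root's spare tokens, so an uncontested client can exceed its guarantee while the root still caps the aggregate.

```python
class HTBNode:
    """Token bucket with an optional parent to borrow from when empty."""
    def __init__(self, rate, capacity, parent=None):
        self.rate, self.capacity, self.parent = rate, capacity, parent
        self.tokens, self.last = float(capacity), 0.0
    def _refill(self, now):
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
    def try_acquire(self, now):
        self._refill(now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # own bucket empty: borrow from the parent's spare tokens
        return self.parent is not None and self.parent.try_acquire(now)

root = HTBNode(rate=1000, capacity=1000)            # link ceiling: 1000 RPS
a = HTBNode(rate=100, capacity=100, parent=root)    # client A: 100 RPS guaranteed
b = HTBNode(rate=100, capacity=100, parent=root)    # client B: 100 RPS guaranteed

# B is idle, so A's instantaneous burst runs past its own 100-token bucket
# by borrowing the root's 1000 spare tokens: 100 own + 1000 borrowed.
burst = sum(a.try_acquire(now=0.0) for _ in range(1500))
print("A alone, burst accepted:", burst)
```

When B is active, its own try_acquire calls drain the root pool too, so A's borrowing shrinks toward its guaranteed 100 RPS exactly as contention rises.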

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# save rate_limiters.py from the article body
python3 rate_limiters.py
# Expected: token-bucket accepts the entire post-idle burst from banked credit;
# leaky-bucket rejects most of it. At 300 RPS sustained, both settle to ~100 RPS,
# with token-bucket front-loading its 200 banked tokens.

Where this leads next

Rate limiting sits at the same layer as the rest of the reliability stack, composing with the load balancer in front of it and the bulkheads, breakers, retries, and timeouts behind it.

The composition is load-balancer → rate-limiter → bulkhead → breaker → retry → timeout → RPC. The rate-limiter sits early because it is the cheapest rejection — every layer downstream pays more per rejected request. A senior PaySetu SRE put it: the cheapest 429 is the one returned before TLS handshake completes; the most expensive 429 is the one returned after the database lookup. Move the limiter as far upstream as the data allows.

References

  1. ATM Forum, "Traffic Management Specification Version 4.0" (1996) — the canonical specification of GCRA and the dual leaky bucket; the source for every modern API rate limiter.
  2. Stripe Engineering, "Scaling your API with rate limiters" (2019) — production write-up of GCRA in Redis with Lua; covers the four limiter types Stripe runs (request rate, concurrent request, fleet, charge).
  3. Cloudflare Blog, "How we built rate limiting capable of scaling to millions of domains" (2017) — sliding-window-counter algorithm, with the interpolation trick.
  4. Lyft Engineering, "Announcing Ratelimit" (2020) — Envoy's Go-based distributed limiter, used in production at Lyft.
  5. Demers, Keshav & Shenker, "Analysis and Simulation of a Fair Queueing Algorithm" (SIGCOMM 1989) — the original fair-queueing paper that motivated HTB and weighted fair queueing.
  6. Linux tc-htb(8) man page — the Hierarchical Token Bucket implementation in the Linux kernel; reference for borrowing-and-returning semantics.
  7. Dean & Barroso, "The Tail at Scale" (CACM 2013) — argues that any throttling layer must respect deadline propagation, or it becomes the source of the tail it is meant to protect.
  8. Timeouts and deadline propagation — chapter 45; the partner pattern that ensures rate-limit checks themselves do not eat the request budget.