Which rate limiting algorithm is best?

Token bucket is the usual pick — it offers the best balance of simplicity, memory, and handling real-world bursty traffic. A bucket holds up to N tokens (the burst capacity) and refills at a steady rate; each request spends a token and is rejected when none remain, needing only a count and timestamp per client. Alternatives are fixed window (simple but allows boundary bursts), sliding window log (perfectly accurate but memory-heavy), and sliding window counter (a good two-counter approximation). Stripe uses token bucket.

How do you make a distributed rate limiter accurate across servers?

Store each client's bucket state (tokens, last_refill) in a central Redis shared by all gateway instances, otherwise requests split across gateways each see only part of the count. To avoid a race condition where two simultaneous requests both read the same count and both allow, move the entire read-calculate-update into a single atomic Lua script run inside Redis, which expands the atomic boundary to cover the whole read-modify-write.

Should a rate limiter fail open or fail closed?

It depends on whether you favor availability or protection. Fail-open allows requests when the limiter is down (favoring availability); fail-closed rejects them (favoring protection). For a social platform, fail-closed is often right because limiter failures tend to coincide with traffic spikes — exactly when protection is needed — so a brief outage beats a backend meltdown.

Design Problems

Designing a Rate Limiter

Q: How do you scale a rate limiter to 1 million requests per second?

A single Redis handles only about 100,000–200,000 ops/sec and each check is multiple ops, so shard the data across roughly 10 Redis instances. Sharding must be consistent — all of a client's requests must hit the same shard — so hash the client identifier with consistent hashing, or use Redis Cluster, which spreads keys across 16,384 hash slots automatically.

Control how many requests a client can make in a window — rejecting the excess with HTTP 429. A study in algorithms, atomic counters, and scaling writes across Redis shards.

Updated 2026-06-09

Designing a Rate Limiter

A rate limiter is a traffic controller for your API: it allows, say, 100 requests per minute from a client, then rejects the rest with HTTP 429 Too Many Requests. It prevents abuse, protects servers from traffic bursts, and enforces fair usage. We'll design a server-side, request-level limiter — clients can't be trusted to self-regulate.

1. Requirements

Functional: identify clients (by user ID, IP, or API key), limit requests by configurable rules (e.g. 100/min/user), and on breach return 429 with helpful headers (remaining, reset time).

Non-functional: add minimal latency (< 10ms per check), be highly available (eventual consistency across nodes is fine), and handle ~1M requests/sec across ~100M daily users.

2. Where does it live?

Options: a middleware in each service, a sidecar, or the API Gateway. The gateway is the common choice — centralized control, no extra network hop per request. The trade-off is that it only sees the HTTP request itself (URL, headers, IP), which is enough.

3. Identifying clients

The key you limit on determines how limits apply:

User ID (from a JWT in Authorization) — best for authenticated APIs.
IP address (from X-Forwarded-For) — for anonymous traffic, but watch NAT/proxies that share one IP.
API key (X-API-Key) — for developer APIs.

Real systems layer multiple rules and enforce the most restrictive: per-user, per-IP, global, and per-endpoint. If Alice is under her 1000/hr user limit but her IP hit 100/min, she's blocked.

4. The algorithms

Four production approaches, trading accuracy vs memory vs complexity:

Algorithm	Accuracy	Memory	Note
Fixed window	Low	Tiny (1 counter)	Boundary bursts
Sliding log	Perfect	High (all timestamps)	Exact but expensive
Sliding counter	Good	Tiny (2 counters)	Approximation
Token bucket	Good	Tiny (count + timestamp)	Handles bursts — our pick

Fixed window — count requests per fixed bucket (e.g. each minute), reset at the boundary. Dead simple, but a user can fire 100 at 12:00:59 and 100 at 12:01:00 — 200 in two seconds.
Sliding window log — store every request's timestamp, drop those older than the window, count the rest. Perfectly accurate, no boundary effect — but storing thousands of timestamps per user across millions of users blows up memory.
Sliding window counter — keep the current and previous window counters and weight them by how far into the current window you are (30% in → 70% of previous + 100% of current). Near-accurate with just two counters; the math is an approximation that assumes even traffic.
Token bucket — a bucket holds up to N tokens (the burst capacity) and refills at a steady rate. Each request spends one token; no token → reject. Handles both sustained load (refill) and bursts (capacity) with just (tokens, last_refill) per client.

We pick token bucket — the best balance of simplicity, memory, and real-world bursty traffic. Stripe uses it for the same reason.

5. Storing bucket state in Redis

Bucket state (tokens, last_refill) must be shared across all gateway instances — if Alice's 100 requests split across gateways, each sees only 50 and lets them through. A central Redis is the source of truth:

1. Request for "alice" hits Gateway A
2. HMGET alice:bucket tokens last_refill        ◀ read state
3. add tokens = (now - last_refill) * refill_rate, capped at capacity
4. if tokens >= 1: allow & decrement, else reject
5. write back state, EXPIRE alice:bucket 3600   ◀ auto-cleanup idle buckets

The race condition

If the read (HMGET) happens outside the update, two simultaneous requests both read the same count, both decide "allowed," and both write — letting 2 through when only 1 token existed. A MULTI/EXEC transaction doesn't fix it, because the read is still separate.

Fix: move the entire read-calculate-update into one atomic Lua script run inside Redis. Lua scripts are atomic, so the whole decision is race-free. This is the classic expand the atomic boundary to cover the read-modify-write pattern.

6. Responding with 429

Fail fast — reject excess immediately rather than queueing (queues consume memory, confuse clients into retrying, and make latency unpredictable). Return 429 with headers so well-behaved clients can back off:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60

7. Scaling to 1M requests/sec

A single Redis handles ~100–200K ops/sec, and each check is multiple ops — so one instance tops out well below 1M RPS. Shard the data across ~10 Redis instances.

Sharding must be consistent: all of a client's requests must hit the same shard, or their state splits and becomes useless. Hash the client identifier (user ID / IP / API key) with consistent hashing to pick a shard. In production, Redis Cluster does this for you — it spreads keys across 16,384 hash slots automatically.

8. High availability

Each shard is now critical. When one fails, you choose a failure mode:

Fail-open — allow requests when the limiter is down (favors availability).
Fail-closed — reject them (favors protection).

For a social platform, fail-closed is often right: limiter failures tend to coincide with traffic spikes, exactly when you need protection — a brief outage beats a backend meltdown. Better still, prevent failures with master-replica replication per shard and automatic failover (built into Redis Cluster). Add health and rate-limit-success monitoring on top.

9. Minimizing latency

Connection pooling — reuse persistent Redis connections to skip the TCP handshake (20–50ms) on every check. The single biggest win.
Geographic distribution — put gateways and Redis close to users; accept eventual consistency between regions for lower latency.
Lua scripting / pipelining batches ops, but usually unnecessary if pooling and geo-distribution are handled.

10. Hot keys

A single client generating tens of thousands of requests/sec can overwhelm one shard.

Legitimate high-volume clients — encourage client-side rate limiting (smooths traffic), allow request batching, offer premium tiers with higher limits.
Abusive traffic — auto-block repeat offenders by adding their IP/key to a blocklist (kept in Redis), and front the system with DDoS protection (Cloudflare, AWS Shield).

Design IP limits to account for shared IPs (corporate NAT, public WiFi) upfront — set higher IP limits and lean on authenticated user limits — rather than chasing hot keys after the fact.

Takeaways

Topic	Key idea
Placement	API Gateway — central, no extra hop
Algorithm	Token bucket — handles bursts with minimal state
Correctness	Atomic Lua script to kill the read-modify-write race
Scale	Shard Redis by client key (consistent hashing / Redis Cluster)
Availability	Fail-closed under load + master-replica failover
Latency	Connection pooling and geo-distribution

Design Problems

Designing a Rate Limiter

Control how many requests a client can make in a window — rejecting the excess with HTTP 429. A study in algorithms, atomic counters, and scaling writes across Redis shards.

Updated 2026-06-09

Designing a Rate Limiter

1. Requirements

Functional: identify clients (by user ID, IP, or API key), limit requests by configurable rules (e.g. 100/min/user), and on breach return 429 with helpful headers (remaining, reset time).

Non-functional: add minimal latency (< 10ms per check), be highly available (eventual consistency across nodes is fine), and handle ~1M requests/sec across ~100M daily users.

2. Where does it live?

3. Identifying clients

The key you limit on determines how limits apply:

User ID (from a JWT in Authorization) — best for authenticated APIs.
IP address (from X-Forwarded-For) — for anonymous traffic, but watch NAT/proxies that share one IP.
API key (X-API-Key) — for developer APIs.

Real systems layer multiple rules and enforce the most restrictive: per-user, per-IP, global, and per-endpoint. If Alice is under her 1000/hr user limit but her IP hit 100/min, she's blocked.

4. The algorithms

Four production approaches, trading accuracy vs memory vs complexity:

Algorithm	Accuracy	Memory	Note
Fixed window	Low	Tiny (1 counter)	Boundary bursts
Sliding log	Perfect	High (all timestamps)	Exact but expensive
Sliding counter	Good	Tiny (2 counters)	Approximation
Token bucket	Good	Tiny (count + timestamp)	Handles bursts — our pick

Fixed window — count requests per fixed bucket (e.g. each minute), reset at the boundary. Dead simple, but a user can fire 100 at 12:00:59 and 100 at 12:01:00 — 200 in two seconds.
Sliding window log — store every request's timestamp, drop those older than the window, count the rest. Perfectly accurate, no boundary effect — but storing thousands of timestamps per user across millions of users blows up memory.
Sliding window counter — keep the current and previous window counters and weight them by how far into the current window you are (30% in → 70% of previous + 100% of current). Near-accurate with just two counters; the math is an approximation that assumes even traffic.
Token bucket — a bucket holds up to N tokens (the burst capacity) and refills at a steady rate. Each request spends one token; no token → reject. Handles both sustained load (refill) and bursts (capacity) with just (tokens, last_refill) per client.

We pick token bucket — the best balance of simplicity, memory, and real-world bursty traffic. Stripe uses it for the same reason.

5. Storing bucket state in Redis

1. Request for "alice" hits Gateway A
2. HMGET alice:bucket tokens last_refill        ◀ read state
3. add tokens = (now - last_refill) * refill_rate, capped at capacity
4. if tokens >= 1: allow & decrement, else reject
5. write back state, EXPIRE alice:bucket 3600   ◀ auto-cleanup idle buckets

The race condition

6. Responding with 429

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60

7. Scaling to 1M requests/sec

A single Redis handles ~100–200K ops/sec, and each check is multiple ops — so one instance tops out well below 1M RPS. Shard the data across ~10 Redis instances.

8. High availability

Each shard is now critical. When one fails, you choose a failure mode:

Fail-open — allow requests when the limiter is down (favors availability).
Fail-closed — reject them (favors protection).

9. Minimizing latency

Connection pooling — reuse persistent Redis connections to skip the TCP handshake (20–50ms) on every check. The single biggest win.
Geographic distribution — put gateways and Redis close to users; accept eventual consistency between regions for lower latency.
Lua scripting / pipelining batches ops, but usually unnecessary if pooling and geo-distribution are handled.

10. Hot keys

A single client generating tens of thousands of requests/sec can overwhelm one shard.

Legitimate high-volume clients — encourage client-side rate limiting (smooths traffic), allow request batching, offer premium tiers with higher limits.
Abusive traffic — auto-block repeat offenders by adding their IP/key to a blocklist (kept in Redis), and front the system with DDoS protection (Cloudflare, AWS Shield).

Design IP limits to account for shared IPs (corporate NAT, public WiFi) upfront — set higher IP limits and lean on authenticated user limits — rather than chasing hot keys after the fact.

Takeaways

Topic	Key idea
Placement	API Gateway — central, no extra hop
Algorithm	Token bucket — handles bursts with minimal state
Correctness	Atomic Lua script to kill the read-modify-write race
Scale	Shard Redis by client key (consistent hashing / Redis Cluster)
Availability	Fail-closed under load + master-replica failover
Latency	Connection pooling and geo-distribution

Designing a Rate Limiter

Designing a Rate Limiter

1. Requirements

2. Where does it live?

3. Identifying clients

4. The algorithms

5. Storing bucket state in Redis

The race condition

6. Responding with 429

7. Scaling to 1M requests/sec

8. High availability

9. Minimizing latency

10. Hot keys

Takeaways

Practice this

Designing a Rate Limiter

Designing a Rate Limiter

1. Requirements

2. Where does it live?

3. Identifying clients

4. The algorithms

5. Storing bucket state in Redis

The race condition

6. Responding with 429

7. Scaling to 1M requests/sec

8. High availability

9. Minimizing latency

10. Hot keys

Takeaways

Practice this