Design Problems
Designing a Rate Limiter
Control how many requests a client can make in a window — rejecting the excess with HTTP 429. A study in algorithms, atomic counters, and scaling writes across Redis shards.
Updated 2026-06-09
Designing a Rate Limiter
A rate limiter is a traffic controller for your API: it allows, say, 100 requests per minute from a client, then rejects the rest with HTTP 429 Too Many Requests. It prevents abuse, protects servers from traffic bursts, and enforces fair usage. We'll design a server-side, request-level limiter — clients can't be trusted to self-regulate.
1. Requirements
Functional: identify clients (by user ID, IP, or API key), limit requests by configurable rules (e.g. 100/min/user), and on breach return 429 with helpful headers (remaining, reset time).
Non-functional: add minimal latency (< 10ms per check), be highly available
(eventual consistency across nodes is fine), and handle ~1M requests/sec across ~100M
daily users.
2. Where does it live?
Options: a middleware in each service, a sidecar, or the API Gateway. The gateway is the common choice — centralized control, no extra network hop per request. The trade-off is that it only sees the HTTP request itself (URL, headers, IP), which is enough.
3. Identifying clients
The key you limit on determines how limits apply:
- User ID (from a JWT in
Authorization) — best for authenticated APIs. - IP address (from
X-Forwarded-For) — for anonymous traffic, but watch NAT/proxies that share one IP. - API key (
X-API-Key) — for developer APIs.
Real systems layer multiple rules and enforce the most restrictive: per-user, per-IP, global, and per-endpoint. If Alice is under her 1000/hr user limit but her IP hit 100/min, she's blocked.
4. The algorithms
Four production approaches, trading accuracy vs memory vs complexity:
| Algorithm | Accuracy | Memory | Note |
|---|---|---|---|
| Fixed window | Low | Tiny (1 counter) | Boundary bursts |
| Sliding log | Perfect | High (all timestamps) | Exact but expensive |
| Sliding counter | Good | Tiny (2 counters) | Approximation |
| Token bucket | Good | Tiny (count + timestamp) | Handles bursts — our pick |
- Fixed window — count requests per fixed bucket (e.g. each minute), reset at the
boundary. Dead simple, but a user can fire 100 at
12:00:59and 100 at12:01:00— 200 in two seconds. - Sliding window log — store every request's timestamp, drop those older than the window, count the rest. Perfectly accurate, no boundary effect — but storing thousands of timestamps per user across millions of users blows up memory.
- Sliding window counter — keep the current and previous window counters and weight them by how far into the current window you are (30% in → 70% of previous + 100% of current). Near-accurate with just two counters; the math is an approximation that assumes even traffic.
- Token bucket — a bucket holds up to N tokens (the burst capacity) and refills
at a steady rate. Each request spends one token; no token → reject. Handles both
sustained load (refill) and bursts (capacity) with just
(tokens, last_refill)per client.
We pick token bucket — the best balance of simplicity, memory, and real-world bursty traffic. Stripe uses it for the same reason.
5. Storing bucket state in Redis
Bucket state (tokens, last_refill) must be shared across all gateway
instances — if Alice's 100 requests split across gateways, each sees only 50 and
lets them through. A central Redis is the source of truth:
1. Request for "alice" hits Gateway A
2. HMGET alice:bucket tokens last_refill ← read state
3. add tokens = (now - last_refill) * refill_rate, capped at capacity
4. if tokens >= 1: allow & decrement, else reject
5. write back state, EXPIRE alice:bucket 3600 ← auto-cleanup idle buckets
The race condition
If the read (HMGET) happens outside the update, two simultaneous requests both
read the same count, both decide "allowed," and both write — letting 2 through when
only 1 token existed. A MULTI/EXEC transaction doesn't fix it, because the read is
still separate.
Fix: move the entire read-calculate-update into one atomic Lua script run inside Redis. Lua scripts are atomic, so the whole decision is race-free. This is the classic expand the atomic boundary to cover the read-modify-write pattern.
6. Responding with 429
Fail fast — reject excess immediately rather than queueing (queues consume memory, confuse clients into retrying, and make latency unpredictable). Return 429 with headers so well-behaved clients can back off:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60
7. Scaling to 1M requests/sec
A single Redis handles ~100–200K ops/sec, and each check is multiple ops — so one instance tops out well below 1M RPS. Shard the data across ~10 Redis instances.
Sharding must be consistent: all of a client's requests must hit the same shard, or their state splits and becomes useless. Hash the client identifier (user ID / IP / API key) with consistent hashing to pick a shard. In production, Redis Cluster does this for you — it spreads keys across 16,384 hash slots automatically.
8. High availability
Each shard is now critical. When one fails, you choose a failure mode:
- Fail-open — allow requests when the limiter is down (favors availability).
- Fail-closed — reject them (favors protection).
For a social platform, fail-closed is often right: limiter failures tend to coincide with traffic spikes, exactly when you need protection — a brief outage beats a backend meltdown. Better still, prevent failures with master-replica replication per shard and automatic failover (built into Redis Cluster). Add health and rate-limit-success monitoring on top.
9. Minimizing latency
- Connection pooling — reuse persistent Redis connections to skip the TCP handshake (20–50ms) on every check. The single biggest win.
- Geographic distribution — put gateways and Redis close to users; accept eventual consistency between regions for lower latency.
- Lua scripting / pipelining batches ops, but usually unnecessary if pooling and geo-distribution are handled.
10. Hot keys
A single client generating tens of thousands of requests/sec can overwhelm one shard.
- Legitimate high-volume clients — encourage client-side rate limiting (smooths traffic), allow request batching, offer premium tiers with higher limits.
- Abusive traffic — auto-block repeat offenders by adding their IP/key to a blocklist (kept in Redis), and front the system with DDoS protection (Cloudflare, AWS Shield).
Design IP limits to account for shared IPs (corporate NAT, public WiFi) upfront — set higher IP limits and lean on authenticated user limits — rather than chasing hot keys after the fact.
Takeaways
| Topic | Key idea |
|---|---|
| Placement | API Gateway — central, no extra hop |
| Algorithm | Token bucket — handles bursts with minimal state |
| Correctness | Atomic Lua script to kill the read-modify-write race |
| Scale | Shard Redis by client key (consistent hashing / Redis Cluster) |
| Availability | Fail-closed under load + master-replica failover |
| Latency | Connection pooling and geo-distribution |
