## Why Rate Limiting Matters
API rate limiting protects backend services from abuse, ensures fair resource distribution, and prevents cascading failures. Without rate limiting, a single aggressive client can degrade the experience for all other users or even crash the service entirely. For public APIs, rate limiting is a fundamental security control that mitigates DDoS attacks, credential stuffing, and web scraping.
## Rate Limiting Algorithms
### Token Bucket
The token bucket algorithm is the most widely used approach. A bucket holds a fixed number of tokens, and each request consumes one token. Tokens are replenished at a steady rate. If the bucket is empty, the request is denied.
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start with a full bucket
        self.refill_rate = refill_rate  # tokens added per second
        self.last_refill = time.time()

    def allow_request(self):
        # Refill based on the time elapsed since the last check.
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```
The token bucket allows short bursts up to `capacity` while enforcing a long-term average rate. This makes it ideal for APIs where occasional spikes are acceptable.
### Leaky Bucket
The leaky bucket algorithm enforces a strict processing rate. Incoming requests fill a queue, and a worker processes them at a fixed rate. If the queue is full, new requests are dropped.
This approach smooths out traffic perfectly but does not handle bursts well. It is best suited for downstream systems that cannot tolerate spikes, such as legacy databases or third-party APIs with strict rate contracts.
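The idea can be sketched as a bounded queue drained at a fixed rate. This is an illustrative minimal implementation (the `LeakyBucket` name and its parameters are not from any particular library); a production version would typically process queued requests asynchronously rather than merely counting them:

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, rate, queue_size):
        self.rate = rate              # requests drained per second
        self.queue_size = queue_size  # maximum backlog before dropping
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        # Drain the queue at the fixed processing rate.
        now = time.time()
        drained = int((now - self.last_leak) * self.rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def allow_request(self):
        self._leak()
        if len(self.queue) < self.queue_size:
            self.queue.append(time.time())
            return True
        return False  # queue full: drop the request
```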
### Sliding Window Log

The sliding window log algorithm maintains a timestamped log of recent requests within a time window. When a new request arrives, entries older than the window are pruned. If the count of remaining entries has already reached the limit, the request is denied.
```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_ms=1000):
        self.limit = limit
        self.window_ms = window_ms
        self.log = deque()  # timestamps (ms) of accepted requests

    def allow_request(self):
        now = time.time() * 1000
        cutoff = now - self.window_ms
        # Prune timestamps that have fallen out of the window.
        while self.log and self.log[0] < cutoff:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```
Sliding window gives precise per-user limits and avoids the boundary spikes of fixed-window counters, at the cost of storing request timestamps in memory.
## Implementation Patterns
### Middleware Pattern (Node.js / Express)
```javascript
const rateLimit = require('express-rate-limit');

const limiter = rateLimit({
  windowMs: 60 * 1000,   // 1-minute window
  max: 100,              // limit each key to 100 requests per window
  standardHeaders: true, // send RateLimit-* headers
  legacyHeaders: false,  // disable legacy X-RateLimit-* headers
  keyGenerator: (req) => req.ip,
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests',
      // resetTime is a Date; report seconds remaining, not an absolute timestamp
      retryAfter: Math.ceil((req.rateLimit.resetTime - Date.now()) / 1000)
    });
  }
});

app.use('/api/', limiter);
```
### Distributed Rate Limiting with Redis
For applications running across multiple instances, in-memory rate limiting is insufficient. Use Redis with atomic operations:
```lua
-- Redis Lua script for a sliding window counter
local key = KEYS[1]
local now = tonumber(ARGV[1])     -- current time in ms
local window = tonumber(ARGV[2])  -- window length in ms
local limit = tonumber(ARGV[3])

-- Remove entries older than the window.
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
  -- Random suffix keeps members unique when timestamps collide.
  redis.call('ZADD', key, now, now .. ':' .. math.random())
  -- EXPIRE takes whole seconds; round up so the key outlives the window.
  redis.call('EXPIRE', key, math.ceil(window / 1000))
  return 1
end
return 0
```
## HTTP Response Headers
Always communicate rate limits to clients via standard headers:
| Header | Purpose |
|--------|---------|
| `X-RateLimit-Limit` | Maximum requests per window |
| `X-RateLimit-Remaining` | Requests left in current window |
| `X-RateLimit-Reset` | Unix timestamp when the window resets |
| `Retry-After` | Seconds to wait before retrying (on 429) |
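In application code, these headers can be attached to every response. A framework-agnostic sketch (the `rate_limit_headers` helper and its parameters are illustrative, not from any specific library):

```python
import math
import time

def rate_limit_headers(limit, remaining, reset_ts, allowed):
    """Build the standard rate-limit headers for an HTTP response."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_ts)),  # Unix timestamp
    }
    if not allowed:
        # On 429, tell the client how many seconds to back off.
        headers["Retry-After"] = str(max(1, math.ceil(reset_ts - time.time())))
    return headers
```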
## Tiered Rate Limiting
Apply different limits based on client tiers:
| Tier | Limit | Window |
|------|-------|--------|
| Free | 10 req/s | 1 second |
| Pro | 100 req/s | 1 second |
| Enterprise | 1000 req/s | 1 second |
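Tier lookup can be as simple as a mapping from plan name to limiter parameters, resolved per request. A minimal sketch (rates match the table above; the `TIER_LIMITS` name and the burst values, set equal to the per-second rate, are assumptions):

```python
# Requests per second and burst capacity per tier.
TIER_LIMITS = {
    "free": {"rate": 10, "burst": 10},
    "pro": {"rate": 100, "burst": 100},
    "enterprise": {"rate": 1000, "burst": 1000},
}

def limits_for(client_tier):
    # Unknown or missing tiers fall back to the most restrictive limits.
    return TIER_LIMITS.get(client_tier, TIER_LIMITS["free"])
```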
## Common Pitfalls

- **Non-atomic check-then-update.** Reading a counter and incrementing it in separate round trips lets concurrent requests race past the limit; keep the check and the update atomic, as the Redis Lua script above does.
- **Keying on `req.ip` behind a proxy.** Behind a load balancer, every request appears to come from the proxy's address; derive the key from a validated forwarded address or an API key instead.
- **Fixed-window boundary bursts.** A fixed-window counter can admit up to twice the limit straddling a window boundary; prefer sliding-window or token-bucket approaches where this matters.
- **Silent rejections.** Returning 429 without `Retry-After` or `X-RateLimit-*` headers leaves well-behaved clients guessing and retrying aggressively.
## Summary
Choose token bucket for general-purpose APIs, sliding window for precise per-user limits, and leaky bucket for downstream protection. Always distribute limit state via Redis when running multiple service instances, and communicate limits through standard HTTP headers so clients can adapt their behavior.