Rate limiting is a critical component of any production API or service. It protects your system from abuse, ensures fair usage among consumers, and maintains service quality under load. This article explores the most common rate limiting algorithms and how to choose the right one for your use case.
Why Rate Limiting Matters
Rate limiting prevents a single client from consuming an unfair share of resources. Without it, a misbehaving client or an accidental infinite loop can degrade service for all users. Rate limiting also helps control infrastructure costs, protect downstream dependencies, and enforce usage tiers.
Token Bucket Algorithm
The token bucket is one of the most popular rate limiting algorithms. A bucket holds a fixed number of tokens. Tokens are added at a steady rate. Each request consumes one token. If the bucket is empty, the request is denied.
Capacity: 100 tokens
Refill rate: 10 tokens/second
Request at t=0:   100 tokens available -> success, 99 remaining
Request at t=0.5: 99 + 5 refilled = 104, capped at 100 -> success, 99 remaining
Request at t=1:   99 + 5 refilled = 104, capped at 100 -> success, 99 remaining
**Pros:** Allows bursts up to the bucket capacity. Memory efficient -- you only need to store the token count and last refill timestamp.
**Cons:** The burst behavior may not suit all use cases. With a large bucket, a client can send an entire day's worth of requests in a few seconds.
The token bucket is used by Amazon (AWS API Gateway), Stripe, and GitHub.
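To make the mechanics concrete, here is a minimal single-process sketch in Python; the class name and the use of `time.monotonic()` are illustrative choices, not a reference implementation:

```python
import time

class TokenBucket:
    """Single-process token bucket: continuous refill, bursts up to capacity."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity            # maximum tokens the bucket can hold
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = capacity              # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # each request consumes one token
            return True
        return False
```

With `TokenBucket(capacity=100, refill_rate=10)`, this reproduces the trace above: `allow()` returns `False` only once the burst budget is exhausted faster than the refill can replace it.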
Leaky Bucket Algorithm
The leaky bucket is similar to the token bucket but processes requests at a fixed rate. Think of a bucket with a hole in the bottom. Requests are poured into the top. They leak out of the bottom at a constant rate. If the bucket overflows, excess requests are rejected.
**Pros:** Smooths traffic into a constant output rate with no bursts, giving a predictable processing rate regardless of arrival pattern.
**Cons:** Cannot handle any bursts, even legitimate ones. A client that occasionally needs to send 10 requests at once will be penalized.
Leaky bucket is ideal for scenarios where a steady processing rate is critical, such as database connection pooling or queue workers.
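A sketch of the meter-style variant, under the same single-process assumptions as the token bucket above; a queue-based variant would instead buffer requests and drain them at the leak rate:

```python
import time

class LeakyBucket:
    """Meter-style leaky bucket: the level drains at a fixed rate, overflow is rejected."""

    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity        # how much can queue up before overflow
        self.leak_rate = leak_rate      # requests drained per second
        self.level = 0.0
        self.last_drain = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain for the elapsed time, never below empty.
        self.level = max(0.0, self.level - (now - self.last_drain) * self.leak_rate)
        self.last_drain = now
        if self.level + 1 <= self.capacity:
            self.level += 1             # the request fits in the bucket
            return True
        return False                    # the bucket would overflow
```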
Fixed Window Counter
Divide time into fixed windows (e.g., 1 minute). Count requests in each window. If the count exceeds the limit, reject further requests until the next window.
Limit: 100 requests per minute
Window: 00:00:00 to 00:01:00
95 requests at 00:00:45 -> allowed
6 requests at 00:00:55 -> 5 allowed, 1 rejected (exceeds 100)
Window resets at 00:01:00
**Pros:** Extremely simple to implement. Only requires a counter per window per client.
**Cons:** The "boundary problem." At the edge of a window, a client can send 100 requests at 00:00:59 and another 100 at 00:01:01, effectively 200 requests in 2 seconds. This can overwhelm your system.
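A sketch of a fixed window counter with illustrative names; a production version would also evict counters from past windows rather than let the map grow:

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """One counter per (client, window index); a new window starts with a fresh count."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_id: str) -> bool:
        window_index = int(time.time() // self.window)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False                # limit reached for this window
        self.counts[key] += 1
        return True
```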
Sliding Window Log
Instead of fixed windows, track a timestamp for each request. Count requests within the last N seconds.
**Pros:** Eliminates the boundary problem. Perfectly accurate rate limiting.
**Cons:** Memory intensive. You need to store a timestamp for every request. For high-traffic APIs, this can require significant storage.
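A sketch using a deque as the log, assuming one limiter instance per client:

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiter: keeps one timestamp per request inside the window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()              # timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have aged out of the window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```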
Sliding Window Counter
A hybrid approach combining fixed windows with sliding precision. Track counters for the current and previous windows, and approximate the sliding-window count by weighting the previous window's count by how much of it still overlaps the sliding window.
Limit: 100 requests per minute
Current 1-minute window: 80 requests
Previous 1-minute window: 40 requests
At 30 seconds into the current window:
Estimated count = 40 * ((60 - 30) / 60) + 80 = 100 -> at the limit; the next request is rejected
**Pros:** Smooths out the boundary problem with minimal memory (just two counters per client). Good enough for most use cases.
**Cons:** Not perfectly accurate. The approximation error varies but is typically acceptable for rate limiting purposes.
This is the algorithm used by Cloudflare and many CDN-based rate limiters.
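A sketch of the two-counter approximation; the in-memory rollover handling is a simplifying assumption (a real deployment would keep per-client state in a shared store):

```python
import time

class SlidingWindowCounter:
    """Approximate limiter: current count plus the overlap-weighted previous count."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.window_index = int(time.time() // window_seconds)
        self.current = 0
        self.previous = 0

    def allow(self) -> bool:
        now = time.time()
        index = int(now // self.window)
        if index != self.window_index:
            # Roll forward: the old current becomes previous (zero if windows were skipped).
            self.previous = self.current if index == self.window_index + 1 else 0
            self.current = 0
            self.window_index = index
        elapsed_fraction = (now % self.window) / self.window
        # Weight the previous window by how much of it is still inside the sliding window.
        estimated = self.previous * (1 - elapsed_fraction) + self.current
        if estimated >= self.limit:
            return False
        self.current += 1
        return True
```

Plugging in the worked example (previous = 40, current = 80, 30 seconds elapsed) gives 40 * 0.5 + 80 = 100, so the next request is rejected.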
Distributed Rate Limiting
In a distributed system with multiple application instances, rate limiting becomes more complex. If each instance tracks limits independently, the combined traffic across instances can exceed the global limit.
**Centralized approach:** Use a shared data store like Redis to maintain counters. The `INCR` and `EXPIRE` commands make Redis ideal for rate limiting. This gives global accuracy but adds latency for every request.
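A minimal sketch of the centralized approach using the redis-py client; the key scheme and fixed-window granularity are illustrative assumptions:

```python
import time

import redis  # redis-py client: pip install redis

def is_allowed(r: redis.Redis, client_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window counter shared across all application instances."""
    key = f"ratelimit:{client_id}:{int(time.time() // window)}"
    count = r.incr(key)        # atomic increment; creates the key at 1 if absent
    if count == 1:
        r.expire(key, window)  # start the TTL on the first request of the window
    return count <= limit
```

Because `INCR` and `EXPIRE` are separate commands, a crash between them can leave a key without a TTL; wrapping both in a Lua script or a pipeline closes that gap.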
**Local approach:** Each instance independently tracks its share of the limit (e.g., if there are 3 instances and the limit is 300, each instance limits to 100). This is fast but can exceed the limit if traffic distribution is uneven.
**Hybrid approach:** Use local rate limiting for fast checks and periodic synchronization to a central store. This balances performance and accuracy.
Response Headers
Tell clients about their rate limit status through response headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1640995200
When a client exceeds the limit, return `429 Too Many Requests` with a `Retry-After` header:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
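A small framework-agnostic helper for building these headers; the function name and signature are hypothetical:

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Build conventional X-RateLimit-* headers; add Retry-After when exhausted."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),  # Unix time when the window resets
    }
    if remaining <= 0:
        # Tell the client how many seconds to wait before retrying.
        headers["Retry-After"] = str(max(1, reset_epoch - int(time.time())))
    return headers
```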
Choosing the Right Algorithm

Start with the token bucket for general-purpose API limiting: it tolerates reasonable bursts and stores only two values per client. Choose the leaky bucket when a steady processing rate matters more than burst tolerance. Use a fixed window counter only when simplicity outweighs the boundary problem, a sliding window log when exact accuracy justifies storing a timestamp per request, and a sliding window counter when you want near-exact limiting with constant memory.
Summary
Rate limiting is essential for production APIs. Start with the token bucket algorithm for general-purpose rate limiting. Use Redis for distributed scenarios. Always return descriptive headers so clients can adjust their behavior. Monitor your rate limiting effectiveness and adjust thresholds based on real traffic patterns.