It’s 3 AM. Your phone buzzes. “API is down.” You check the logs and see it: someone is hammering your endpoint with 50,000 requests per second. Your database is screaming, response times are through the roof, and legitimate users can’t access your service.
You need a rate limiter. But not just any rate limiter - you need one that’s smart enough to adapt to changing conditions without blocking your paying customers.
Let’s discuss how to build one.
The Problem With Fixed Rate Limits
Most rate limiters are dumb. They follow rigid rules:
- Free users: 100 requests per hour
- Paid users: 1,000 requests per hour
- No exceptions, no flexibility
This works until it doesn’t. Here’s what breaks:
Scenario 1: Off-Peak Waste
It’s 2 AM. Your servers are at 15% CPU. You have massive spare capacity. But your rate limiter is still enforcing the same strict limits as peak hours. You’re turning away requests you could easily handle.
Scenario 2: Peak Hour Meltdown
It’s Black Friday. Traffic is 10x normal. Your database is struggling at 90% CPU. But your rate limiter is still allowing the same request rates. Your servers start timing out, and the whole system cascades into failure.
Scenario 3: One Bad Actor Ruins Everything
A misconfigured client starts retrying failed requests aggressively. They’re not malicious, just buggy. Your fixed rate limiter treats them like any other user. They burn through their quota, but not before they’ve contributed to system-wide slowdown.
We need something smarter.
What Makes a Rate Limiter “Dynamic”?
A dynamic rate limiter adjusts its behavior in real-time based on:
- System Health: CPU, memory, database connections, response times
- User Behavior: Request patterns, burst detection, historical reputation
- Time Patterns: Business hours vs. nights, weekdays vs. weekends
- Endpoint Cost: Lightweight reads vs. expensive writes or analytics
Think of it like a smart traffic light that changes timing based on actual traffic flow, not just a fixed schedule.
The Core Architecture
flowchart TD
A[Incoming Request] --> B{Rate Limiter}
B --> C[Check User Tier]
B --> D[Monitor System Load]
B --> E[Analyze Request Pattern]
C --> F{Calculate Dynamic Limit}
D --> F
E --> F
F --> G{Within Limit?}
G -->|Yes| H[Allow Request]
G -->|No| I[429 Too Many Requests]
H --> J[Forward to API]
I --> K[Return Retry-After]
J --> L[Update Counters]
K --> L
style B fill:#fef3c7,stroke:#f59e0b,stroke-width:3px
style F fill:#fef3c7,stroke:#f59e0b,stroke-width:3px
style H fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style I fill:#fee2e2,stroke:#dc2626,stroke-width:2px
The rate limiter sits between your users and your API. Every request flows through it. But unlike a simple counter, it’s constantly adjusting the limits based on real-time conditions.
Choosing Your Algorithm
Before we get into the dynamic parts, you need a solid base algorithm. Here are your options:
1. Token Bucket (Most Popular)
Imagine a bucket that holds tokens. Each request consumes one token. Tokens refill at a steady rate. When the bucket is empty, requests are rejected.
Why it works: Allows bursts of traffic (use saved-up tokens) while maintaining an average rate over time.
Real-world use: AWS API Gateway, Stripe API
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
Pro: Smooths out traffic, allows controlled bursts
Con: Can still allow 2x capacity if bucket is full when burst hits
Why the 2x issue happens: Imagine your bucket holds 100 tokens and refills at 100 tokens/second. If a user stays quiet for a while, the bucket fills completely to 100 tokens. Then suddenly they send a burst of 200 requests in one second. Here’s what happens:
- First 100 requests: Use all accumulated tokens (bucket now empty)
- During that same second: Bucket refills 100 new tokens
- Next 100 requests: Use the newly refilled tokens
Result: 200 requests get through in one second, even though your “rate limit” was 100/second, because the burst capacity and the continuous refill combine. This isn’t necessarily bad - it allows legitimate bursts - but it’s something to be aware of if you need a strict upper bound.
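To see the math play out, here’s a quick, illustrative simulation using the TokenBucket class above (a busy-wait loop, purely for demonstration):

import time

bucket = TokenBucket(capacity=100, refill_rate=100)  # "100 requests/second"
# The bucket starts full, as if the user had been quiet for a while.

allowed = 0
start = time.time()
while time.time() - start < 1.0:  # hammer it for exactly one second
    if bucket.allow_request():
        allowed += 1

print(allowed)  # ~200: 100 stored tokens + ~100 refilled during the second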
2. Leaky Bucket
Requests enter a queue (the bucket). They’re processed at a constant rate. If the queue fills up, new requests are rejected.
Why it works: Enforces absolute rate limits. Output is always smooth.
Real-world use: Network traffic shaping, SQS queues
Pro: Perfect rate smoothing, no bursts ever
Con: Adds latency (requests wait in queue), can delay legitimate traffic
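The other algorithms get code in this post, so for parity, here’s a minimal in-memory sketch of the idea (admission-gating only; a real implementation would drain the queue with a worker process):

import time
from collections import deque

class LeakyBucket:
    """Minimal sketch: requests queue up and leak out at a fixed rate."""
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.time()

    def allow_request(self):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(time.time())  # request waits its turn
            return True
        return False  # bucket full, reject

    def _leak(self):
        # Drain requests that would have been processed since the last check
        now = time.time()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now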
3. Sliding Window
Tracks requests over a rolling time window. More accurate than fixed windows because it accounts for requests at window boundaries.
Why it works: No edge case where you can sneak in 2x requests at window boundaries.
# Redis implementation using sorted sets for sliding window
# Start atomic transaction
MULTI
# Add current request with timestamp as score
# Each request is stored as a member, timestamp determines order
ZADD rate_limit:{user_id} {current_timestamp} {request_id}
# Clean up: Remove requests older than the window
# For example, if window is 60 seconds, remove anything older
ZREMRANGEBYSCORE rate_limit:{user_id} 0 {current_timestamp - window_size}
# Count remaining requests in current window
# This count is checked against your rate limit
ZCARD rate_limit:{user_id}
# Set TTL to prevent memory leaks
# Key auto-expires if no more requests come in
EXPIRE rate_limit:{user_id} {window_size}
# Execute all commands atomically
EXEC
Pro: Most accurate, fair distribution
Con: Requires more memory (stores timestamps)
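If you’d rather drive this from application code than raw commands, here’s a sketch using the redis-py client (connection setup and key naming are assumptions):

import time
import uuid
import redis

r = redis.Redis()  # assumes a local Redis; adjust for your setup

def sliding_window_allow(user_id, limit, window_seconds):
    """Mirror of the Redis commands above, via a redis-py pipeline (MULTI/EXEC)."""
    key = f"rate_limit:{user_id}"
    now = time.time()
    pipe = r.pipeline()  # transactional by default, so this runs atomically
    pipe.zadd(key, {str(uuid.uuid4()): now})              # record this request
    pipe.zremrangebyscore(key, 0, now - window_seconds)   # drop expired entries
    pipe.zcard(key)                                       # count what's left
    pipe.expire(key, window_seconds)                      # auto-clean idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit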
4. Fixed Window (Simplest)
Count requests in fixed time windows (e.g., 1 minute). Reset counter when window expires.
Why it works: Dead simple, minimal memory.
Pro: Easy to implement and understand
Con: Can allow 2x traffic at window boundaries (100 requests at 12:59:59, 100 more at 13:00:00)
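For completeness, a minimal sketch with redis-py (assuming a client `r`):

import time

def fixed_window_allow(r, user_id, limit, window_seconds):
    """Minimal fixed-window counter: one Redis key per user per window."""
    window = int(time.time() // window_seconds)  # current window number
    key = f"fixed:{user_id}:{window}"
    count = r.incr(key)                          # atomic increment
    if count == 1:
        r.expire(key, window_seconds)            # clean up after the window ends
    return count <= limit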
My recommendation: Start with Token Bucket. It’s the best balance between burst handling and long-term rate control.
Making It Distributed
Your rate limiter needs to work across multiple API servers. The naive approach of in-memory counters doesn’t work - each server would have its own count, letting users bypass limits by hitting different servers.
Solution: Redis.
graph TB
subgraph "API Servers"
A1[Server 1]
A2[Server 2]
A3[Server 3]
end
subgraph "Redis Cluster"
R1[Redis Master]
R2[Redis Replica 1]
R3[Redis Replica 2]
end
subgraph "Data Stored"
D1[User Counters]
D2[Rate Limits]
D3[System Metrics]
end
A1 --> R1
A2 --> R1
A3 --> R1
R1 --> R2
R1 --> R3
R1 --> D1
R1 --> D2
R1 --> D3
style R1 fill:#fae8ff,stroke:#a855f7,stroke-width:3px
style A1 fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style A2 fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style A3 fill:#dcfce7,stroke:#16a34a,stroke-width:2px
Redis becomes your single source of truth. All servers read/write to the same counters.
Why Redis?
- Fast: Sub-millisecond operations
- Atomic: INCR command prevents race conditions
- Built-in expiry: TTL automatically cleans up old counters
- Widely available: Every cloud provider has managed Redis
Basic Redis implementation:
-- Lua script for atomic rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('GET', key)
if current and tonumber(current) >= limit then
    return 0 -- Rate limit exceeded
end

local count = redis.call('INCR', key)
if count == 1 then
    -- Set the TTL only on the first request in a window; calling EXPIRE on
    -- every request would keep pushing the window forward and never reset it
    redis.call('EXPIRE', key, window)
end
return 1 -- Request allowed
Call this from your API server:
def is_allowed(user_id, limit, window_seconds):
    key = f"rate_limit:{user_id}"
    result = redis.eval(
        lua_script,
        1,  # number of keys
        key,
        limit,
        window_seconds
    )
    return result == 1
Now The Dynamic Part: Adapting to System Load
Here’s where it gets interesting. Instead of fixed limits, you adjust based on real-time metrics.
Strategy 1: Load-Based Adjustment
Monitor your system metrics and scale limits accordingly.
def get_dynamic_limit(base_limit, user_tier):
    # Get current system metrics
    cpu_usage = get_cpu_usage()                  # 0-100
    memory_usage = get_memory_usage()            # 0-100
    db_connections = get_db_connections()        # current/max, 0-1
    response_time_p95 = get_response_time_p95()  # milliseconds

    # Calculate a health multiplier (below 1.0 shrinks limits, above 1.0 grows them)
    health = 1.0
    if cpu_usage > 80:
        health *= 0.7  # Reduce limits by 30%
    elif cpu_usage > 60:
        health *= 0.9
    elif cpu_usage < 40:
        health *= 1.2  # Increase limits by 20%

    if memory_usage > 85:
        health *= 0.6
    if db_connections > 0.9:
        health *= 0.5
    if response_time_p95 > 1000:  # Over 1 second
        health *= 0.7

    # Apply user tier multiplier
    tier_multiplier = {
        'free': 1.0,
        'basic': 2.0,
        'premium': 5.0,
        'enterprise': 20.0
    }.get(user_tier, 1.0)

    # Calculate final limit
    dynamic_limit = int(base_limit * tier_multiplier * health)

    # Never go below minimum
    return max(dynamic_limit, 10)
What this does: During low load (2 AM, low traffic), limits increase by 20%. During high load (peak hours, struggling servers), limits drop by 30-70% to protect the system.
Strategy 2: User Behavior Analysis
Not all users are equal. Track behavior and adjust accordingly.
class UserReputation:
    def __init__(self, user_id):
        self.user_id = user_id
        self.score = 100  # Start neutral

    def update(self, metrics):
        # Positive signals
        if metrics['error_rate'] < 0.01:  # Less than 1% errors
            self.score += 1
        if metrics['retry_count'] < 3:
            self.score += 1

        # Negative signals
        if metrics['error_rate'] > 0.1:  # More than 10% errors
            self.score -= 10
        if metrics['retry_count'] > 10:  # Aggressive retries
            self.score -= 20
        if metrics['burst_detected']:  # Sudden spike
            self.score -= 5

        # Keep score in bounds
        self.score = max(0, min(200, self.score))

    def get_multiplier(self):
        # Score 0-49: Suspicious, reduce limits to 0.5x
        # Score 50-150: Normal
        # Score 151-200: Trusted, allow 1.5x
        if self.score < 50:
            return 0.5
        elif self.score > 150:
            return 1.5
        return 1.0
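To make reputation useful across multiple API servers, you’d persist the score somewhere shared. A hypothetical Redis-backed sketch (the key names and TTL here are my own choices):

def get_user_reputation_multiplier(r, user_id):
    # Read the shared score; default to neutral for unknown users
    score = int(r.get(f"reputation:{user_id}") or 100)
    if score < 50:
        return 0.5   # suspicious
    elif score > 150:
        return 1.5   # trusted
    return 1.0

def save_reputation(r, rep):
    # Keep scores for 30 days of inactivity, then forget them
    r.setex(f"reputation:{rep.user_id}", 30 * 24 * 3600, rep.score)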
Strategy 3: Time-Based Patterns
Different times need different limits.
from datetime import datetime

def get_time_multiplier():
    now = datetime.now()
    current_hour = now.hour
    current_day = now.weekday()  # 0 = Monday

    # Business hours (9 AM - 5 PM, weekdays): Normal
    # Off hours: Increase limits
    # Peak hours (12-2 PM): Decrease limits
    if 9 <= current_hour <= 17 and current_day < 5:
        # Business hours
        if 12 <= current_hour <= 14:
            return 0.9  # Peak lunch time
        return 1.0
    else:
        # Off-peak
        if 0 <= current_hour <= 6:
            return 1.3  # Nighttime, lots of spare capacity
        return 1.1
Strategy 4: Endpoint-Specific Limits
Not all endpoints cost the same.
ENDPOINT_COSTS = {
    '/api/users': {
        'cost': 1,    # Lightweight read
        'limit': 1000
    },
    '/api/search': {
        'cost': 5,    # Elasticsearch query
        'limit': 200
    },
    '/api/reports': {
        'cost': 50,   # Heavy analytics
        'limit': 20
    },
    '/api/upload': {
        'cost': 100,  # File processing
        'limit': 5
    }
}

def check_endpoint_limit(user_id, endpoint):
    config = ENDPOINT_COSTS.get(endpoint, {'cost': 1, 'limit': 100})

    # Deduct cost from user's budget
    budget_key = f"budget:{user_id}"
    # Redis returns bytes, so convert before comparing
    current_budget = int(redis.get(budget_key) or 1000)

    if current_budget >= config['cost']:
        redis.decrby(budget_key, config['cost'])
        return True
    return False
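One gap in the sketch above: the budget is seeded at 1,000 but never refills. A hypothetical variant that resets it each window by leaning on key expiry:

def check_endpoint_limit_with_refill(r, user_id, endpoint, window_seconds=3600):
    """Variant of check_endpoint_limit that refills the budget each window."""
    config = ENDPOINT_COSTS.get(endpoint, {'cost': 1, 'limit': 100})
    budget_key = f"budget:{user_id}"

    # Seed a fresh budget only if no key exists (nx=True). When the TTL
    # expires, the next request starts a new window with a full budget.
    r.set(budget_key, 1000, ex=window_seconds, nx=True)

    remaining = int(r.get(budget_key) or 0)
    if remaining >= config['cost']:
        r.decrby(budget_key, config['cost'])
        return True
    return False

There’s still a read-modify-write race between the GET and the DECRBY; the Lua-script approach covered in the pitfalls section below closes it.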
Putting It All Together
Here’s a complete dynamic rate limiter:
import psutil

class DynamicRateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client

    def allow_request(self, user_id, endpoint, user_tier):
        # Get base limit for endpoint
        base_limit = self.get_base_limit(endpoint)

        # Apply all multipliers
        system_health = self.get_system_health_multiplier()
        user_reputation = self.get_user_reputation_multiplier(user_id)
        time_factor = self.get_time_multiplier()
        tier_factor = self.get_tier_multiplier(user_tier)

        # Calculate final limit
        final_limit = int(
            base_limit *
            system_health *
            user_reputation *
            time_factor *
            tier_factor
        )

        # Check against Redis
        key = f"rate:{user_id}:{endpoint}"
        current = self.redis.incr(key)
        if current == 1:
            self.redis.expire(key, 60)  # 1 minute window

        if current <= final_limit:
            # Log for analytics
            self.log_request(user_id, endpoint, allowed=True)
            return True, final_limit, current
        else:
            self.log_request(user_id, endpoint, allowed=False)
            return False, final_limit, current

    def get_system_health_multiplier(self):
        cpu = psutil.cpu_percent()
        memory = psutil.virtual_memory().percent

        if cpu > 80 or memory > 85:
            return 0.6  # Reduce to 60%
        elif cpu < 40 and memory < 60:
            return 1.3  # Increase to 130%
        return 1.0

    # ... other helper methods
Usage in your API:
@app.route('/api/users')
def get_users():
    user_id = get_current_user_id()
    user_tier = get_user_tier(user_id)

    allowed, limit, current = rate_limiter.allow_request(
        user_id,
        '/api/users',
        user_tier
    )

    if not allowed:
        response = jsonify({'error': 'Rate limit exceeded'})
        response.status_code = 429
    else:
        response = jsonify(get_users_data())

    # Add rate limit headers (header values must be strings)
    response.headers['X-RateLimit-Limit'] = str(limit)
    response.headers['X-RateLimit-Remaining'] = str(max(0, limit - current))
    response.headers['X-RateLimit-Reset'] = str(get_reset_time())
    return response
Real-World Examples
Stripe: Tiered + Load-Based
Stripe uses a combination of user tiers and system health:
- Test mode: 25 requests per second
- Live mode: 100 requests per second (verified accounts)
- High-volume businesses: Custom limits negotiated
During incidents or high load, they temporarily reduce all limits by 20-40% across the board. They announce this via their status page.
Twitter API: Credits System
Twitter uses a credits-based system:
- Each endpoint has a cost in credits
- Simple tweets: 1 credit
- Search: 20 credits
- User timeline: 75 credits
- You get 300 credits per 15-minute window
This makes expensive operations naturally rate-limited while keeping cheap operations plentiful.
Netflix: Adaptive Based on Device
Netflix adjusts streaming quality and API limits based on:
- Device type (mobile gets lower quality on cellular)
- Network speed
- Server load in the region
- User’s subscription tier
During peak hours (8-10 PM), they slightly reduce API request limits for non-critical features like recommendations while keeping playback APIs unrestricted.
AWS API Gateway: Burst + Steady State
AWS uses token bucket with:
- Burst capacity: 5,000 requests (bucket size)
- Steady state: 10,000 requests per second (refill rate)
You can handle sudden spikes using the burst capacity, but sustained load is limited by the refill rate. Both are configurable.
The Response: What to Return
When you block a request, be helpful:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1699564800
Retry-After: 45
Content-Type: application/json

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "You have exceeded your rate limit of 1000 requests per hour.",
    "limit": 1000,
    "remaining": 0,
    "reset_at": "2025-11-05T15:00:00Z",
    "retry_after": 45,
    "docs_url": "https://docs.yourapi.com/rate-limits"
  }
}
Essential headers:
- X-RateLimit-Limit: The current limit (can change if dynamic)
- X-RateLimit-Remaining: How many requests left
- X-RateLimit-Reset: Unix timestamp when limit resets
- Retry-After: Seconds to wait before retrying
Common Pitfalls to Avoid
1. Race Conditions
Multiple servers incrementing the same Redis counter simultaneously can cause overages.
Fix: Use Lua scripts for atomic operations:
-- Atomic increment with limit check
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local current = tonumber(redis.call('GET', key) or 0)
if current >= limit then
    return 0
end
redis.call('INCR', key)
return 1
2. Clock Skew
Servers with different system times cause inconsistent rate limiting.
Fix: Use NTP to synchronize clocks, or use Redis TIME command for centralized timestamps.
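For the Redis option, redis-py exposes the TIME command directly; a tiny helper might look like this:

def redis_now(redis_client):
    # TIME returns (seconds, microseconds) from the Redis server's clock,
    # so every API server sees the same timestamps
    seconds, microseconds = redis_client.time()
    return seconds + microseconds / 1_000_000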
3. Memory Leaks
Forgetting to set expiry on Redis keys fills up memory.
Fix: Always set TTL:
redis.setex(key, ttl_seconds, value)
# Or
redis.set(key, value, ex=ttl_seconds)
4. Too Aggressive Throttling
Reacting instantly to brief CPU spikes can create oscillating behavior.
Fix: Add dampening - only adjust if metrics stay high for 2-3 minutes:
def should_reduce_limits():
    # Check if CPU has been high for the last 3 data points
    recent_cpu = get_cpu_history(count=3)
    return all(cpu > 80 for cpu in recent_cpu)
5. Ignoring Retries
Rate-limited clients often retry immediately, making the problem worse.
Fix: Implement exponential backoff on the client side, and penalize aggressive retriers:
if retry_count > 5:
    # Aggressive retrier, increase their penalty
    penalty_duration = min(3600, retry_count * 60)
    redis.setex(f"penalty:{user_id}", penalty_duration, 1)
Monitoring Your Rate Limiter
Track these metrics (a minimal instrumentation sketch follows the list):
- Rate limit hits by user: Who’s getting blocked?
- Rate limit hits by endpoint: Which endpoints are bottlenecks?
- System health when limiting: Are we reacting appropriately?
- False positive rate: Are we blocking legitimate users?
- Redis latency: Is Redis becoming the bottleneck?
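If you run a Prometheus-style stack, emitting these is a few lines with the prometheus_client library (the metric and label names here are my own choices, not a standard):

from prometheus_client import Counter, Histogram

RATE_LIMIT_BLOCKS = Counter(
    'rate_limit_blocks_total',
    'Requests rejected by the rate limiter',
    ['user_tier', 'endpoint'],
)
REDIS_CHECK_LATENCY = Histogram(
    'rate_limiter_redis_seconds',
    'Latency of rate-limit checks against Redis',
)

def record_block(user_tier, endpoint):
    RATE_LIMIT_BLOCKS.labels(user_tier=user_tier, endpoint=endpoint).inc()

# Usage: time the Redis round-trip around each check
# with REDIS_CHECK_LATENCY.time():
#     allowed, limit, current = rate_limiter.allow_request(...)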
Tools and Libraries
Don’t reinvent the wheel:
| Language/Platform | Library/Tool | Description |
|---|---|---|
| Node.js | `express-rate-limit` | Easy middleware for Express |
| Node.js | `rate-limiter-flexible` | Supports Redis, DynamoDB, memory |
| Python | `django-ratelimit` | Django-specific rate limiting |
| Python | `flask-limiter` | Flask middleware |
| Python | `slowapi` | FastAPI rate limiting |
| Ruby | `rack-attack` | Flexible rate limiting for Rack apps |
| Go | `go-redis/redis_rate` | Redis-based rate limiting |
| Go | `tollbooth` | HTTP rate limiter middleware |
| Java | `bucket4j` | Token bucket implementation |
| Java | `resilience4j-ratelimiter` | Part of resilience4j |
| Infrastructure | Kong | API gateway with built-in rate limiting |
| Infrastructure | Nginx | Rate limiting module |
| Infrastructure | AWS API Gateway | Managed rate limiting |
| Infrastructure | Cloudflare | Edge-level rate limiting |
Key Takeaways
- Start simple, add complexity as needed. A basic fixed rate limiter is better than nothing.
- Use Redis for distributed setups. It’s fast, atomic, and proven at scale.
- Token bucket is the sweet spot for most use cases. Allows bursts while maintaining long-term limits.
- Make it dynamic based on system health. Adjust limits when your system is struggling or has spare capacity.
- Different users deserve different treatment. Paid users get higher limits. Misbehaving users get lower limits.
- Be transparent with clients. Return clear headers and error messages explaining what happened.
- Monitor and iterate. Track who’s getting blocked and why. Adjust your strategy based on data.
- Don’t block at all costs. Sometimes it’s better to let extra traffic through and scale up than to block paying customers.
The goal isn’t to block requests - it’s to keep your system healthy and responsive for everyone. A good dynamic rate limiter is invisible when things are smooth, and protective when things get rough.
Want more system design deep dives? Check out How Shopify Powers 5 Million Stores Without Breaking a Sweat and Change Data Capture Explained.
References: Stripe Rate Limits, Redis Rate Limiting, AWS API Gateway Throttling