The thundering herd problem occurs when many requests simultaneously hit the same backend resource, usually after a cache key expires. All requests see the miss, all query the database at once, and the database crashes under the spike. Solutions: Add jitter to TTLs, use request coalescing (singleflight), distributed locking with stale data fallback, probabilistic early recomputation, or write buffering with queues.
It is 3 AM. Your phone buzzes. “Database CPU at 100%. All queries timing out.”
You check the logs. The database went from handling 500 queries per second to receiving 15,000 queries in the same second. Your cache key for the product catalog expired at midnight. Every server in your fleet discovered the miss at the same time. Every one of them hit the database. The database fell over. And because every request timed out, the retry logic kicked in. Now instead of 15,000 queries, you are looking at 30,000.
This is the thundering herd problem. And it has taken down systems at companies of every size.
Table of Contents
- What Is the Thundering Herd Problem?
- Why Systems Collapse Under Thundering Herd
- Where It Shows Up in Real Systems
- Six Solutions That Actually Work
- How Big Tech Companies Solved It
- Practical Implementation Guide
- Which Solution Should You Pick?
- Lessons Learned
What Is the Thundering Herd Problem?
The thundering herd problem happens when a large number of requests hit the same backend resource at the same time. The name comes from a herd of animals startled awake simultaneously, all stampeding in the same direction.
In software systems, this most commonly shows up as a cache stampede (also called the dog pile effect). Here is the sequence:
- A popular cache key expires
- Hundreds of concurrent requests arrive for that key
- Every request sees a cache miss
- Every request queries the database
- The database gets overwhelmed and either slows down or crashes
sequenceDiagram
participant C1 as Server 1
participant C2 as Server 2
participant C3 as Server 3
participant Cache as Redis Cache
participant DB as Database
Note over Cache: Cache key expires
C1->>Cache: GET product:123
C2->>Cache: GET product:123
C3->>Cache: GET product:123
Cache-->>C1: MISS
Cache-->>C2: MISS
Cache-->>C3: MISS
C1->>DB: SELECT * FROM products...
C2->>DB: SELECT * FROM products...
C3->>DB: SELECT * FROM products...
Note over DB: All 3 run the same<br/>expensive query
DB-->>C1: Result
DB-->>C2: Result
DB-->>C3: Result
C1->>Cache: SET product:123
C2->>Cache: SET product:123
C3->>Cache: SET product:123
Note over Cache: Same value written 3 times
Now imagine this with 500 servers instead of 3. That is the thundering herd.
The term originally comes from operating systems. When multiple threads wait on a lock or event, the OS wakes all of them even though only one can proceed. The rest waste CPU cycles going back to sleep. Linux fixed this at the kernel level with the EPOLLEXCLUSIVE flag in kernel 4.5, which wakes only one thread per event. But in distributed systems, the problem persists because there is no central OS coordinating your servers.
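To make the kernel-level version concrete, here is a minimal sketch (an illustration, not from the original article) of registering a listening socket with that flag in Python 3.6+ on Linux 4.5+. In a pre-fork server, each worker process would register the same shared listening socket this way so the kernel wakes only one of them per incoming connection:
import select
import socket

# In a pre-fork server, each worker runs this accept loop against the
# same shared listening socket inherited from the parent process.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", 8080))
sock.listen(128)

ep = select.epoll()
# EPOLLEXCLUSIVE: wake only one waiting process per incoming connection
ep.register(sock.fileno(), select.EPOLLIN | select.EPOLLEXCLUSIVE)

while True:
    for fd, _events in ep.poll():
        conn, addr = sock.accept()
        conn.sendall(b"hello\n")
        conn.close()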
Why Systems Collapse Under Thundering Herd
The math is what makes this dangerous. Consider a system with these specs:
| Metric | Value |
|---|---|
| Cache TTL | 1 hour |
| Requests per second | 5,000 |
| Database capacity | 500 concurrent queries |
| Backend query time | 200ms |
Under normal operation, the cache absorbs 99% of reads. The database handles a comfortable 50 queries per second. Life is good.
Then the cache key expires.
In the 200ms it takes to execute one database query, 1,000 requests arrive (5,000 RPS * 0.2 seconds). All 1,000 hit the database. Your database capacity is 500 concurrent queries. It is now overloaded by 2x.
But it gets worse. The overloaded database takes longer to respond. Response times climb from 200ms to 2 seconds. In those 2 seconds, 10,000 more requests pile up. And then the retries begin.
flowchart TD
A[Cache Key Expires] --> B[1,000 Requests Hit DB]
B --> C[DB Overloaded]
C --> D[Response Time: 200ms to 2s]
D --> E[10,000 More Requests Queue Up]
E --> F[Timeouts Trigger Retries]
F --> G[20,000 Total Requests]
G --> H[Database Crashes]
H --> I[All Requests Fail]
I --> J[More Retries]
J --> G
style A fill:#fff3e0,stroke:#e65100,stroke-width:2px
style H fill:#fee2e2,stroke:#dc2626,stroke-width:2px
style J fill:#fee2e2,stroke:#dc2626,stroke-width:2px
style I fill:#fee2e2,stroke:#dc2626,stroke-width:2px
This is a positive feedback loop. The system’s failure response (retries) generates more load, which causes more failures, which triggers more retries. What started as a 2x overload snowballs into a 40x overload.
The key insight: systems are designed for sustained rate, not instantaneous load. The database above comfortably serves its steady 50 queries per second all day, but it cannot absorb a thousand requests landing inside a single 200ms window.
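To put numbers on it, here is a quick back-of-envelope calculation using the figures from the table above:
RPS = 5_000          # incoming requests per second
QUERY_TIME_S = 0.2   # backend query time (200 ms)
DB_CAPACITY = 500    # concurrent queries the database can handle

piled_up = RPS * QUERY_TIME_S      # 1,000 requests arrive during one query
overload = piled_up / DB_CAPACITY  # 2.0: twice the database's limit
print(piled_up, overload)          # 1000.0 2.0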
Where It Shows Up in Real Systems
Cache expiration is the most common trigger, but the thundering herd problem appears in many other scenarios.
Service Restart Storms
You deploy a new version of your service. The rolling deployment restarts 50 instances. Each instance starts with an empty local cache. All 50 instances simultaneously hit the database or downstream services to warm their caches.
Auth Token Expiry
Your microservices share a JWT token with a fixed expiry. When it expires, all 30 services request a new token from the auth server at the exact same second. The auth server buckles under the synchronized load.
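The fix previews the jitter solution covered below: have each service refresh its token at a random point well before expiry instead of waiting for it to lapse. A minimal sketch (the 80-95% window is an illustrative choice, not from the article):
import random

def seconds_until_refresh(token_lifetime_s):
    # Each instance picks its own refresh point between 80% and 95% of the
    # token lifetime, so 30 services never hit the auth server together.
    return token_lifetime_s * random.uniform(0.80, 0.95)

# Example: a 3600-second token is refreshed somewhere between 2880s and 3420s.
wait = seconds_until_refresh(3600)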
Synchronized Scheduled Tasks
You have 100 servers running a cron job at 0 * * * * (top of every hour). All 100 servers trigger the same heavy operation simultaneously. If you need to understand cron expressions better, check out this guide on cron jobs.
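One simple mitigation (an assumption, not from the article): keep the shared 0 * * * * schedule but have each server sleep a random "splay" before doing the heavy work, spreading 100 servers across a few minutes.
import random
import time

def run_hourly_job():
    splay = random.uniform(0, 300)   # per-host random delay of up to 5 minutes
    time.sleep(splay)
    do_heavy_operation()             # placeholder for the real work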
Network Partition Recovery
A network partition isolates some clients from your service. When the partition heals, all the queued requests from those clients arrive at once. The service goes from zero load to peak load instantly.
Retry Storms
A downstream service goes down for 30 seconds. During that time, all requests fail and queue up for retry. When the service comes back, every client retries simultaneously. The burst of retries can knock the service back down, creating a cycle.
This pattern is related to what happens during major outages. The AWS US-East outage in October 2025 showed how cascading failures combine with retry storms to amplify problems.
flowchart TD
subgraph triggers["Common Triggers"]
direction LR
T1["Cache Expiration"] ~~~ T2["Service Restarts"] ~~~ T3["Token Expiry"] ~~~ T4["Cron Jobs"] ~~~ T5["Network Recovery"]
end
subgraph effect["The Stampede"]
direction LR
E1["Synchronized Requests"] ~~~ E2["Backend Overload"] ~~~ E3["Timeouts and Retries"]
end
subgraph outcome["System Impact"]
direction LR
O1["Latency Spikes"] ~~~ O2["Error Rate Increase"] ~~~ O3["Potential Outage"]
end
triggers --> effect --> outcome
style T1 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style T2 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style T3 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style T4 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style T5 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style E1 fill:#fee2e2,stroke:#dc2626,stroke-width:2px
style E2 fill:#fee2e2,stroke:#dc2626,stroke-width:2px
style E3 fill:#fee2e2,stroke:#dc2626,stroke-width:2px
style O1 fill:#fef2f2,stroke:#b91c1c,stroke-width:2px
style O2 fill:#fef2f2,stroke:#b91c1c,stroke-width:2px
style O3 fill:#fef2f2,stroke:#b91c1c,stroke-width:2px
Six Solutions That Actually Work
There is no single magic fix. Each solution addresses the problem from a different angle. The best approach usually combines two or three of these.
1. Jitter: Break the Synchronization
The simplest and most effective first step. Jitter adds randomness to TTLs and retry intervals so that requests naturally spread out over time instead of clustering at the same moment.
Without jitter:
All cache keys set with TTL = 3600 expire exactly one hour after they were written. If your service warmed its cache at 10:00 AM, every key expires at 11:00 AM.
With jitter:
import random

BASE_TTL = 3600      # 1 hour
JITTER_RANGE = 300   # 5 minutes

def get_ttl():
    return BASE_TTL + random.randint(-JITTER_RANGE, JITTER_RANGE)

# Returns between 3300 and 3900 seconds
Instead of all keys expiring at 11:00:00, they expire between 10:55:00 and 11:05:00. The database handles a steady stream of individual refreshes instead of one massive spike.
For retries, use exponential backoff with jitter:
import random

def get_retry_delay(attempt):
    base_delay = min(2 ** attempt, 60)  # Cap at 60 seconds
    jitter = random.uniform(0, base_delay)
    return base_delay + jitter

# Attempt 1: 2-4 seconds
# Attempt 2: 4-8 seconds
# Attempt 3: 8-16 seconds
# Attempt 4: 16-32 seconds
AWS recommends exponential backoff with jitter in its architecture best practices. It is simple to implement and eliminates most synchronized retry storms.
When to use: Always. Add jitter to every TTL and retry interval in your system. There is no reason not to.
2. Request Coalescing (Singleflight Pattern)
Request coalescing groups identical concurrent requests and executes only one backend call. All other requests wait for that single result.
sequenceDiagram
participant C1 as Server 1
participant C2 as Server 2
participant C3 as Server 3
participant Coalescer as Request Coalescer
participant DB as Database
Note over Coalescer: Cache key expired
C1->>Coalescer: GET product:123
C2->>Coalescer: GET product:123
C3->>Coalescer: GET product:123
Note over Coalescer: Only 1 request<br/>goes to DB
Coalescer->>DB: SELECT * FROM products...
DB-->>Coalescer: Result
Coalescer-->>C1: Result
Coalescer-->>C2: Result (shared)
Coalescer-->>C3: Result (shared)
Note over DB: Database handled<br/>1 query instead of 3
Go’s singleflight package is the best-known implementation:
import "golang.org/x/sync/singleflight"

var group singleflight.Group

func GetProduct(id string) (*Product, error) {
    key := "product:" + id

    // Check cache first
    if cached, found := cache.Get(key); found {
        return cached.(*Product), nil
    }

    // singleflight ensures only one goroutine fetches from DB
    // Others with the same key wait for that result
    result, err, _ := group.Do(key, func() (interface{}, error) {
        product, err := db.GetProduct(id)
        if err != nil {
            return nil, err
        }
        cache.Set(key, product, getTTLWithJitter())
        return product, nil
    })
    if err != nil {
        return nil, err
    }
    return result.(*Product), nil
}
The key idea: 500 concurrent requests for the same product result in 1 database query instead of 500.
In Java, you can achieve the same with CompletableFuture and a ConcurrentHashMap:
private final ConcurrentHashMap<String, CompletableFuture<Product>> inFlight
        = new ConcurrentHashMap<>();

public Product getProduct(String id) {
    String cacheKey = "product:" + id;

    // Check cache
    Product cached = cache.get(cacheKey);
    if (cached != null) return cached;

    // Coalesce: only one thread fetches from DB
    CompletableFuture<Product> future = inFlight.computeIfAbsent(
        cacheKey,
        k -> CompletableFuture.supplyAsync(() -> {
            try {
                Product product = db.getProduct(id);
                cache.set(cacheKey, product, getTTLWithJitter());
                return product;
            } finally {
                inFlight.remove(cacheKey);
            }
        })
    );
    return future.join();
}
When to use: When you have read heavy workloads with popular keys that many servers request simultaneously.
Limitation: This works within a single process. For distributed systems with multiple server instances, you need distributed locking (solution 3).
3. Distributed Locking with Stale Data Fallback
Only one request across your entire cluster refreshes the cache. Everyone else gets slightly stale data while the refresh happens.
flowchart TD
A[Request Arrives] --> B{Cache Hit?}
B -->|Yes| C[Return Cached Data]
B -->|No| D{Acquire Lock?}
D -->|Got Lock| E[Fetch from Database]
D -->|Lock Held| F{Stale Data Available?}
E --> G[Update Cache]
G --> H[Release Lock]
H --> C
F -->|Yes| I[Return Stale Data]
F -->|No| J[Wait and Retry]
J --> B
style C fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style E fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style I fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style J fill:#f1f5f9,stroke:#64748b,stroke-width:2px
Implementation using Redis:
import redis
import json
import random
import time

r = redis.Redis()

LOCK_TTL = 5       # Lock expires in 5 seconds
STALE_TTL = 7200   # Stale data kept for 2 hours
FRESH_TTL = 3600   # Fresh data TTL: 1 hour

def get_product(product_id):
    cache_key = f"product:{product_id}"
    lock_key = f"lock:{cache_key}"
    stale_key = f"stale:{cache_key}"

    # Step 1: Try fresh cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Step 2: Try to acquire lock
    lock_acquired = r.set(lock_key, "1", nx=True, ex=LOCK_TTL)

    if lock_acquired:
        try:
            # We got the lock. Fetch from database.
            product = db.get_product(product_id)

            # Update both fresh and stale caches
            ttl = FRESH_TTL + random.randint(-300, 300)  # jitter
            r.set(cache_key, json.dumps(product), ex=ttl)
            r.set(stale_key, json.dumps(product), ex=STALE_TTL)

            return product
        finally:
            r.delete(lock_key)
    else:
        # Someone else is refreshing. Return stale data.
        stale = r.get(stale_key)
        if stale:
            return json.loads(stale)

        # No stale data available. Wait briefly and retry.
        time.sleep(0.1)
        return get_product(product_id)
The trick is the stale key. You keep a copy of the data with a much longer TTL. When the fresh cache expires and someone else holds the lock, other requests return this stale copy instead of hitting the database or waiting.
When to use: When you can tolerate briefly stale data (a few seconds) and need cluster wide protection.
Watch out for: Lock timeouts. If the request holding the lock crashes, you need the lock to auto expire (that is what the ex=LOCK_TTL does).
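One common hardening step not shown above: store a unique token as the lock value and release it with a compare-and-delete, so a slow worker whose lock already expired cannot delete a lock now held by someone else. A sketch with redis-py (helper names are illustrative):
import uuid

RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def acquire_lock(r, lock_key, ttl=5):
    token = str(uuid.uuid4())
    if r.set(lock_key, token, nx=True, ex=ttl):
        return token    # caller must pass this token back to release_lock
    return None         # someone else holds the lock

def release_lock(r, lock_key, token):
    # Delete the lock only if it still holds our token
    r.eval(RELEASE_SCRIPT, 1, lock_key, token)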
4. Probabilistic Early Recomputation
Instead of waiting for the cache to expire and dealing with the stampede, refresh it before expiry. The idea is simple: as a cache entry gets closer to its expiry time, requests start probabilistically refreshing it in the background.
import math
import random
import threading
import time

def get_with_early_recompute(key, fetch_fn, ttl=3600):
    cached = cache.get_with_metadata(key)

    if cached is None:
        # Complete miss. Fetch and cache.
        value = fetch_fn()
        cache.set(key, value, ttl=ttl + random.randint(-300, 300))
        return value

    value, created_at, expire_at = cached
    remaining = expire_at - time.time()
    total_ttl = expire_at - created_at
    elapsed = max(total_ttl - remaining, 0.001)

    # As the entry approaches expiry, the chance of an early refresh rises
    # toward 1. With beta = 1.0 it is ~37% at half the TTL and ~90% when
    # only 10% of the TTL remains.
    beta = 1.0  # Tuning parameter: higher values refresh earlier
    probability = 1.0 if remaining <= 0 else math.exp(-beta * remaining / elapsed)

    if random.random() < probability:
        # Refresh in the background; the caller still gets the current value
        threading.Thread(
            target=lambda: cache.set(key, fetch_fn(),
                                     ttl=ttl + random.randint(-300, 300)),
            daemon=True,
        ).start()

    return value
This is a simplified form of the technique the academic literature calls XFetch. The cache entry gets refreshed by normal traffic before it actually expires, so by the time the TTL would hit zero, the cache already holds a fresh value.
When to use: For predictable, high traffic keys where you cannot afford even a brief stampede window.
Trade-off: Uses more database resources during normal operation because refreshes happen before they are strictly necessary.
5. Write Buffering with Message Queues
For write heavy scenarios (like social media feeds during viral events), the thundering herd comes from writes, not reads. The solution: buffer writes in a message queue and process them at a sustainable rate.
flowchart LR
subgraph clients["Incoming Writes"]
W1["Write 1"]
W2["Write 2"]
W3["Write 3"]
W4["Write N..."]
end
subgraph buffer["Message Queue"]
Q["Kafka / RabbitMQ / SQS"]
end
subgraph processing["Controlled Processing"]
P["Consumer<br/>Rate: 500/sec"]
end
subgraph storage["Database"]
DB["PostgreSQL"]
end
W1 --> Q
W2 --> Q
W3 --> Q
W4 --> Q
Q --> P
P --> DB
style Q fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style P fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style DB fill:#f1f5f9,stroke:#64748b,stroke-width:2px
The queue acts as a shock absorber. It accepts writes at whatever rate they come in but feeds them to the database at a controlled, sustainable rate.
This is exactly how Kafka is used at companies like LinkedIn and Uber. They do not send events directly to the database. Everything goes through Kafka first, and consumers process events at their own pace.
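In production the buffer is usually Kafka, RabbitMQ, or SQS. As a minimal in-process sketch of the pacing idea using only the standard library (the handler and the 500/sec budget are illustrative):
import queue
import threading
import time

write_queue = queue.Queue()
MAX_WRITES_PER_SEC = 500

def apply_write_to_db(write):
    # Placeholder for the real database insert
    print("writing", write)

def consumer():
    interval = 1.0 / MAX_WRITES_PER_SEC
    while True:
        write = write_queue.get()   # blocks until a write is available
        apply_write_to_db(write)
        time.sleep(interval)        # pace writes to roughly 500 per second

threading.Thread(target=consumer, daemon=True).start()

# Request handlers enqueue and return immediately, whatever the burst size.
write_queue.put({"user_id": 42, "action": "like", "post_id": 123})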
When to use: Write heavy workloads, event processing, anything where the write rate can spike unpredictably.
6. Admission Control and Load Shedding
Sometimes the best solution is to say no. When the system detects it is overloaded, it starts rejecting excess requests instead of trying to process all of them and failing on all of them.
from collections import deque

class AdaptiveLoadShedder:
    def __init__(self, max_concurrent=500):
        self.max_concurrent = max_concurrent
        self.current = 0
        self.recent_latencies = deque(maxlen=1000)
        self.baseline_latency = None

    def should_accept(self):
        # Reject if too many requests are already in flight
        if self.current >= self.max_concurrent:
            return False

        # Reject if latency has spiked (a sign of overload)
        if self.baseline_latency and self.recent_latencies:
            avg_latency = sum(self.recent_latencies) / len(self.recent_latencies)
            if avg_latency > self.baseline_latency * 3:
                # Latency 3x above normal. Shed load.
                return False

        self.current += 1
        return True

    def record_request(self, latency):
        # Call when an accepted request finishes, passing its observed latency
        self.current = max(self.current - 1, 0)
        self.recent_latencies.append(latency)
        if self.baseline_latency is None:
            self.baseline_latency = latency
This is the approach behind the rate limiter pattern. Rather than processing everything and crashing, the system degrades gracefully by prioritizing requests it can actually handle.
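A hypothetical usage of the class above in a request handler (error_response and process are placeholders): reject early when overloaded, and record latency only for accepted requests.
import time

shedder = AdaptiveLoadShedder(max_concurrent=500)

def handle_request(request):
    if not shedder.should_accept():
        # Fail fast with a retryable status instead of queueing more work
        return error_response(503, "Service overloaded, try again later")
    start = time.time()
    try:
        return process(request)
    finally:
        shedder.record_request(time.time() - start)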
When to use: As a safety net alongside other solutions. Load shedding is your last line of defense.
How Big Tech Companies Solved It
Facebook/Meta: The Lease Mechanism
Facebook’s 2013 paper “Scaling Memcache at Facebook” described one of the most influential solutions to the thundering herd problem.
At Facebook’s scale, a single popular cache key (like a celebrity profile) could be requested thousands of times per second. When that key expired, the resulting stampede could take down their MySQL shards.
Their solution: leases.
sequenceDiagram
participant S1 as Server 1
participant S2 as Server 2
participant S3 as Server 3
participant MC as Memcached
participant DB as MySQL
S1->>MC: GET celebrity:123
MC-->>S1: MISS + Lease Token #42
S2->>MC: GET celebrity:123
MC-->>S2: MISS (no lease, wait)
S3->>MC: GET celebrity:123
MC-->>S3: MISS (no lease, wait)
S1->>DB: SELECT * FROM users...
DB-->>S1: User data
S1->>MC: SET celebrity:123<br/>with Lease #42
Note over MC: Lease verified.<br/>Cache updated.
S2->>MC: GET celebrity:123
MC-->>S2: HIT (fresh data)
S3->>MC: GET celebrity:123
MC-->>S3: HIT (fresh data)
How it works:
- When a cache miss occurs, Memcached returns a lease token to the first requester
- Only the request with the valid lease token can write to the cache
- Other requests that see the miss are told to wait a short time (usually 10ms) and retry
- By the time they retry, the first request has already populated the cache
The lease has a timeout. If the first request crashes or takes too long, the lease expires and a new one is issued.
This single change reduced Facebook’s peak database load by orders of magnitude. The paper reported that without leases, a single hot key could generate thousands of database queries. With leases, it generated exactly one.
For more on how Meta handles cache consistency at scale, see How Meta Achieves 99.99999999% Cache Consistency.
Twitter/X: Write Buffering for Viral Events
Twitter faces a unique thundering herd challenge. When a celebrity tweets, millions of users need to see it in their timeline. The write path, fanning out the tweet to all followers, can generate millions of writes in a fraction of a second.
Their approach: don’t try to absorb the spike. Smooth it.
Instead of writing directly to the timeline database, tweets go into a message queue (Kafka). Timeline fanout workers consume from the queue at a controlled rate that the database can handle. The user sees the tweet immediately (optimistic UI), but the actual write to each follower’s timeline happens asynchronously over seconds or minutes.
For extremely popular accounts (those with millions of followers), Twitter uses a hybrid approach:
- Regular users: Fan out on write. When you tweet, the system writes your tweet to each follower’s timeline.
- Celebrity accounts: Fan out on read. The tweet is stored once, and when a follower opens their timeline, it is merged in at read time.
This prevents a single celebrity tweet from generating 50 million simultaneous writes.
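A simplified sketch of the hybrid read path (the store names and tweet shape are illustrative, not Twitter's actual code): the precomputed timeline covers regular accounts, and celebrity tweets are merged in at read time.
def load_timeline(user_id, limit=50):
    # Fan out on write: this user's precomputed timeline entries
    precomputed = timeline_store.get(user_id)

    # Fan out on read: merge in tweets from followed celebrity accounts
    celebrity_ids = following_store.celebrities(user_id)
    celebrity_tweets = [
        tweet
        for cid in celebrity_ids
        for tweet in tweet_store.recent(cid, limit)
    ]

    merged = sorted(precomputed + celebrity_tweets,
                    key=lambda t: t["created_at"], reverse=True)
    return merged[:limit]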
AWS: Exponential Backoff with Jitter
AWS’s approach is documented in their architecture blog. All AWS SDKs implement exponential backoff with full jitter by default.
They identified three strategies and measured their effectiveness:
| Strategy | How It Works | Thundering Herd Prevention |
|---|---|---|
| Full Jitter | sleep = random(0, min(cap, base * 2^attempt)) | Best. Spreads retries evenly. |
| Equal Jitter | Half exponential, half random | Good. Some clustering remains. |
| Decorrelated Jitter | sleep = min(cap, random(base, sleep * 3)) | Good. Self-adjusting. |
AWS recommends full jitter for most cases because it produces the least amount of work and the most even spread of retries.
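For reference, here is a direct sketch of the full jitter and decorrelated jitter formulas from the table (cap and base values are illustrative, in seconds):
import random

CAP = 60    # maximum sleep in seconds
BASE = 1    # initial backoff in seconds

def full_jitter(attempt):
    # sleep = random(0, min(cap, base * 2^attempt))
    return random.uniform(0, min(CAP, BASE * 2 ** attempt))

def decorrelated_jitter(previous_sleep):
    # sleep = min(cap, random(base, sleep * 3))
    return min(CAP, random.uniform(BASE, previous_sleep * 3))

# Decorrelated jitter feeds the previous sleep back into the next one.
sleep = BASE
for attempt in range(5):
    sleep = decorrelated_jitter(sleep)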
Netflix: Request Coalescing with Hystrix
Netflix built request coalescing into their Hystrix library (now succeeded by Resilience4j). When multiple threads request the same data within a short time window, Hystrix batches them into a single backend call.
Their implementation goes further than basic singleflight by also implementing:
- Circuit breakers: If a backend is failing, stop sending requests entirely
- Bulkheads: Isolate different types of requests so one thundering herd does not affect unrelated requests
- Fallbacks: Return cached data, default values, or degraded responses when the primary call fails
This is a defense in depth approach. Request coalescing prevents the stampede. Circuit breakers detect it early. Bulkheads contain the blast radius. Fallbacks keep users happy while recovery happens.
Go Standard Library: singleflight
Go’s approach is the most developer friendly. The golang.org/x/sync/singleflight package provides request coalescing as a standard library utility. It is widely used in Go services at companies like Cloudflare, where they handle millions of requests per second at their CDN edge.
When thousands of requests arrive for the same cached resource during a cache miss, singleflight ensures only one goroutine calls the backend. The rest simply wait for that result. This pattern is so effective that it has become the default approach for cache access in most Go services.
Practical Implementation Guide
Let me put it all together. Here is a production ready cache layer that combines multiple solutions:
import redis
import json
import random
import time

class ThunderingHerdCache:
    """
    Cache layer that prevents thundering herd using:
    1. Jittered TTLs
    2. Distributed locking
    3. Stale data fallback
    """

    def __init__(self, redis_client, base_ttl=3600, jitter=300,
                 lock_ttl=5, stale_ttl=7200):
        self.redis = redis_client
        self.base_ttl = base_ttl
        self.jitter = jitter
        self.lock_ttl = lock_ttl
        self.stale_ttl = stale_ttl

    def _get_ttl(self):
        """TTL with jitter to prevent synchronized expiration."""
        return self.base_ttl + random.randint(-self.jitter, self.jitter)

    def get(self, key, fetch_fn):
        """
        Get a value from cache with thundering herd protection.

        Args:
            key: Cache key
            fetch_fn: Function to call on cache miss

        Returns:
            Cached or freshly fetched value
        """
        # Step 1: Check fresh cache
        cached = self.redis.get(f"fresh:{key}")
        if cached:
            return json.loads(cached)

        # Step 2: Try to acquire lock
        lock_key = f"lock:{key}"
        lock_acquired = self.redis.set(
            lock_key, "1", nx=True, ex=self.lock_ttl
        )

        if lock_acquired:
            try:
                # Double-check cache after acquiring lock
                cached = self.redis.get(f"fresh:{key}")
                if cached:
                    return json.loads(cached)

                # Fetch from source
                value = fetch_fn()
                serialized = json.dumps(value)

                # Set fresh cache with jittered TTL
                ttl = self._get_ttl()
                self.redis.set(f"fresh:{key}", serialized, ex=ttl)

                # Set stale cache with longer TTL
                self.redis.set(f"stale:{key}", serialized, ex=self.stale_ttl)

                return value
            finally:
                self.redis.delete(lock_key)
        else:
            # Someone else is refreshing. Return stale data.
            stale = self.redis.get(f"stale:{key}")
            if stale:
                return json.loads(stale)

            # No stale data. Wait briefly and retry once.
            time.sleep(0.05)  # 50ms
            cached = self.redis.get(f"fresh:{key}")
            if cached:
                return json.loads(cached)

            # Last resort: fetch directly (rare edge case)
            return fetch_fn()

# Usage
cache = ThunderingHerdCache(redis.Redis())

product = cache.get(
    key="product:123",
    fetch_fn=lambda: db.get_product(123)
)
Architecture Overview
Here is how all these solutions fit together in a production system:
flowchart TD
A["Incoming Request"] --> B["Check Local Cache"]
B -->|Hit| C["Return Response"]
B -->|Miss| D["Check Distributed Cache<br/>Redis / Memcached"]
D -->|Hit| C
D -->|Miss| E{"Request Coalescing<br/>Same key in flight?"}
E -->|Yes| F["Wait for Result"]
F --> C
E -->|No| G{"Acquire Lock"}
G -->|Got Lock| H["Query Database"]
G -->|Lock Held| I{"Stale Data?"}
I -->|Yes| J["Return Stale Data"]
I -->|No| K["Wait + Retry"]
K --> D
H --> L["Update Cache<br/>Jittered TTL"]
L --> C
style C fill:#dcfce7,stroke:#16a34a,stroke-width:2px
style J fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style H fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style L fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
Which Solution Should You Pick?
| Scenario | Recommended Solution |
|---|---|
| Just starting out | Jitter on all TTLs and retries |
| Read heavy with hot keys | Request coalescing (singleflight) + jitter |
| Distributed microservices | Distributed lock + stale data fallback |
| Predictable high traffic keys | Probabilistic early recomputation |
| Write heavy spikes | Message queue buffering |
| Safety net for everything | Admission control / load shedding |
Most production systems combine 2 or 3 of these. A common and effective combination:
- Jitter on all TTLs (prevents synchronized expiration)
- Distributed locking with stale fallback (prevents stampede when misses do happen)
- Load shedding as a safety net (prevents total system failure)
If you are using the cache-aside pattern (the most common caching strategy), adding jitter and distributed locking will prevent the vast majority of thundering herd scenarios.
Lessons Learned
After researching how companies from Facebook to Netflix handle this problem, a few lessons stand out.
1. Correlated Behavior Is the Enemy
The thundering herd is not a traffic problem. It is a synchronization problem. Your system might handle 50,000 requests per second just fine. But 50,000 requests arriving for the same resource at the same millisecond will take it down.
Design for worst case correlation, not average case independence. Ask yourself: what happens when every cache key in this namespace expires at the same time? What happens when all 100 servers restart at once?
2. Defense in Depth Works
No single solution is enough. Facebook does not just use leases. They use leases AND jitter AND admission control AND graceful degradation. Netflix does not just use request coalescing. They add circuit breakers, bulkheads, and fallbacks on top.
Layer your defenses. If one fails, the next catches it.
3. Stale Data Is Better Than No Data
Most users would rather see a product page with prices from 5 seconds ago than see an error page. Design your system to serve stale data during recovery periods.
The exception: financial transactions, inventory counts, and anything where stale data causes real harm. For those, use write through caching with strong consistency.
4. Test for It Before Production Teaches You
Many teams discover the thundering herd problem at 2 AM during a production outage. You can find it earlier by:
- Running load tests that simulate cache expiration under peak traffic
- Testing service restart scenarios with realistic connection counts
- Simulating network partitions and observing recovery behavior
Tools like Grafana k6 can simulate these scenarios in a controlled environment.
5. Monitor the Right Metrics
Standard metrics like average latency and error rate will not catch a thundering herd forming. Watch for:
- Cache miss ratio spikes: A sudden jump from 1% to 50% cache miss rate
- Database connection pool exhaustion: All connections in use at once
- Request latency percentiles: p99 latency shooting up while p50 stays normal
- Retry rate: A spike in retry requests signals a stampede in progress
6. The Best Fix Is Prevention
Jitter costs nothing to implement. Adding random.randint(-300, 300) to your TTLs takes five minutes and prevents the most common trigger of the thundering herd problem. Do it for every TTL and every retry delay in your system. Today.
Further Reading
- Scaling Memcache at Facebook - The original paper on leases for thundering herd prevention
- AWS Architecture Blog: Exponential Backoff and Jitter - Analysis of jitter strategies
- Go singleflight package - Request coalescing in Go
- Caching Strategies Explained - Deep dive into cache patterns
- Role of Queues in System Design - How queues absorb traffic spikes
- How Kafka Works - The engine behind write buffering at scale
Conclusion
The thundering herd problem is one of those issues that separates systems that work under normal conditions from systems that work under all conditions. Every system that uses caching will eventually face it. The question is whether you have prepared for it or you will discover it during an outage.
The good news: the solutions are well understood and battle tested. Start with jitter on your TTLs. Add request coalescing for hot keys. Implement distributed locking with stale data fallback for your distributed cache. And always have load shedding as your safety net.
These are not theoretical patterns. Facebook runs leases on billions of cache operations daily. Twitter buffers millions of writes through queues. Netflix wraps every external call in circuit breakers. These companies did not start with these solutions. They learned from outages and built them in over time.
You do not have to wait for the 3 AM page to start.