December 28th, 2011. Web servers across the internet started crashing. PHP, Java, Python, and Ruby applications all fell like dominoes. The culprit? A few carefully crafted HTTP POST requests with specific parameter names.
The attackers had found something interesting: they could craft parameter names that all hashed to the same value. Instead of O(1) lookups, the hash tables storing those parameters degraded to O(n). A single request with 65,000 colliding parameters could take 30 seconds to process, bringing servers to their knees.
This wasn’t a buffer overflow or SQL injection. This was a hash collision attack. And it worked because of a fundamental property of hash tables that most developers don’t fully understand.
Let’s explore what hash collisions are, why they matter, and how to handle them properly.
The Problem: When Two Keys End Up in the Same Bucket
Think of a hash table like an apartment building with 100 units. When someone wants to rent an apartment, you run their name through a function that tells you which unit to assign them to:
```javascript
function assignApartment(name) {
  return name.length % 100; // Simple hash function
}
```
“Alice” (5 letters) gets apartment 5. “Charlie” (7 letters) gets apartment 7. This works great until “Steve” (5 letters) shows up. According to the function, Steve also gets apartment 5. But Alice is already there.
This is a collision: two different keys producing the same hash value.
In real hash tables, collisions are inevitable. Here’s why:
```mermaid
graph TB
    subgraph "Infinite Possible Keys"
        K1[user@email.com]
        K2[another@email.com]
        K3[third@email.com]
        K4[fourth@email.com]
        K5[...]
        K6[billions of possible keys]
    end
    subgraph "Finite Hash Table"
        H1[Bucket 0]
        H2[Bucket 1]
        H3[Bucket 2]
        H4[Bucket 3]
        H5[...]
    end
    K1 --> H2
    K2 --> H2
    K3 --> H1
    K4 --> H2
    K6 --> H3
    style H2 fill:#ff6b6b
```
You have an infinite number of possible keys, but only a fixed number of buckets. By the pigeonhole principle, collisions are guaranteed to happen.
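In fact, collisions show up long before the table fills. A quick experiment (a sketch; the 8-character random keys and bucket count are arbitrary choices) shows the birthday paradox at work:

```python
import itertools
import random
import string

def first_collision(buckets=1000):
    """Insert random keys until two land in the same bucket."""
    seen = set()
    for n in itertools.count(1):
        key = "".join(random.choices(string.ascii_lowercase, k=8))
        bucket = hash(key) % buckets
        if bucket in seen:
            return n          # the n-th key collided with an earlier one
        seen.add(bucket)

print(first_collision())      # typically around 40 for 1,000 buckets
```

With 1,000 buckets, the first collision usually arrives after only a few dozen insertions, not a thousand.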
Why Collisions Matter: From O(1) to O(n)
Hash tables promise O(1) average-case performance. But that “average case” assumes collisions are rare and distributed evenly.
When collisions pile up in the same bucket, performance degrades:
```mermaid
graph LR
    A[Hash Table with<br/>No Collisions] --> B["O(1) lookup"]
    C[Hash Table with<br/>Many Collisions<br/>in One Bucket] --> D["O(n) lookup<br/>Must search through list"]
    style B fill:#4caf50
    style D fill:#ff6b6b
```
Let’s see this in action with a simple test:
```python
# Bad hash: everything goes to bucket 0
def bad_hash(key):
    return 0

# Good hash: distributed evenly
def good_hash(key):
    return hash(key) % 1000

# Result with 10,000 insertions:
#   Good hash: 0.0023s
#   Bad hash:  0.8912s
#   Slowdown:  387.5x
```
The bad hash function causes a 387x slowdown. This is why hash collisions matter.
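A harness along these lines reproduces the effect (a sketch: the key format and table size are illustrative, and exact timings vary by machine):

```python
import time

def benchmark(hash_fn, n=10_000, buckets=1000):
    """Insert n keys into a separate-chaining table, then look each up."""
    table = [[] for _ in range(buckets)]
    start = time.perf_counter()
    for i in range(n):
        key = f"key_{i}"
        table[hash_fn(key) % buckets].append((key, i))
    for i in range(n):
        key = f"key_{i}"
        for k, _ in table[hash_fn(key) % buckets]:
            if k == key:
                break
    return time.perf_counter() - start

print(f"good hash: {benchmark(hash):.4f}s")         # short chains everywhere
print(f"bad hash:  {benchmark(lambda k: 0):.4f}s")  # one 10,000-entry chain
```

The bad hash spends almost all of its time scanning a single ever-growing chain: classic O(n) behavior hiding inside an "O(1)" data structure.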
Collision Resolution: Two Main Approaches
Important: No strategy prevents collisions. They’re mathematically inevitable (pigeonhole principle). What we can do is handle them gracefully when they occur.
When a collision happens, we need a strategy to deal with it. There are two main approaches:
1. Separate Chaining: The Linked List Approach
Separate chaining doesn’t prevent collisions. Instead, it accepts them gracefully by storing multiple entries in the same bucket. Each bucket holds a linked list of all entries that hash to that location.
```mermaid
graph TB
    subgraph "Hash Table"
        B0[Bucket 0] --> Empty0[ ]
        B1[Bucket 1] --> Node1[Alice: 25]
        Node1 --> Node2[Steve: 30]
        Node2 --> Node3[David: 28]
        B2[Bucket 2] --> Node4[Bob: 35]
        B3[Bucket 3] --> Empty3[ ]
        B4[Bucket 4] --> Node5[Eve: 22]
    end
    style Node1 fill:#4caf50
    style Node2 fill:#4caf50
    style Node3 fill:#4caf50
    style B1 fill:#ffeb3b
```
Notice Bucket 1: Alice, Steve, and David all hashed to the same bucket. Instead of overwriting each other, they form a chain.
How it handles collisions:
- Insert: Hash the key to find the bucket. If something’s already there, add to the end of the chain.
- Lookup: Hash the key, go to that bucket, then walk through the chain comparing keys until you find a match.
- No data loss: Every entry gets stored, even with collisions.
Example of collision handling:
# Both "Alice" and "Steve" hash to bucket 1
hash("Alice") = 1
hash("Steve") = 1 # Collision!
# Step 1: Insert Alice
table[1] = [("Alice", 25)]
# Step 2: Insert Steve (collision occurs)
# Instead of overwriting, we add to the chain
table[1] = [("Alice", 25), ("Steve", 30)]
# Step 3: Lookup Steve
# Go to bucket 1, walk through chain:
# - Check "Alice" != "Steve", continue
# - Check "Steve" == "Steve", found it!
Why this works:
Collisions slow you down (you have to search the chain), but they don’t break anything. With a good hash function, most buckets have 0-2 items, so chains stay short and searches stay fast.
Pros:
- Simple to implement (see the sketch after this list)
- Never runs out of space (chains can grow without bound)
- Works well even at high load factors
- Every piece of data is preserved

Cons:
- Extra memory for the linked-list pointers
- Poor cache locality (list nodes are scattered in memory)
- Long chains slow down lookups (rare with a good hash function)
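To see the whole strategy in one place, here is a minimal separate-chaining table (a sketch: Python lists stand in for the linked lists, and there is no resizing):

```python
class HashTableChaining:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # one chain per bucket

    def insert(self, key, value):
        chain = self.buckets[hash(key) % self.size]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present: update it
                return
        chain.append((key, value))       # collision or empty: grow the chain

    def get(self, key):
        for k, v in self.buckets[hash(key) % self.size]:
            if k == key:
                return v
        raise KeyError(key)
```

Real implementations add load-factor tracking and resizing, but the collision-handling core is exactly this: append on insert, scan the chain on lookup.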
2. Open Addressing: The Probing Approach
Instead of chains, open addressing finds another empty spot in the table when a collision occurs.
```mermaid
sequenceDiagram
    participant K as Key "Bob"
    participant H as Hash Function
    participant T as Table
    K->>H: hash("Bob") = 3
    H->>T: Try bucket 3
    T->>H: Already occupied
    H->>T: Try bucket 4 (linear probe)
    T->>H: Already occupied
    H->>T: Try bucket 5
    T->>H: Empty!
    Note over T: Store "Bob" in bucket 5
```
Three probing strategies:
Linear Probing
Check the next bucket sequentially:
```python
class HashTableLinearProbing:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size
        self.keys = [None] * size

    def insert(self, key, value):
        bucket = hash(key) % self.size
        # Linear probing: scan forward until we find the key or an empty slot
        for _ in range(self.size):
            if self.keys[bucket] is None:
                self.keys[bucket] = key
                self.table[bucket] = value
                return
            if self.keys[bucket] == key:
                self.table[bucket] = value  # key already present: update it
                return
            bucket = (bucket + 1) % self.size
        raise RuntimeError("hash table is full")  # avoid probing forever

    def get(self, key):
        bucket = hash(key) % self.size
        for _ in range(self.size):
            if self.keys[bucket] is None:
                break  # an empty slot means the key was never inserted
            if self.keys[bucket] == key:
                return self.table[bucket]
            bucket = (bucket + 1) % self.size
        raise KeyError(key)
```
Problem: Primary clustering
Linear probing creates clusters of occupied slots:
```
[X][X][X][X][ ][X][ ][ ][ ][ ]
 ↑
 Cluster: any key hashing here
 must probe through 4 slots
```
Quadratic Probing
Instead of checking the next slot, jump by quadratic distances (1², 2², 3²…). Reduces primary clustering but can cause secondary clustering.
Double Hashing
Use a second hash function to determine the step size. Best distribution, minimal clustering.
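To make the probe sequences concrete, here is a sketch that prints the buckets each strategy would try (the function names and example numbers are illustrative):

```python
def quadratic_probes(h, size):
    """Buckets tried by quadratic probing: h, h+1, h+4, h+9, ... (mod size)."""
    return [(h + i * i) % size for i in range(size)]

def double_hash_probes(h1, h2, size):
    """Buckets tried by double hashing: the step size comes from a second hash."""
    return [(h1 + i * h2) % size for i in range(size)]

print(quadratic_probes(3, 10))       # [3, 4, 7, 2, 9, 8, 9, 2, 7, 4]
print(double_hash_probes(3, 7, 10))  # [3, 0, 7, 4, 1, 8, 5, 2, 9, 6]
```

Notice that the quadratic sequence revisits buckets and never tries some slots, which is why quadratic probing needs a carefully chosen table size, while double hashing with a step coprime to the table size visits every bucket.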
Open Addressing Comparison:
| Method | Clustering | Implementation Complexity | Best For |
|---|---|---|---|
| Linear Probing | High (primary) | Simple | Cache-friendly workloads, small tables |
| Quadratic Probing | Medium (secondary) | Moderate | General purpose |
| Double Hashing | Low | Complex | Large tables, high load |
Pros of Open Addressing:
- Better cache locality
- No extra memory for pointers
- Faster for small data types
Cons:
- Must keep the load factor low (below ~0.7)
- Deletion is tricky: clearing a slot would break probe chains, so deletions leave tombstone markers (see the sketch after this list)
- The table can fill up
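The deletion point deserves a closer look. If you simply cleared a slot, every key that had probed past it would become unreachable. A sketch extending the HashTableLinearProbing class above (the TOMBSTONE sentinel and class name are this sketch's inventions):

```python
TOMBSTONE = object()  # sentinel marking a deleted slot (invented for this sketch)

class HashTableWithDelete(HashTableLinearProbing):
    def delete(self, key):
        bucket = hash(key) % self.size
        for _ in range(self.size):
            if self.keys[bucket] is None:
                break                     # empty slot: the key was never here
            if self.keys[bucket] == key:
                # Leave a tombstone rather than None, so probe sequences
                # for keys stored past this slot still walk through it
                self.keys[bucket] = TOMBSTONE
                self.table[bucket] = None
                return
            bucket = (bucket + 1) % self.size
        raise KeyError(key)
```

Lookups walk straight past tombstones because the sentinel never equals a real key; a fuller implementation would also let insertions reuse tombstone slots.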
Load Factor: The Critical Metric
The load factor is the ratio of stored entries to table size:
```
load_factor = n / m

n = number of entries
m = table size
```
Impact on performance:
```mermaid
graph LR
    A[Load Factor 0.0] --> B[Fast but<br/>Memory Wasteful]
    C[Load Factor 0.5] --> D[Good Balance]
    E[Load Factor 0.9] --> F[Many Collisions<br/>Slow Performance]
    style D fill:#4caf50
    style F fill:#ff6b6b
```
Performance impact:
| Load Factor | Insert Time | Lookup Time |
|---|---|---|
| 0.25 | 0.0012s | 0.0008s |
| 0.50 | 0.0019s | 0.0013s |
| 0.75 | 0.0047s | 0.0039s |
| 0.90 | 0.0193s | 0.0187s |
At 90% load, operations are 16x slower than at 25% load.
Standard load factor thresholds:
- Separate chaining: around 1.0, though implementations are often more conservative (Java’s HashMap resizes at 0.75)
- Open addressing: around 0.7 (rehash at 70% full)
Real-World Implementations
Java HashMap: From Linked Lists to Trees
Java’s HashMap uses separate chaining, but with a twist. When a bucket gets too many collisions, it switches from a linked list to a balanced red-black tree.
```mermaid
graph TB
    subgraph "Normal Bucket"
        N1[Entry 1] --> N2[Entry 2] --> N3[Entry 3]
    end
    subgraph "High Collision Bucket"
        T1[Root]
        T1 --> T2[Left Child]
        T1 --> T3[Right Child]
        T2 --> T4[Entry 4]
        T2 --> T5[Entry 5]
        T3 --> T6[Entry 6]
        T3 --> T7[Entry 7]
    end
    style T1 fill:#4caf50
```
The logic:
```java
// Simplified view of Java 8+ HashMap
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;

// When a bucket reaches 8 entries, convert its list to a tree
if (binCount >= TREEIFY_THRESHOLD - 1) {
    treeifyBin(tab, hash);
}

// When a bucket shrinks back to 6 or fewer entries (during resize),
// convert the tree back to a list
if (node instanceof TreeNode) {
    if (size <= UNTREEIFY_THRESHOLD) {
        // Convert back to list
    }
}
```
Why this works:
- Most buckets have 0-3 entries (good hash distribution)
- A few buckets might get many entries (collision attack)
- Trees limit worst-case lookup to O(log n) instead of O(n)
This change in Java 8 blunted 2011-style hash collision attacks: even a flood of colliding keys only degrades lookups to O(log n), not O(n).
Python Dictionary: Compact and Fast
Python uses open addressing with a sophisticated probing algorithm:
```python
# CPython's probe sequence (simplified from dictobject.c)
PERTURB_SHIFT = 5

def probe_sequence(hash_value, table_size):
    """Yield the buckets CPython would try, in order."""
    mask = table_size - 1          # table sizes are powers of two
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT  # fold higher bits of the hash into the walk
        i = (5 * i + perturb + 1) & mask
```
Key innovations:
- Compact layout: a sparse index table points into a dense, insertion-ordered array of entries, improving cache behavior and memory use (sketched below)
- Perturbed probing: mixes the upper bits of the hash into the probe sequence to avoid clustering
- Dynamic resizing: grows when the table is about two-thirds full
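A toy model of the compact-layout idea (a sketch in pure Python, not CPython's actual C code; compact_insert and the 8-slot table are illustrative):

```python
# Sparse table, probed like open addressing, holding only small integers
indices = [None] * 8
# Dense (hash, key, value) records, in insertion order
entries = []

def compact_insert(key, value):
    i = hash(key) % len(indices)
    while indices[i] is not None:              # linear probe (simplified)
        h, k, _ = entries[indices[i]]
        if k == key:
            entries[indices[i]] = (h, key, value)  # update in place
            return
        i = (i + 1) % len(indices)
    indices[i] = len(entries)                  # point at the next dense slot
    entries.append((hash(key), key, value))

compact_insert("brand", "Tesla")
compact_insert("model", "S")
print([k for _, k, _ in entries])  # keys come back in insertion order
```

This dense entries array is also why dicts preserve insertion order: an implementation detail in CPython 3.6, guaranteed by the language since 3.7.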
The 2011 Hash Collision Attack
On December 28, 2011, researchers revealed a vulnerability affecting most web frameworks. Here’s how it worked:
The Attack Mechanism
Attackers found keys that all hash to the same value:
```python
def simple_hash(s):
    """Vulnerable hash function"""
    h = 0
    for char in s:
        h = (h * 33) + ord(char)
    return h % 1000

# An attacker generates 65,000 keys that all hash to, say, 42,
# then sends them as the parameters of a single HTTP POST request:
# POST / with body key_1=val&key_2=val&...&key_65000=val
```
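Because simple_hash has only 1,000 possible outputs, an attacker doesn’t even need cleverness to build a collision set; brute force is enough. A sketch (find_collisions is illustrative and assumes simple_hash from above):

```python
import itertools
import string

def find_collisions(target=42, count=5):
    """Brute-force lowercase keys with simple_hash(key) == target."""
    hits = []
    for length in itertools.count(1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            key = "".join(chars)
            if simple_hash(key) == target:
                hits.append(key)
                if len(hits) == count:
                    return hits

print(find_collisions())  # e.g. ['ws', 'adm', 'baw', ...]
```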
The impact:
```mermaid
graph TB
    A[Normal Request<br/>100 parameters] --> B[100 hash operations<br/>~0.001s]
    C[Attack Request<br/>65,000 colliding parameters] --> D[65,000 hash operations<br/>+ 65,000 * 32,500 comparisons<br/>~30 seconds]
    style B fill:#4caf50
    style D fill:#ff6b6b
```
A single request could tie up a CPU core for 30 seconds. A few dozen requests would crash the server.
The Fix: Randomized Hash Seeds
The solution was to add randomness to hash functions:
```python
import random

class SecureHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]
        self.seed = random.randint(0, 2**32 - 1)  # Random seed per instance

    def _hash(self, key):
        # Include the seed in the hash calculation
        return hash((key, self.seed)) % self.size
```
Why this works:
- Each server instance has a different seed
- Attacker can’t predict which keys will collide
- Pre-computed collision sets become useless
Modern languages now use randomized hashing by default.
Choosing a Good Hash Function
A good hash function should:
- Distribute keys uniformly (a quick check is sketched below)
- Be fast to compute
- Minimize collisions
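A quick way to test the first property is to count how a sample of realistic keys spreads across buckets (a sketch; the key format is arbitrary):

```python
from collections import Counter

def bucket_spread(hash_fn, keys, buckets=100):
    """Return (buckets touched, largest bucket) for a sample of keys."""
    counts = Counter(hash_fn(k) % buckets for k in keys)
    return len(counts), max(counts.values())

keys = [f"user_{i}@example.com" for i in range(1000)]
print(bucket_spread(hash, keys))                 # roughly (100, ~20): spread out
print(bucket_spread(lambda s: ord(s[0]), keys))  # (1, 1000): everything collides
```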
Bad Hash Function Example
```python
def bad_hash(s):
    """Only looks at first character"""
    return ord(s[0]) % 100

# All these collide:
bad_hash("apple")    # 97 % 100 = 97
bad_hash("avocado")  # 97 % 100 = 97
bad_hash("apricot")  # 97 % 100 = 97
```
Better Hash Function
```python
def djb2_hash(s):
    """DJB2 hash - well-distributed"""
    h = 5381
    for char in s:
        h = ((h << 5) + h) + ord(char)  # h * 33 + c
    return h

# Much better distribution: the keys that collided under
# bad_hash now produce widely separated values
djb2_hash("apple")    # 210706734647
djb2_hash("avocado")  # 229459867332642
djb2_hash("apricot")  # 229459636296215
```
Production Hash Functions
| Function | Speed | Collision Rate | Use Case |
|---|---|---|---|
| MurmurHash3 | Very fast | Low | General purpose, non-cryptographic |
| xxHash | Extremely fast | Very low | High-performance applications |
| CityHash | Fast | Low | Google's general-purpose hasher |
| SHA-256 | Slow | Negligible (collision-resistant) | Security-critical data |
Practical Tips: Handling Collisions in Production
1. Monitor Your Load Factor
Track the ratio of entries to table size. Rehash when it exceeds 0.75 for chaining or 0.7 for open addressing.
```python
load_factor = entries / table_size
if load_factor > 0.75:
    resize_and_rehash()
```
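The resize_and_rehash call above is left abstract. For a separate-chaining table it might look like this (a sketch; the list-of-chains representation is an assumption):

```python
def resize_and_rehash(buckets):
    """Double a chaining table; every entry must be re-bucketed
    because hash(key) % size changes when the size does."""
    new_size = len(buckets) * 2
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets
```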
2. Choose the Right Initial Size
```python
# If you expect 1,000 entries, start with about 1,333 buckets
initial_size = int(expected_entries / 0.75)  # 1000 / 0.75 -> 1333
```
3. Use Language Built-ins
Don’t reinvent the wheel. Use built-in hash tables:
- Python: `dict`
- Java: `HashMap`
- JavaScript: `Map`
- Rust: `HashMap`

These give you well-tested hash functions, load factor management, automatic resizing, and security (randomized hashing) for free.
When Collisions Become a Feature
Interestingly, hash collisions aren’t always bad. Some data structures intentionally use collisions:
Bloom Filters
Bloom filters use multiple hash functions and expect collisions. They can tell you “definitely not present” or “maybe present” with a small memory footprint.
Used by Chrome (malicious URLs), Cassandra (reducing disk reads), and Bitcoin wallets (checking transactions).
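A minimal sketch of the idea (the class shape and parameters are illustrative): derive k bit positions per item, set them on add, and report "maybe present" only if all of them are set on lookup.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = [0] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.bits)

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not present"; True means "maybe present"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("malicious-url.example")
print(bf.might_contain("malicious-url.example"))  # True
print(bf.might_contain("safe-url.example"))       # almost surely False
```

Two items can collide on some bit positions, and that is exactly the accepted trade-off: false positives are possible, false negatives are not.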
Key Takeaways
Hash collisions are inevitable, but manageable:
- Choose the right collision resolution
  - Separate chaining: simple, handles high load factors
  - Open addressing: faster, better cache performance
- Monitor load factor
  - Chaining: keep below ~1.0
  - Open addressing: keep below ~0.7
- Use good hash functions
  - Uniform distribution
  - Fast computation
  - Randomized for security
- Watch for performance degradation
  - Profile your hash distribution
  - Rehash when the load factor gets too high
- Trust modern implementations
  - Java HashMap: switches to trees at 8+ collisions per bucket
  - Python dict: compact, fast open addressing
  - Modern languages: randomized hashing prevents collision attacks
Remember: Your hash table’s O(1) promise depends entirely on how well you handle collisions. Choose wisely, monitor closely, and rehash when needed.
For more on data structures and algorithms, check out our posts on Distributed Systems, Write-Ahead Logs, and System Design at Scale.