You are building a system that needs to check if a username is already taken. You have 100 million users. Storing all usernames in memory would use gigabytes of RAM.
But what if you could answer “is this username taken?” using only 120 MB of memory? And the check takes constant time, regardless of how many users you have?
That is what Bloom filters do. They trade perfect accuracy for massive memory savings. And for many use cases, that trade-off is worth it.
What is a Bloom Filter?
A Bloom filter is a data structure that answers one question: “Is this element in the set?”
The answer is either:
- Definitely NO - the element is not in the set
- Probably YES - the element might be in the set
That “probably” is important. Bloom filters can have false positives. They might say “yes” when the answer is actually “no”. But they never have false negatives. If they say “no”, the element is definitely not there.
Think of it like a bouncer with a fuzzy memory. He never forgets someone who is actually on the guest list. But occasionally he lets in someone who just looks familiar.
Why Would You Accept False Positives?
Because the memory savings are dramatic.
Let’s say you want to track 1 billion usernames. A traditional hash set would need to store all those strings. With an average username of 15 characters, that’s at least 15 GB just for the strings, plus hash table overhead. Call it 20 GB or more.
A Bloom filter for the same 1 billion items with 1% false positive rate needs about 1.2 GB. That’s 16x less memory.
| Approach | Memory for 1 Billion Items |
|---|---|
| Hash Set (storing strings) | ~20 GB |
| Bloom Filter (1% false positive) | ~1.2 GB |
| Bloom Filter (0.1% false positive) | ~1.8 GB |
The trick is that a Bloom filter does not store the actual elements. It just stores enough information to answer “maybe” or “definitely not”.
How Bloom Filters Work
A Bloom filter has two components:
- A bit array - a fixed-size array of bits, all initialized to 0
- Multiple hash functions - functions that map any input to a position in the array
Adding an Element
To add “hello” to the filter:
- Hash “hello” with each hash function
- Each hash gives you a position in the bit array
- Set all those positions to 1
flowchart LR
A["hello"] --> H1["Hash 1"]
A --> H2["Hash 2"]
A --> H3["Hash 3"]
H1 --> P1["Position 2"]
H2 --> P2["Position 5"]
H3 --> P3["Position 9"]
P1 --> B["Bit Array"]
P2 --> B
P3 --> B
style A fill:#e0f2fe,stroke:#0369a1,stroke-width:2px
style B fill:#dcfce7,stroke:#16a34a,stroke-width:2px
After adding “hello”, positions 2, 5, and 9 are set to 1:
1
2
Index: 0 1 2 3 4 5 6 7 8 9 10 11
Bits: 0 0 1 0 0 1 0 0 0 1 0 0
Checking for an Element
To check if “hello” exists:
- Hash “hello” with the same hash functions
- Check if ALL those positions are 1
- If any position is 0, “hello” was never added
- If all positions are 1, “hello” was probably added
flowchart TD
Q["Check: hello"] --> H["Hash with all functions"]
H --> C["Check positions 2, 5, 9"]
C --> D{All bits = 1?}
D -->|Yes| Y["Probably in set"]
D -->|No| N["Definitely NOT in set"]
style Q fill:#e0f2fe,stroke:#0369a1,stroke-width:2px
style Y fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style N fill:#fee2e2,stroke:#dc2626,stroke-width:2px
Where False Positives Come From
Let’s add another element, “world”:
1
2
3
4
5
6
7
8
9
Before adding "world":
Index: 0 1 2 3 4 5 6 7 8 9 10 11
Bits: 0 0 1 0 0 1 0 0 0 1 0 0
Hash("world") → positions 2, 7, 10
After adding "world":
Index: 0 1 2 3 4 5 6 7 8 9 10 11
Bits: 0 0 1 0 0 1 0 1 0 1 1 0
Notice that position 2 was already set by “hello”. No problem, it stays at 1.
Now imagine we check for “foo”, which was never added:
1
Hash("foo") → positions 2, 5, 9
We check positions 2, 5, and 9. They are all 1 (set by “hello”). The Bloom filter says “probably yes” even though “foo” was never added. That is a false positive.
The more elements you add, the more bits get set to 1, and the higher the chance of false positives.
The Math Behind Bloom Filters
The false positive probability depends on three things:
- m = size of the bit array
- n = number of elements inserted
- k = number of hash functions
The false positive probability is approximately:
1
p = (1 - e^(-kn/m))^k
To achieve a target false positive rate p with n elements, the optimal settings are:
1
2
m = -(n × ln(p)) / (ln(2)^2)
k = (m/n) × ln(2)
In plain English:
| Elements | Target False Positive | Bit Array Size | Hash Functions |
|---|---|---|---|
| 10,000 | 1% | ~12 KB | 7 |
| 100,000 | 1% | ~120 KB | 7 |
| 1,000,000 | 1% | ~1.2 MB | 7 |
| 1,000,000 | 0.1% | ~1.8 MB | 10 |
| 10,000,000 | 1% | ~12 MB | 7 |
For most applications, 7 hash functions with 10 bits per element gives about 1% false positive rate.
Where Bloom Filters Are Used
Bloom filters show up everywhere in production systems.
Databases: Avoiding Disk Reads
Google Bigtable, Apache HBase, Apache Cassandra, and LevelDB all use Bloom filters.
When you query for a row that does not exist, the database could search the entire data file on disk. That is slow. Instead, it checks a Bloom filter first:
sequenceDiagram
participant App
participant BloomFilter
participant Disk
App->>BloomFilter: Is key "user_123" in SSTable?
alt Bloom filter says NO
BloomFilter-->>App: Definitely not present
Note over App: Skip disk read entirely
else Bloom filter says MAYBE
BloomFilter-->>App: Possibly present
App->>Disk: Read SSTable from disk
Disk-->>App: Return data or "not found"
end
Cassandra reports that Bloom filters typically prevent 90%+ of unnecessary disk reads. For workloads with many queries for non-existent keys, this is a massive performance win.
Web Browsers: Malicious URL Checking
Google Chrome needs to check if a URL is in the list of known malicious sites. The full list is too large to download to every browser. Instead, Chrome uses a Bloom filter:
- A compact Bloom filter is distributed with the browser
- When you visit a URL, Chrome checks the local Bloom filter
- If it says “no”, the URL is safe
- If it says “maybe”, Chrome checks the full list on Google’s servers
This keeps most URL checks fast and local while catching dangerous sites.
CDNs: Preventing Cache Pollution
Akamai uses Bloom filters to avoid caching “one-hit wonders” - content that is requested once and never again. Caching everything would fill the cache with useless data.
Their approach:
- First request for content: add URL to Bloom filter, do not cache
- Second request: check Bloom filter, it says “maybe”, now cache it
Content needs at least two requests before it gets cached. The Bloom filter tracks what has been seen without storing every URL.
Content Platforms: Tracking User History
Medium uses Bloom filters to track which articles a user has already seen. Instead of storing a list of article IDs per user, they store a Bloom filter per user.
When generating recommendations, they check the filter to avoid showing the same content twice. A false positive just means occasionally hiding an article the user has not seen. That is acceptable.
Implementation Example
Here is a simple Bloom filter in Python:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
import hashlib
import math
class BloomFilter:
def __init__(self, expected_elements, false_positive_rate):
# Calculate optimal size and hash count
self.size = self._optimal_size(expected_elements, false_positive_rate)
self.hash_count = self._optimal_hash_count(self.size, expected_elements)
self.bit_array = [0] * self.size
def _optimal_size(self, n, p):
"""Calculate optimal bit array size."""
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(m)
def _optimal_hash_count(self, m, n):
"""Calculate optimal number of hash functions."""
k = (m / n) * math.log(2)
return int(k)
def _get_hash_positions(self, item):
"""Get all hash positions for an item."""
positions = []
for i in range(self.hash_count):
# Create different hashes by appending seed
hash_input = f"{item}:{i}".encode()
hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
position = hash_value % self.size
positions.append(position)
return positions
def add(self, item):
"""Add an item to the filter."""
for position in self._get_hash_positions(item):
self.bit_array[position] = 1
def might_contain(self, item):
"""Check if item might be in the filter."""
for position in self._get_hash_positions(item):
if self.bit_array[position] == 0:
return False # Definitely not present
return True # Possibly present
# Usage
bf = BloomFilter(expected_elements=1000, false_positive_rate=0.01)
# Add some items
bf.add("hello")
bf.add("world")
bf.add("foo")
# Check membership
print(bf.might_contain("hello")) # True (correct)
print(bf.might_contain("world")) # True (correct)
print(bf.might_contain("bar")) # False (correct)
print(bf.might_contain("baz")) # Might be True (false positive possible)
For production use, use established libraries:
- Python:
pybloom-live,python-bloomfilter - Java: Guava’s
BloomFilter - Go:
willf/bloom - Redis: Built-in Bloom filter module
Bloom Filter Variants
The standard Bloom filter has limitations. Several variants address them.
Counting Bloom Filter
Problem: Standard Bloom filters do not support deletion. You cannot set a bit back to 0 because other elements might share that position.
Solution: Instead of bits (0 or 1), use counters. Increment when adding, decrement when deleting.
1
2
Standard: [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
Counting: [0, 0, 2, 0, 0, 1, 0, 1, 0, 3, 0, 0]
Trade-off: 3-4x more memory because each position needs multiple bits to store a count.
Scalable Bloom Filter
Problem: Standard Bloom filters have a fixed size. If you add more elements than expected, the false positive rate increases.
Solution: Start with a small filter. When it gets too full, add another filter on top. Check all filters when querying.
This is useful when you do not know the final size upfront.
Cuckoo Filter
Problem: Standard Bloom filters do not support deletion and can have high false positive rates.
Solution: Use a different approach based on cuckoo hashing. Store fingerprints (short hashes) in buckets. Supports deletion and often has better performance than Bloom filters for the same memory.
Trade-off: More complex implementation.
When to Use Bloom Filters
Good Use Cases
- Pre-filtering before expensive operations
- Checking if a key exists before disk read
- Tracking what a user has already seen
- Detecting duplicate events in streaming data
- Spell checkers and dictionary lookups
- Network routers checking packet membership
Avoid When
- You need to retrieve the actual elements
- False positives are not acceptable
- You need to delete elements frequently
- The dataset is small enough for a hash set
- Security decisions depend on the result
- You need exact counts
The key question: Can your application tolerate false positives?
If “yes” means triggering a secondary check (like reading from disk), that is usually fine. The Bloom filter is just an optimization.
If “yes” means letting someone into a system they should not access, do not use a Bloom filter.
Performance Comparison
| Operation | Bloom Filter | Hash Set | Sorted Array |
|---|---|---|---|
| Add | O(k) | O(1) average | O(n) |
| Lookup | O(k) | O(1) average | O(log n) |
| Space (1M items) | ~1.2 MB | ~20 MB | ~15 MB |
| False Positives | Yes (configurable) | No | No |
| Delete | No (standard) | Yes | Yes |
What does O(k) mean? It means the operation takes time proportional to k, the number of hash functions. To add an element, you compute k hashes and set k bits. To check membership, you compute k hashes and check k bits. Since k is small and fixed (typically 7-10), this is effectively constant time. Whether you have 1,000 or 1 billion elements, lookups still take the same number of operations.
Key Takeaways
-
Bloom filters trade accuracy for memory. They can tell you if something is “definitely not” or “probably yes” in the set.
-
False positives are possible, false negatives are not. If the filter says no, the element was never added.
-
Memory savings are massive. 16x or more compared to storing actual elements.
-
Real systems use them everywhere. Cassandra, HBase, Bigtable, Chrome, Akamai, Medium, and many more.
-
Choose parameters based on your tolerance. More memory means fewer false positives.
-
Standard Bloom filters do not support deletion. Use Counting Bloom Filters if you need to remove elements.
-
They are a building block, not a complete solution. Bloom filters work best as a first-line check before more expensive operations.
For more on probabilistic data structures, check out HyperLogLog for cardinality estimation and Count-Min Sketch for frequency counting. For related system design patterns, see How Stripe Prevents Double Payments and Caching Strategies Explained.
References: Bloom Filter on Wikipedia, Apache Cassandra Bloom Filters, Redis Bloom Filter