376: Probabilistic Bloom Filter
Difficulty: 3 Level: Advanced Space-efficient probabilistic set: false positives possible, false negatives impossible.The Problem This Solves
You're building a cache and want to avoid expensive lookups for keys that definitely aren't cached. Or you're a database checking whether a record exists before hitting disk. Or you're a web crawler tracking visited URLs. In each case, you don't need certainty โ you need a fast "probably yes / definitely no" answer. A `HashSet<T>` gives you exact answers but uses memory proportional to the number of elements. A bloom filter gives you probabilistic answers using a fixed, tiny amount of memory regardless of how many elements you add. A standard bloom filter with 1% false positive rate uses about 9.6 bits per element โ compare that to storing a 64-byte URL string (512 bits) in a hash set. The constraint: you can never remove elements from a standard bloom filter (removing would introduce false negatives). Counting bloom filters and cuckoo filters solve this but at higher cost.The Intuition
A bloom filter is a bit array of M bits plus K hash functions. To add an element: compute K hashes, set those K bit positions to 1. To check membership: compute K hashes, check if all K bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, it's probably in the set โ but a false positive is possible if those bits were set by other elements. The math: false positive probability โ `(1 - e^(-kn/m))^k` where n = elements inserted. You tune M (bits) and K (hash functions) based on your desired false positive rate and expected element count.How It Works in Rust
use bloomfilter::Bloom;
// Create a bloom filter for 1000 items with 1% false positive rate
let mut bloom = Bloom::new_for_fp_rate(1000, 0.01);
bloom.set("alice@example.com");
bloom.set("bob@example.com");
// Definitely not in set
assert!(!bloom.check("unknown@example.com"));
// Probably in set (could be false positive, never false negative)
assert!(bloom.check("alice@example.com"));
println!("Bitmap size: {} bytes", bloom.number_of_bits() / 8);
// Much smaller than storing the strings directly
Add to `Cargo.toml`: `bloomfilter = "1"`
For a manual implementation: allocate a `Vec<u64>` as a bit array, apply 2 hash functions with `k` seeds using double hashing: `hash_i = (h1 + i * h2) % m`.
What This Unlocks
- Cache guards โ check bloom filter before expensive DB/network lookup; skip on definite miss.
- Deduplication at scale โ web crawlers track billions of visited URLs in megabytes of bloom filter memory.
- Distributed systems โ CRDTs, gossip protocols, and anti-entropy use bloom filters to sync state efficiently.
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| Exact set | `Set` (balanced BST) or `Hashtbl` | `HashSet<T>` |
| Probabilistic set | No stdlib equivalent | `bloomfilter::Bloom` or manual bit array |
| Memory use | O(n) regardless | O(1) fixed size โ independent of element count |
| False negatives | Impossible in exact set | Impossible in bloom filter |
| False positives | Impossible in exact set | Tunable probability in bloom filter |
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
struct BloomFilter {
bits: Vec<u64>,
m: usize, // number of bits
k: usize, // number of hash functions
count: usize,
}
impl BloomFilter {
fn new(capacity: usize, fp_rate: f64) -> Self {
// m = -n*ln(p) / (ln2)^2
let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
// k = (m/n) * ln2
let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
let m = m.max(64);
let k = k.max(1);
Self {
bits: vec![0u64; (m + 63) / 64],
m,
k,
count: 0,
}
}
fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
let mut hasher = DefaultHasher::new();
seed.hash(&mut hasher);
item.hash(&mut hasher);
(hasher.finish() as usize) % self.m
}
fn set_bit(&mut self, i: usize) { self.bits[i/64] |= 1u64 << (i%64); }
fn get_bit(&self, i: usize) -> bool { (self.bits[i/64] >> (i%64)) & 1 == 1 }
fn insert<T: Hash>(&mut self, item: &T) {
for seed in 0..self.k as u64 {
let idx = self.hash_val(item, seed);
self.set_bit(idx);
}
self.count += 1;
}
fn contains<T: Hash>(&self, item: &T) -> bool {
(0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
}
fn approx_false_positive_rate(&self) -> f64 {
let bits_set = self.bits.iter().map(|b| b.count_ones() as f64).sum::<f64>();
(bits_set / self.m as f64).powi(self.k as i32)
}
}
fn main() {
let mut bf = BloomFilter::new(1000, 0.01); // 1000 items, 1% FP rate
println!("m={} bits, k={} hash functions", bf.m, bf.k);
let words = ["alice","bob","charlie","diana","eve"];
for w in &words { bf.insert(w); }
for w in &words {
println!("{w}: {}", bf.contains(w)); // should all be true
}
let not_present = ["frank","grace","heidi","ivan","judy"];
let fps: usize = not_present.iter().filter(|&&w| bf.contains(&w)).count();
println!("False positives: {fps}/{} (expected ~0-1)", not_present.len());
println!("FP rate approx: {:.4}", bf.approx_false_positive_rate());
}
#[cfg(test)]
mod tests {
use super::*;
#[test] fn no_false_negatives() {
let mut bf = BloomFilter::new(100, 0.01);
for i in 0..50 { bf.insert(&i); }
for i in 0..50 { assert!(bf.contains(&i), "false negative for {i}"); }
}
#[test] fn not_inserted_usually_absent() {
let mut bf = BloomFilter::new(1000, 0.001);
for i in 0..100i32 { bf.insert(&i); }
// With very low FP rate, most non-members should be absent
let fp: usize = (1000..1100i32).filter(|x| bf.contains(x)).count();
assert!(fp < 10, "too many false positives: {fp}");
}
}
(* OCaml: bloom filter with bitarray *)
let m = 1024 (* bits *)
let bits = Bytes.make (m/8) '\000'
let set_bit i = Bytes.set bits (i/8) (Char.chr (Char.code (Bytes.get bits (i/8)) lor (1 lsl (i mod 8))))
let get_bit i = (Char.code (Bytes.get bits (i/8)) lsr (i mod 8)) land 1 = 1
let hash1 s = (Hashtbl.hash s) mod m
let hash2 s = (Hashtbl.hash (s ^ "salt")) mod m
let hash3 s = (Hashtbl.hash ("x" ^ s)) mod m
let add s = set_bit (hash1 s); set_bit (hash2 s); set_bit (hash3 s)
let might_contain s = get_bit (hash1 s) && get_bit (hash2 s) && get_bit (hash3 s)
let () =
List.iter add ["alice";"bob";"charlie"];
Printf.printf "alice: %b (should be true)\n" (might_contain "alice");
Printf.printf "dave: %b (should be false, maybe fp)\n" (might_contain "dave")