๐Ÿฆ€ Functional Rust

376: Probabilistic Bloom Filter

Difficulty: 3 Level: Advanced Space-efficient probabilistic set: false positives possible, false negatives impossible.

The Problem This Solves

You're building a cache and want to avoid expensive lookups for keys that definitely aren't cached. Or you're a database checking whether a record exists before hitting disk. Or you're a web crawler tracking visited URLs. In each case, you don't need certainty โ€” you need a fast "probably yes / definitely no" answer. A `HashSet<T>` gives you exact answers but uses memory proportional to the number of elements. A bloom filter gives you probabilistic answers using a fixed, tiny amount of memory regardless of how many elements you add. A standard bloom filter with 1% false positive rate uses about 9.6 bits per element โ€” compare that to storing a 64-byte URL string (512 bits) in a hash set. The constraint: you can never remove elements from a standard bloom filter (removing would introduce false negatives). Counting bloom filters and cuckoo filters solve this but at higher cost.

The Intuition

A bloom filter is a bit array of M bits plus K hash functions. To add an element: compute K hashes, set those K bit positions to 1. To check membership: compute K hashes, check if all K bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, it's probably in the set โ€” but a false positive is possible if those bits were set by other elements. The math: false positive probability โ‰ˆ `(1 - e^(-kn/m))^k` where n = elements inserted. You tune M (bits) and K (hash functions) based on your desired false positive rate and expected element count.

How It Works in Rust

use bloomfilter::Bloom;

// Create a bloom filter for 1000 items with 1% false positive rate
let mut bloom = Bloom::new_for_fp_rate(1000, 0.01);

bloom.set("alice@example.com");
bloom.set("bob@example.com");

// Definitely not in set
assert!(!bloom.check("unknown@example.com"));

// Probably in set (could be false positive, never false negative)
assert!(bloom.check("alice@example.com"));

println!("Bitmap size: {} bytes", bloom.number_of_bits() / 8);
// Much smaller than storing the strings directly
Add to `Cargo.toml`: `bloomfilter = "1"` For a manual implementation: allocate a `Vec<u64>` as a bit array, apply 2 hash functions with `k` seeds using double hashing: `hash_i = (h1 + i * h2) % m`.

What This Unlocks

Key Differences

ConceptOCamlRust
Exact set`Set` (balanced BST) or `Hashtbl``HashSet<T>`
Probabilistic setNo stdlib equivalent`bloomfilter::Bloom` or manual bit array
Memory useO(n) regardlessO(1) fixed size โ€” independent of element count
False negativesImpossible in exact setImpossible in bloom filter
False positivesImpossible in exact setTunable probability in bloom filter
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<u64>,
    m: usize, // number of bits
    k: usize, // number of hash functions
    count: usize,
}

impl BloomFilter {
    fn new(capacity: usize, fp_rate: f64) -> Self {
        // m = -n*ln(p) / (ln2)^2
        let m = (-(capacity as f64 * fp_rate.ln()) / (2f64.ln().powi(2))).ceil() as usize;
        // k = (m/n) * ln2
        let k = ((m as f64 / capacity as f64) * 2f64.ln()).ceil() as usize;
        let m = m.max(64);
        let k = k.max(1);
        Self {
            bits: vec![0u64; (m + 63) / 64],
            m,
            k,
            count: 0,
        }
    }

    fn hash_val<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.m
    }

    fn set_bit(&mut self, i: usize) { self.bits[i/64] |= 1u64 << (i%64); }
    fn get_bit(&self, i: usize) -> bool { (self.bits[i/64] >> (i%64)) & 1 == 1 }

    fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.k as u64 {
            let idx = self.hash_val(item, seed);
            self.set_bit(idx);
        }
        self.count += 1;
    }

    fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..self.k as u64).all(|seed| self.get_bit(self.hash_val(item, seed)))
    }

    fn approx_false_positive_rate(&self) -> f64 {
        let bits_set = self.bits.iter().map(|b| b.count_ones() as f64).sum::<f64>();
        (bits_set / self.m as f64).powi(self.k as i32)
    }
}

fn main() {
    let mut bf = BloomFilter::new(1000, 0.01); // 1000 items, 1% FP rate
    println!("m={} bits, k={} hash functions", bf.m, bf.k);

    let words = ["alice","bob","charlie","diana","eve"];
    for w in &words { bf.insert(w); }

    for w in &words {
        println!("{w}: {}", bf.contains(w)); // should all be true
    }

    let not_present = ["frank","grace","heidi","ivan","judy"];
    let fps: usize = not_present.iter().filter(|&&w| bf.contains(&w)).count();
    println!("False positives: {fps}/{} (expected ~0-1)", not_present.len());
    println!("FP rate approx: {:.4}", bf.approx_false_positive_rate());
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test] fn no_false_negatives() {
        let mut bf = BloomFilter::new(100, 0.01);
        for i in 0..50 { bf.insert(&i); }
        for i in 0..50 { assert!(bf.contains(&i), "false negative for {i}"); }
    }
    #[test] fn not_inserted_usually_absent() {
        let mut bf = BloomFilter::new(1000, 0.001);
        for i in 0..100i32 { bf.insert(&i); }
        // With very low FP rate, most non-members should be absent
        let fp: usize = (1000..1100i32).filter(|x| bf.contains(x)).count();
        assert!(fp < 10, "too many false positives: {fp}");
    }
}
(* OCaml: bloom filter with bitarray *)

let m = 1024  (* bits *)
let bits = Bytes.make (m/8) '\000'

let set_bit i = Bytes.set bits (i/8) (Char.chr (Char.code (Bytes.get bits (i/8)) lor (1 lsl (i mod 8))))
let get_bit i = (Char.code (Bytes.get bits (i/8)) lsr (i mod 8)) land 1 = 1

let hash1 s = (Hashtbl.hash s) mod m
let hash2 s = (Hashtbl.hash (s ^ "salt")) mod m
let hash3 s = (Hashtbl.hash ("x" ^ s)) mod m

let add s = set_bit (hash1 s); set_bit (hash2 s); set_bit (hash3 s)
let might_contain s = get_bit (hash1 s) && get_bit (hash2 s) && get_bit (hash3 s)

let () =
  List.iter add ["alice";"bob";"charlie"];
  Printf.printf "alice: %b (should be true)\n" (might_contain "alice");
  Printf.printf "dave: %b (should be false, maybe fp)\n" (might_contain "dave")