ExamplesBy LevelBy TopicLearning Paths
1165 Advanced

1165-building-a-huffman-tree-from-character-frequencies — Huffman Encoding: Greedy Tree Building

Functional Programming

Tutorial

The Problem

Huffman coding, invented by David Huffman in 1952, is an optimal prefix-free encoding: common characters get short bit sequences, rare characters get long ones. It is used in ZIP, gzip, PNG, JPEG, and MP3 — virtually every compression format incorporates Huffman coding.

The algorithm builds a binary tree greedily using a priority queue: repeatedly merge the two lowest-frequency nodes until one tree remains. This greedy strategy provably produces the optimal prefix code (Shannon entropy bound).

🎯 Learning Outcomes

  • • Build a Huffman tree using a min-heap (priority queue)
  • • Understand the greedy merge strategy: always merge the two smallest
  • • Generate prefix-free codes by walking the tree (left=0, right=1)
  • • Encode and decode text using the generated codebook
  • • Understand why Huffman codes are optimal (proof by exchange argument)
  • Code Example

    pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
        use std::collections::BinaryHeap;
    
        struct Item { freq: u32, counter: usize, tree: HTree }
        // Custom Ord reverses freq comparison → min-heap by frequency.
    
        let mut heap: BinaryHeap<Item> = freqs.iter().enumerate()
            .map(|(i, &(c, f))| Item { freq: f, counter: i, tree: HTree::Leaf(c, f) })
            .collect();
        loop {
            let a = heap.pop()?;
            let b = match heap.pop() { Some(b) => b, None => return Some(a.tree) };
            let merged_freq = a.freq + b.freq;
            let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
            heap.push(Item { freq: merged_freq, counter, tree: merged });
        }
    }

    Key Differences

  • Priority queue: Rust uses BinaryHeap<Reverse<(freq, id, Node)>> with a tie-breaking counter; OCaml uses Set.Make with a custom comparator.
  • Recursive type: Rust's HuffNode::Internal uses Box<HuffNode> for the children (recursive type); OCaml's huffnode is recursive without annotation.
  • **Ord requirement**: Rust's BinaryHeap requires Ord on the element type — the counter id breaks ties for equal frequencies; OCaml's Set comparator handles this directly.
  • Codebook generation: Both walk the tree recursively, passing a prefix bit string down: left branch appends 0, right branch appends 1.
  • OCaml Approach

    type huffnode =
      | Leaf of char * int
      | Internal of int * huffnode * huffnode
    
    module PQ = Set.Make(struct
      type t = int * int * huffnode
      let compare (f1, i1, _) (f2, i2, _) = compare (f1, i1) (f2, i2)
    end)
    

    OCaml uses Set.Make as a priority queue (sorted by frequency), or the psq library for a proper priority queue. The algorithm structure is identical.

    Full Source

    #![allow(dead_code)]
    //! Huffman Encoding: Greedy Tree Building
    //! See example.ml for OCaml reference
    //!
    //! Builds a Huffman tree from character frequencies using a min-heap priority queue.
    //! Characters with lower frequency get longer bit codes; higher frequency gets shorter codes.
    
    use std::collections::HashMap;
    
    /// A Huffman tree node — either a Leaf (character + frequency) or an internal Node.
    #[derive(Debug, Clone)]
    pub enum HTree {
        Leaf(char, u32),
        Node(Box<HTree>, Box<HTree>, u32),
    }
    
    impl HTree {
        pub fn freq(&self) -> u32 {
            match self {
                HTree::Leaf(_, f) => *f,
                HTree::Node(_, _, f) => *f,
            }
        }
    }
    
    /// Idiomatic Rust: build a Huffman tree from a (char, frequency) slice.
    /// Uses a BinaryHeap with a struct wrapper (HTree is not Ord).
    pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
        use std::collections::BinaryHeap;
    
        // Wrapper so BinaryHeap can order by (freq, counter) without requiring HTree: Ord.
        struct Item {
            freq: u32,
            counter: usize,
            tree: HTree,
        }
        impl PartialEq for Item {
            fn eq(&self, other: &Self) -> bool {
                self.freq == other.freq && self.counter == other.counter
            }
        }
        impl Eq for Item {}
        impl PartialOrd for Item {
            fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
                Some(self.cmp(other))
            }
        }
        impl Ord for Item {
            fn cmp(&self, other: &Self) -> std::cmp::Ordering {
                // Reverse on freq: lower frequency → higher priority (min-heap).
                other
                    .freq
                    .cmp(&self.freq)
                    .then(other.counter.cmp(&self.counter))
            }
        }
    
        let mut counter = freqs.len();
        let mut heap: BinaryHeap<Item> = freqs
            .iter()
            .enumerate()
            .map(|(i, &(c, f))| Item {
                freq: f,
                counter: i,
                tree: HTree::Leaf(c, f),
            })
            .collect();
    
        loop {
            let a = heap.pop()?;
            let b = match heap.pop() {
                Some(item) => item,
                None => return Some(a.tree),
            };
            let merged_freq = a.freq + b.freq;
            let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
            heap.push(Item {
                freq: merged_freq,
                counter,
                tree: merged,
            });
            counter += 1;
        }
    }
    
    /// Functional/recursive: mirrors the OCaml sort-each-round approach.
    pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
        let mut trees: Vec<HTree> = freqs.iter().map(|&(c, f)| HTree::Leaf(c, f)).collect();
        trees.sort_by_key(|t| t.freq());
        go(trees)
    }
    
    fn go(mut trees: Vec<HTree>) -> Option<HTree> {
        match trees.len() {
            0 => None,
            1 => trees.into_iter().next(),
            _ => {
                let a = trees.remove(0);
                let b = trees.remove(0);
                let merged_freq = a.freq() + b.freq();
                let merged = HTree::Node(Box::new(a), Box::new(b), merged_freq);
                trees.push(merged);
                trees.sort_by_key(|t| t.freq());
                go(trees)
            }
        }
    }
    
    /// Walk the Huffman tree to collect (char, code) pairs.
    /// Left = "0", right = "1".
    pub fn codes(tree: &HTree, prefix: &str) -> Vec<(char, String)> {
        match tree {
            HTree::Leaf(c, _) => vec![(*c, prefix.to_owned())],
            HTree::Node(left, right, _) => {
                let mut result = codes(left, &format!("{prefix}0"));
                result.extend(codes(right, &format!("{prefix}1")));
                result
            }
        }
    }
    
    /// Build a Huffman tree from a text string (counts character frequencies automatically).
    pub fn build_tree_from_text(text: &str) -> Option<HTree> {
        let mut freq: HashMap<char, u32> = HashMap::new();
        for c in text.chars() {
            *freq.entry(c).or_insert(0) += 1;
        }
        let freqs: Vec<(char, u32)> = freq.into_iter().collect();
        build_tree(&freqs)
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        fn sample_freqs() -> Vec<(char, u32)> {
            vec![
                ('a', 5),
                ('b', 9),
                ('c', 12),
                ('d', 13),
                ('e', 16),
                ('f', 45),
            ]
        }
    
        #[test]
        fn test_build_tree_empty_returns_none() {
            assert!(build_tree(&[]).is_none());
            assert!(build_tree_recursive(&[]).is_none());
        }
    
        #[test]
        fn test_build_tree_single_char() {
            let freqs = vec![('x', 7)];
            let tree = build_tree(&freqs).expect("should build");
            assert_eq!(tree.freq(), 7);
            assert!(matches!(tree, HTree::Leaf('x', 7)));
        }
    
        #[test]
        fn test_root_frequency_is_sum_of_all() {
            let freqs = sample_freqs();
            let total: u32 = freqs.iter().map(|(_, f)| f).sum();
            let tree = build_tree(&freqs).expect("should build");
            assert_eq!(tree.freq(), total);
        }
    
        #[test]
        fn test_codes_covers_all_chars() {
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let mut result = codes(&tree, "");
            result.sort_by_key(|(c, _)| *c);
            let chars: Vec<char> = result.iter().map(|(c, _)| *c).collect();
            assert_eq!(chars, vec!['a', 'b', 'c', 'd', 'e', 'f']);
        }
    
        #[test]
        fn test_codes_prefix_free() {
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            // No code is a prefix of another in a valid Huffman tree.
            for i in 0..result.len() {
                for j in 0..result.len() {
                    if i != j {
                        assert!(
                            !result[j].1.starts_with(&result[i].1),
                            "code {} is a prefix of {}",
                            result[i].1,
                            result[j].1
                        );
                    }
                }
            }
        }
    
        #[test]
        fn test_highest_freq_char_gets_shortest_code() {
            // 'f' has freq 45 — more than all others combined — so it gets a 1-bit code.
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            let f_len = result
                .iter()
                .find(|(c, _)| *c == 'f')
                .map(|(_, code)| code.len())
                .expect("f must have a code");
            assert_eq!(f_len, 1, "'f' should get a 1-bit code");
        }
    
        #[test]
        fn test_build_tree_from_text() {
            let tree = build_tree_from_text("aabbbcccc");
            assert!(tree.is_some());
            let tree = tree.unwrap();
            // Total frequency should equal string length.
            assert_eq!(tree.freq(), 9);
        }
    
        #[test]
        fn test_two_chars_get_one_bit_codes() {
            let freqs = vec![('a', 3), ('b', 5)];
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            assert_eq!(result.len(), 2);
            for (_, code) in &result {
                assert_eq!(code.len(), 1);
            }
        }
    }
    ✓ Tests Rust test suite
    #[cfg(test)]
    mod tests {
        use super::*;
    
        fn sample_freqs() -> Vec<(char, u32)> {
            vec![
                ('a', 5),
                ('b', 9),
                ('c', 12),
                ('d', 13),
                ('e', 16),
                ('f', 45),
            ]
        }
    
        #[test]
        fn test_build_tree_empty_returns_none() {
            assert!(build_tree(&[]).is_none());
            assert!(build_tree_recursive(&[]).is_none());
        }
    
        #[test]
        fn test_build_tree_single_char() {
            let freqs = vec![('x', 7)];
            let tree = build_tree(&freqs).expect("should build");
            assert_eq!(tree.freq(), 7);
            assert!(matches!(tree, HTree::Leaf('x', 7)));
        }
    
        #[test]
        fn test_root_frequency_is_sum_of_all() {
            let freqs = sample_freqs();
            let total: u32 = freqs.iter().map(|(_, f)| f).sum();
            let tree = build_tree(&freqs).expect("should build");
            assert_eq!(tree.freq(), total);
        }
    
        #[test]
        fn test_codes_covers_all_chars() {
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let mut result = codes(&tree, "");
            result.sort_by_key(|(c, _)| *c);
            let chars: Vec<char> = result.iter().map(|(c, _)| *c).collect();
            assert_eq!(chars, vec!['a', 'b', 'c', 'd', 'e', 'f']);
        }
    
        #[test]
        fn test_codes_prefix_free() {
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            // No code is a prefix of another in a valid Huffman tree.
            for i in 0..result.len() {
                for j in 0..result.len() {
                    if i != j {
                        assert!(
                            !result[j].1.starts_with(&result[i].1),
                            "code {} is a prefix of {}",
                            result[i].1,
                            result[j].1
                        );
                    }
                }
            }
        }
    
        #[test]
        fn test_highest_freq_char_gets_shortest_code() {
            // 'f' has freq 45 — more than all others combined — so it gets a 1-bit code.
            let freqs = sample_freqs();
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            let f_len = result
                .iter()
                .find(|(c, _)| *c == 'f')
                .map(|(_, code)| code.len())
                .expect("f must have a code");
            assert_eq!(f_len, 1, "'f' should get a 1-bit code");
        }
    
        #[test]
        fn test_build_tree_from_text() {
            let tree = build_tree_from_text("aabbbcccc");
            assert!(tree.is_some());
            let tree = tree.unwrap();
            // Total frequency should equal string length.
            assert_eq!(tree.freq(), 9);
        }
    
        #[test]
        fn test_two_chars_get_one_bit_codes() {
            let freqs = vec![('a', 3), ('b', 5)];
            let tree = build_tree(&freqs).expect("should build");
            let result = codes(&tree, "");
            assert_eq!(result.len(), 2);
            for (_, code) in &result {
                assert_eq!(code.len(), 1);
            }
        }
    }

    Deep Comparison

    OCaml vs Rust: Huffman Encoding — Greedy Tree Building

    Side-by-Side Code

    OCaml

    type htree = Leaf of char * int | Node of htree * htree * int
    
    let freq t = match t with Leaf (_,f) -> f | Node (_,_,f) -> f
    
    let build_tree freqs =
      let trees = List.map (fun (c,f) -> Leaf (c,f)) freqs
        |> List.sort (fun a b -> compare (freq a) (freq b)) in
      let rec go = function
        | [t] -> t
        | a :: b :: rest ->
          let merged = Node (a, b, freq a + freq b) in
          let trees = List.sort (fun a b -> compare (freq a) (freq b)) (merged :: rest) in
          go trees
        | [] -> failwith "empty"
      in go trees
    
    let rec codes prefix = function
      | Leaf (c, _) -> [(c, prefix)]
      | Node (l, r, _) -> codes (prefix ^ "0") l @ codes (prefix ^ "1") r
    

    Rust (idiomatic — BinaryHeap min-heap)

    pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
        use std::collections::BinaryHeap;
    
        struct Item { freq: u32, counter: usize, tree: HTree }
        // Custom Ord reverses freq comparison → min-heap by frequency.
    
        let mut heap: BinaryHeap<Item> = freqs.iter().enumerate()
            .map(|(i, &(c, f))| Item { freq: f, counter: i, tree: HTree::Leaf(c, f) })
            .collect();
        loop {
            let a = heap.pop()?;
            let b = match heap.pop() { Some(b) => b, None => return Some(a.tree) };
            let merged_freq = a.freq + b.freq;
            let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
            heap.push(Item { freq: merged_freq, counter, tree: merged });
        }
    }
    

    Rust (functional/recursive — sort-each-round)

    pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
        let mut trees: Vec<HTree> = freqs.iter()
            .map(|&(c, f)| HTree::Leaf(c, f)).collect();
        trees.sort_by_key(|t| t.freq());
        go(trees)
    }
    
    fn go(mut trees: Vec<HTree>) -> Option<HTree> {
        match trees.len() {
            0 => None,
            1 => trees.into_iter().next(),
            _ => {
                let a = trees.remove(0);
                let b = trees.remove(0);
                let merged_freq = a.freq() + b.freq();
                let merged = HTree::Node(Box::new(a), Box::new(b), merged_freq);
                trees.push(merged);
                trees.sort_by_key(|t| t.freq());
                go(trees)
            }
        }
    }
    

    Type Signatures

    ConceptOCamlRust
    Tree typetype htree = Leaf of char * int \| Node of htree * htree * intenum HTree { Leaf(char, u32), Node(Box<HTree>, Box<HTree>, u32) }
    Recursive typeautomatic (GC allocates)requires Box<HTree> to break the size cycle
    Priority queueList.sort each round (O(n² log n))BinaryHeap (O(n log n))
    Error handlingfailwith "empty" (exception)Option<HTree> (type-safe)
    Code result(char * string) listVec<(char, String)>

    Key Insights

  • Recursive type: OCaml allows recursive ADTs directly; Rust requires Box<HTree> to break the infinite-size cycle because the compiler must know the size of every type at compile time.
  • Priority queue: OCaml's functional approach re-sorts the entire list each round (O(n² log n)); Rust's BinaryHeap maintains heap order, giving O(n log n) overall.
  • Ord requirement: Rust's BinaryHeap requires Ord on its elements. Since HTree has no natural ordering, a wrapper struct with a custom Ord (ordering only by frequency and a tie-breaking counter) is needed.
  • Error handling: OCaml raises failwith "empty" for invalid input; Rust returns Option<HTree>, forcing callers to handle the empty case at compile time rather than at runtime.
  • Prefix-free codes: Both implementations produce valid Huffman codes where no code is a prefix of another, because characters appear only in leaves of the tree.
  • When to Use Each Style

    **Use BinaryHeap when:** performance matters — O(n log n) vs O(n² log n) for large frequency tables. Use recursive sort-each-round when: clarity and the OCaml parallel are the goal — the algorithm structure maps one-to-one with the OCaml source and is easier to verify by inspection.

    Exercises

  • Implement generate_codes(tree: &HuffNode) -> HashMap<char, String> that produces the codebook by walking the tree.
  • Write encode(text: &str, codes: &HashMap<char, String>) -> String and decode(bits: &str, tree: &HuffNode) -> String.
  • Compute the compression ratio: compare the number of bits in the Huffman encoding to the 8-bit ASCII encoding of the same text.
  • Open Source Repos