1165-building-a-huffman-tree-from-character-frequencies — Huffman Encoding: Greedy Tree Building
Tutorial
The Problem
Huffman coding, invented by David Huffman in 1952, is an optimal prefix-free encoding: common characters get short bit sequences, rare characters get long ones. It is used in ZIP, gzip, PNG, JPEG, and MP3 — virtually every compression format incorporates Huffman coding.
The algorithm builds a binary tree greedily using a priority queue: repeatedly merge the two lowest-frequency nodes until one tree remains. This greedy strategy provably produces the optimal prefix code (Shannon entropy bound).
🎯 Learning Outcomes
Code Example
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
use std::collections::BinaryHeap;
struct Item { freq: u32, counter: usize, tree: HTree }
// Custom Ord reverses freq comparison → min-heap by frequency.
let mut heap: BinaryHeap<Item> = freqs.iter().enumerate()
.map(|(i, &(c, f))| Item { freq: f, counter: i, tree: HTree::Leaf(c, f) })
.collect();
loop {
let a = heap.pop()?;
let b = match heap.pop() { Some(b) => b, None => return Some(a.tree) };
let merged_freq = a.freq + b.freq;
let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
heap.push(Item { freq: merged_freq, counter, tree: merged });
}
}Key Differences
BinaryHeap<Reverse<(freq, id, Node)>> with a tie-breaking counter; OCaml uses Set.Make with a custom comparator.HuffNode::Internal uses Box<HuffNode> for the children (recursive type); OCaml's huffnode is recursive without annotation.Ord requirement**: Rust's BinaryHeap requires Ord on the element type — the counter id breaks ties for equal frequencies; OCaml's Set comparator handles this directly.0, right branch appends 1.OCaml Approach
type huffnode =
| Leaf of char * int
| Internal of int * huffnode * huffnode
module PQ = Set.Make(struct
type t = int * int * huffnode
let compare (f1, i1, _) (f2, i2, _) = compare (f1, i1) (f2, i2)
end)
OCaml uses Set.Make as a priority queue (sorted by frequency), or the psq library for a proper priority queue. The algorithm structure is identical.
Full Source
#![allow(dead_code)]
//! Huffman Encoding: Greedy Tree Building
//! See example.ml for OCaml reference
//!
//! Builds a Huffman tree from character frequencies using a min-heap priority queue.
//! Characters with lower frequency get longer bit codes; higher frequency gets shorter codes.
use std::collections::HashMap;
/// A Huffman tree node — either a Leaf (character + frequency) or an internal Node.
#[derive(Debug, Clone)]
pub enum HTree {
Leaf(char, u32),
Node(Box<HTree>, Box<HTree>, u32),
}
impl HTree {
pub fn freq(&self) -> u32 {
match self {
HTree::Leaf(_, f) => *f,
HTree::Node(_, _, f) => *f,
}
}
}
/// Idiomatic Rust: build a Huffman tree from a (char, frequency) slice.
/// Uses a BinaryHeap with a struct wrapper (HTree is not Ord).
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
use std::collections::BinaryHeap;
// Wrapper so BinaryHeap can order by (freq, counter) without requiring HTree: Ord.
struct Item {
freq: u32,
counter: usize,
tree: HTree,
}
impl PartialEq for Item {
fn eq(&self, other: &Self) -> bool {
self.freq == other.freq && self.counter == other.counter
}
}
impl Eq for Item {}
impl PartialOrd for Item {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}
impl Ord for Item {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
// Reverse on freq: lower frequency → higher priority (min-heap).
other
.freq
.cmp(&self.freq)
.then(other.counter.cmp(&self.counter))
}
}
let mut counter = freqs.len();
let mut heap: BinaryHeap<Item> = freqs
.iter()
.enumerate()
.map(|(i, &(c, f))| Item {
freq: f,
counter: i,
tree: HTree::Leaf(c, f),
})
.collect();
loop {
let a = heap.pop()?;
let b = match heap.pop() {
Some(item) => item,
None => return Some(a.tree),
};
let merged_freq = a.freq + b.freq;
let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
heap.push(Item {
freq: merged_freq,
counter,
tree: merged,
});
counter += 1;
}
}
/// Functional/recursive: mirrors the OCaml sort-each-round approach.
pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
let mut trees: Vec<HTree> = freqs.iter().map(|&(c, f)| HTree::Leaf(c, f)).collect();
trees.sort_by_key(|t| t.freq());
go(trees)
}
fn go(mut trees: Vec<HTree>) -> Option<HTree> {
match trees.len() {
0 => None,
1 => trees.into_iter().next(),
_ => {
let a = trees.remove(0);
let b = trees.remove(0);
let merged_freq = a.freq() + b.freq();
let merged = HTree::Node(Box::new(a), Box::new(b), merged_freq);
trees.push(merged);
trees.sort_by_key(|t| t.freq());
go(trees)
}
}
}
/// Walk the Huffman tree to collect (char, code) pairs.
/// Left = "0", right = "1".
pub fn codes(tree: &HTree, prefix: &str) -> Vec<(char, String)> {
match tree {
HTree::Leaf(c, _) => vec![(*c, prefix.to_owned())],
HTree::Node(left, right, _) => {
let mut result = codes(left, &format!("{prefix}0"));
result.extend(codes(right, &format!("{prefix}1")));
result
}
}
}
/// Build a Huffman tree from a text string (counts character frequencies automatically).
pub fn build_tree_from_text(text: &str) -> Option<HTree> {
let mut freq: HashMap<char, u32> = HashMap::new();
for c in text.chars() {
*freq.entry(c).or_insert(0) += 1;
}
let freqs: Vec<(char, u32)> = freq.into_iter().collect();
build_tree(&freqs)
}
#[cfg(test)]
mod tests {
use super::*;
fn sample_freqs() -> Vec<(char, u32)> {
vec![
('a', 5),
('b', 9),
('c', 12),
('d', 13),
('e', 16),
('f', 45),
]
}
#[test]
fn test_build_tree_empty_returns_none() {
assert!(build_tree(&[]).is_none());
assert!(build_tree_recursive(&[]).is_none());
}
#[test]
fn test_build_tree_single_char() {
let freqs = vec![('x', 7)];
let tree = build_tree(&freqs).expect("should build");
assert_eq!(tree.freq(), 7);
assert!(matches!(tree, HTree::Leaf('x', 7)));
}
#[test]
fn test_root_frequency_is_sum_of_all() {
let freqs = sample_freqs();
let total: u32 = freqs.iter().map(|(_, f)| f).sum();
let tree = build_tree(&freqs).expect("should build");
assert_eq!(tree.freq(), total);
}
#[test]
fn test_codes_covers_all_chars() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let mut result = codes(&tree, "");
result.sort_by_key(|(c, _)| *c);
let chars: Vec<char> = result.iter().map(|(c, _)| *c).collect();
assert_eq!(chars, vec!['a', 'b', 'c', 'd', 'e', 'f']);
}
#[test]
fn test_codes_prefix_free() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
// No code is a prefix of another in a valid Huffman tree.
for i in 0..result.len() {
for j in 0..result.len() {
if i != j {
assert!(
!result[j].1.starts_with(&result[i].1),
"code {} is a prefix of {}",
result[i].1,
result[j].1
);
}
}
}
}
#[test]
fn test_highest_freq_char_gets_shortest_code() {
// 'f' has freq 45 — more than all others combined — so it gets a 1-bit code.
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
let f_len = result
.iter()
.find(|(c, _)| *c == 'f')
.map(|(_, code)| code.len())
.expect("f must have a code");
assert_eq!(f_len, 1, "'f' should get a 1-bit code");
}
#[test]
fn test_build_tree_from_text() {
let tree = build_tree_from_text("aabbbcccc");
assert!(tree.is_some());
let tree = tree.unwrap();
// Total frequency should equal string length.
assert_eq!(tree.freq(), 9);
}
#[test]
fn test_two_chars_get_one_bit_codes() {
let freqs = vec![('a', 3), ('b', 5)];
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
assert_eq!(result.len(), 2);
for (_, code) in &result {
assert_eq!(code.len(), 1);
}
}
}#[cfg(test)]
mod tests {
use super::*;
fn sample_freqs() -> Vec<(char, u32)> {
vec![
('a', 5),
('b', 9),
('c', 12),
('d', 13),
('e', 16),
('f', 45),
]
}
#[test]
fn test_build_tree_empty_returns_none() {
assert!(build_tree(&[]).is_none());
assert!(build_tree_recursive(&[]).is_none());
}
#[test]
fn test_build_tree_single_char() {
let freqs = vec![('x', 7)];
let tree = build_tree(&freqs).expect("should build");
assert_eq!(tree.freq(), 7);
assert!(matches!(tree, HTree::Leaf('x', 7)));
}
#[test]
fn test_root_frequency_is_sum_of_all() {
let freqs = sample_freqs();
let total: u32 = freqs.iter().map(|(_, f)| f).sum();
let tree = build_tree(&freqs).expect("should build");
assert_eq!(tree.freq(), total);
}
#[test]
fn test_codes_covers_all_chars() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let mut result = codes(&tree, "");
result.sort_by_key(|(c, _)| *c);
let chars: Vec<char> = result.iter().map(|(c, _)| *c).collect();
assert_eq!(chars, vec!['a', 'b', 'c', 'd', 'e', 'f']);
}
#[test]
fn test_codes_prefix_free() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
// No code is a prefix of another in a valid Huffman tree.
for i in 0..result.len() {
for j in 0..result.len() {
if i != j {
assert!(
!result[j].1.starts_with(&result[i].1),
"code {} is a prefix of {}",
result[i].1,
result[j].1
);
}
}
}
}
#[test]
fn test_highest_freq_char_gets_shortest_code() {
// 'f' has freq 45 — more than all others combined — so it gets a 1-bit code.
let freqs = sample_freqs();
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
let f_len = result
.iter()
.find(|(c, _)| *c == 'f')
.map(|(_, code)| code.len())
.expect("f must have a code");
assert_eq!(f_len, 1, "'f' should get a 1-bit code");
}
#[test]
fn test_build_tree_from_text() {
let tree = build_tree_from_text("aabbbcccc");
assert!(tree.is_some());
let tree = tree.unwrap();
// Total frequency should equal string length.
assert_eq!(tree.freq(), 9);
}
#[test]
fn test_two_chars_get_one_bit_codes() {
let freqs = vec![('a', 3), ('b', 5)];
let tree = build_tree(&freqs).expect("should build");
let result = codes(&tree, "");
assert_eq!(result.len(), 2);
for (_, code) in &result {
assert_eq!(code.len(), 1);
}
}
}
Deep Comparison
OCaml vs Rust: Huffman Encoding — Greedy Tree Building
Side-by-Side Code
OCaml
type htree = Leaf of char * int | Node of htree * htree * int
let freq t = match t with Leaf (_,f) -> f | Node (_,_,f) -> f
let build_tree freqs =
let trees = List.map (fun (c,f) -> Leaf (c,f)) freqs
|> List.sort (fun a b -> compare (freq a) (freq b)) in
let rec go = function
| [t] -> t
| a :: b :: rest ->
let merged = Node (a, b, freq a + freq b) in
let trees = List.sort (fun a b -> compare (freq a) (freq b)) (merged :: rest) in
go trees
| [] -> failwith "empty"
in go trees
let rec codes prefix = function
| Leaf (c, _) -> [(c, prefix)]
| Node (l, r, _) -> codes (prefix ^ "0") l @ codes (prefix ^ "1") r
Rust (idiomatic — BinaryHeap min-heap)
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
use std::collections::BinaryHeap;
struct Item { freq: u32, counter: usize, tree: HTree }
// Custom Ord reverses freq comparison → min-heap by frequency.
let mut heap: BinaryHeap<Item> = freqs.iter().enumerate()
.map(|(i, &(c, f))| Item { freq: f, counter: i, tree: HTree::Leaf(c, f) })
.collect();
loop {
let a = heap.pop()?;
let b = match heap.pop() { Some(b) => b, None => return Some(a.tree) };
let merged_freq = a.freq + b.freq;
let merged = HTree::Node(Box::new(a.tree), Box::new(b.tree), merged_freq);
heap.push(Item { freq: merged_freq, counter, tree: merged });
}
}
Rust (functional/recursive — sort-each-round)
pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
let mut trees: Vec<HTree> = freqs.iter()
.map(|&(c, f)| HTree::Leaf(c, f)).collect();
trees.sort_by_key(|t| t.freq());
go(trees)
}
fn go(mut trees: Vec<HTree>) -> Option<HTree> {
match trees.len() {
0 => None,
1 => trees.into_iter().next(),
_ => {
let a = trees.remove(0);
let b = trees.remove(0);
let merged_freq = a.freq() + b.freq();
let merged = HTree::Node(Box::new(a), Box::new(b), merged_freq);
trees.push(merged);
trees.sort_by_key(|t| t.freq());
go(trees)
}
}
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| Tree type | type htree = Leaf of char * int \| Node of htree * htree * int | enum HTree { Leaf(char, u32), Node(Box<HTree>, Box<HTree>, u32) } |
| Recursive type | automatic (GC allocates) | requires Box<HTree> to break the size cycle |
| Priority queue | List.sort each round (O(n² log n)) | BinaryHeap (O(n log n)) |
| Error handling | failwith "empty" (exception) | Option<HTree> (type-safe) |
| Code result | (char * string) list | Vec<(char, String)> |
Key Insights
Box<HTree> to break the infinite-size cycle because the compiler must know the size of every type at compile time.BinaryHeap maintains heap order, giving O(n log n) overall.BinaryHeap requires Ord on its elements. Since HTree has no natural ordering, a wrapper struct with a custom Ord (ordering only by frequency and a tie-breaking counter) is needed.failwith "empty" for invalid input; Rust returns Option<HTree>, forcing callers to handle the empty case at compile time rather than at runtime.When to Use Each Style
**Use BinaryHeap when:** performance matters — O(n log n) vs O(n² log n) for large frequency tables.
Use recursive sort-each-round when: clarity and the OCaml parallel are the goal — the algorithm structure maps one-to-one with the OCaml source and is easier to verify by inspection.
Exercises
generate_codes(tree: &HuffNode) -> HashMap<char, String> that produces the codebook by walking the tree.encode(text: &str, codes: &HashMap<char, String>) -> String and decode(bits: &str, tree: &HuffNode) -> String.