Huffman Encoding
Tutorial
The Problem
Build an optimal prefix-free binary code for a set of characters with given frequencies using Huffman's greedy algorithm: repeatedly merge the two lowest-frequency nodes until a single tree remains, then assign "0"/"1" on left/right branches to derive each character's code.
🎯 Learning Outcomes
type htree = Leaf | Node) into an owned Rust enum with Box<HTree> children
BinaryHeap<Reverse<W>> for a min-heap without unsafe or external cratesOrd wrapper (FreqOrd) keeps ordering logic separate from thedata type
Vec with remove(0) in Rust,and why the BinaryHeap version is the idiomatic upgrade
🦀 The Rust Way
The idiomatic Rust solution wraps each HTree in a FreqOrd newtype that
implements Ord by frequency, then uses BinaryHeap<Reverse<FreqOrd>> to get
O(log n) insert/remove instead of O(n) re-sort. The recursive solution mirrors
OCaml's list-sort style exactly, trading efficiency for directness. Both share
the same codes traversal function.
Code Example
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;
pub enum HTree {
Leaf(char, u32),
Node(Box<HTree>, Box<HTree>, u32),
}
// Newtype that orders by frequency only, enabling min-heap via Reverse<FreqOrd>
struct FreqOrd(HTree);
impl Ord for FreqOrd {
fn cmp(&self, other: &Self) -> Ordering {
self.0.freq().cmp(&other.0.freq())
}
}
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
let mut heap: BinaryHeap<Reverse<FreqOrd>> = freqs
.iter()
.map(|&(c, f)| Reverse(FreqOrd(HTree::Leaf(c, f))))
.collect();
while heap.len() > 1 {
let Reverse(FreqOrd(a)) = heap.pop()?;
let Reverse(FreqOrd(b)) = heap.pop()?;
let freq = a.freq() + b.freq();
heap.push(Reverse(FreqOrd(HTree::Node(Box::new(a), Box::new(b), freq))));
}
heap.pop().map(|Reverse(FreqOrd(t))| t)
}Key Differences
type htree = ⦠| Node of htree * htree * int is natively recursive; Rust requires Box<HTree> to give the enum a known
size on the stack.
List.sort re-sorts each step; Rust offers BinaryHeap (max-heap) which becomes a min-heap via Reverse<W>.
Rust externalises ordering into a newtype (FreqOrd) implementing the Ord
trait, keeping HTree itself free of ordering concerns.
^ (string concat) naturally in a recursive call; Rust uses format!("{prefix}0") passing an owned String down the
call stack to avoid repeated allocations at the top level.
OCaml Approach
OCaml represents the tree as a sum type and builds it by sorting a list on each
iteration. List.sort re-runs on every merge step, giving O(n² log n) total
work, but the code is compact and transparent. Code generation is a natural
structural recursion over the tree, concatenating prefix strings.
Full Source
#![allow(dead_code)]
//! 101: Huffman Encoding ā Greedy Tree Building
//! Builds an optimal prefix-free code tree by repeatedly merging the two
//! lowest-frequency nodes, mirroring the OCaml pattern-match / list-sort approach.
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;
// āā Tree type āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
#[derive(Debug)]
pub enum HTree {
Leaf(char, u32),
Node(Box<HTree>, Box<HTree>, u32),
}
impl HTree {
pub fn freq(&self) -> u32 {
match self {
HTree::Leaf(_, f) | HTree::Node(_, _, f) => *f,
}
}
}
// Wrapper so BinaryHeap<Reverse<FreqOrd>> acts as a min-heap by frequency.
// All four Ord traits must be consistent; we use frequency only.
struct FreqOrd(HTree);
impl PartialEq for FreqOrd {
fn eq(&self, other: &Self) -> bool {
self.0.freq() == other.0.freq()
}
}
impl Eq for FreqOrd {}
impl Ord for FreqOrd {
fn cmp(&self, other: &Self) -> Ordering {
self.0.freq().cmp(&other.0.freq())
}
}
impl PartialOrd for FreqOrd {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(other))
}
}
// āā Solution 1: Idiomatic Rust ā BinaryHeap min-heap āāāāāāāāāāāāāāāāāāāāāāāā
//
// Uses std's BinaryHeap (max-heap) wrapped with Reverse to get a min-heap.
// Each iteration pops the two cheapest nodes and pushes their merged parent.
// O(n log n) ā same asymptotic cost as the OCaml sort-per-step version but
// with O(log n) insert/remove instead of O(n) re-sort.
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
if freqs.is_empty() {
return None;
}
let mut heap: BinaryHeap<Reverse<FreqOrd>> = freqs
.iter()
.map(|&(c, f)| Reverse(FreqOrd(HTree::Leaf(c, f))))
.collect();
while heap.len() > 1 {
let Reverse(FreqOrd(a)) = heap.pop()?;
let Reverse(FreqOrd(b)) = heap.pop()?;
let freq = a.freq() + b.freq();
heap.push(Reverse(FreqOrd(HTree::Node(
Box::new(a),
Box::new(b),
freq,
))));
}
heap.pop().map(|Reverse(FreqOrd(t))| t)
}
// āā Solution 2: Functional/recursive ā mirrors OCaml list-sort style āāāāāāāāā
//
// Keeps a sorted Vec and re-sorts after each merge, exactly as the OCaml
// `go` function does. More transparent but O(n² log n).
pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
if freqs.is_empty() {
return None;
}
let mut trees: Vec<HTree> = freqs.iter().map(|&(c, f)| HTree::Leaf(c, f)).collect();
trees.sort_by_key(|t| t.freq());
fn go(mut trees: Vec<HTree>) -> Option<HTree> {
match trees.len() {
0 => None,
1 => trees.into_iter().next(),
_ => {
// Remove two cheapest (front of sorted list)
let a = trees.remove(0);
let b = trees.remove(0);
let freq = a.freq() + b.freq();
let merged = HTree::Node(Box::new(a), Box::new(b), freq);
// Re-insert and re-sort, then recurse
let mut next: Vec<HTree> = std::iter::once(merged).chain(trees).collect();
next.sort_by_key(|t| t.freq());
go(next)
}
}
}
go(trees)
}
// āā Code generation āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
//
// Traverse the tree, prepending "0" for left branches and "1" for right.
// Returns (char, code-string) pairs; leaves carry the final code.
pub fn codes(tree: &HTree) -> Vec<(char, String)> {
fn go(node: &HTree, prefix: String, out: &mut Vec<(char, String)>) {
match node {
HTree::Leaf(c, _) => out.push((*c, prefix)),
HTree::Node(left, right, _) => {
go(left, format!("{prefix}0"), out);
go(right, format!("{prefix}1"), out);
}
}
}
let mut out = Vec::new();
go(tree, String::new(), &mut out);
out
}
// āā Tests āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
#[cfg(test)]
mod tests {
use super::*;
use std::collections::HashSet;
fn sample_freqs() -> Vec<(char, u32)> {
vec![
('a', 5),
('b', 9),
('c', 12),
('d', 13),
('e', 16),
('f', 45),
]
}
#[test]
fn test_single_char_produces_one_empty_code() {
let tree = build_tree(&[('x', 7)]).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 1);
assert_eq!(c[0].0, 'x');
// A single leaf has no branching; the convention gives it an empty prefix.
assert_eq!(c[0].1, "");
}
#[test]
fn test_two_chars_get_single_bit_codes() {
let tree = build_tree(&[('a', 3), ('b', 7)]).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 2);
// Both codes must be exactly 1 bit
assert!(c.iter().all(|(_, code)| code.len() == 1));
// Codes must differ
assert_ne!(c[0].1, c[1].1);
}
#[test]
fn test_highest_frequency_gets_shortest_code() {
let tree = build_tree(&sample_freqs()).unwrap();
let c = codes(&tree);
let f_len = c.iter().find(|(ch, _)| *ch == 'f').unwrap().1.len();
let a_len = c.iter().find(|(ch, _)| *ch == 'a').unwrap().1.len();
// f (freq 45) must have a strictly shorter code than a (freq 5)
assert!(
f_len < a_len,
"f code len {f_len} should be < a code len {a_len}"
);
}
#[test]
fn test_codes_are_unique_and_correct_count() {
let tree = build_tree(&sample_freqs()).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 6);
let unique: HashSet<&str> = c.iter().map(|(_, s)| s.as_str()).collect();
assert_eq!(unique.len(), 6, "all codes must be distinct");
}
#[test]
fn test_root_frequency_equals_total() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).unwrap();
let total: u32 = freqs.iter().map(|(_, f)| f).sum();
assert_eq!(tree.freq(), total);
}
#[test]
fn test_recursive_and_heap_produce_same_total_frequency() {
let freqs = sample_freqs();
let t1 = build_tree(&freqs).unwrap();
let t2 = build_tree_recursive(&freqs).unwrap();
assert_eq!(t1.freq(), t2.freq());
assert_eq!(codes(&t1).len(), codes(&t2).len());
}
#[test]
fn test_empty_input_returns_none() {
assert!(build_tree(&[]).is_none());
assert!(build_tree_recursive(&[]).is_none());
}
#[test]
fn test_all_codes_are_binary_strings() {
let tree = build_tree(&sample_freqs()).unwrap();
for (_, code) in codes(&tree) {
assert!(
code.chars().all(|c| c == '0' || c == '1'),
"code {code:?} contains non-binary characters"
);
}
}
}#[cfg(test)]
mod tests {
use super::*;
use std::collections::HashSet;
fn sample_freqs() -> Vec<(char, u32)> {
vec![
('a', 5),
('b', 9),
('c', 12),
('d', 13),
('e', 16),
('f', 45),
]
}
#[test]
fn test_single_char_produces_one_empty_code() {
let tree = build_tree(&[('x', 7)]).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 1);
assert_eq!(c[0].0, 'x');
// A single leaf has no branching; the convention gives it an empty prefix.
assert_eq!(c[0].1, "");
}
#[test]
fn test_two_chars_get_single_bit_codes() {
let tree = build_tree(&[('a', 3), ('b', 7)]).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 2);
// Both codes must be exactly 1 bit
assert!(c.iter().all(|(_, code)| code.len() == 1));
// Codes must differ
assert_ne!(c[0].1, c[1].1);
}
#[test]
fn test_highest_frequency_gets_shortest_code() {
let tree = build_tree(&sample_freqs()).unwrap();
let c = codes(&tree);
let f_len = c.iter().find(|(ch, _)| *ch == 'f').unwrap().1.len();
let a_len = c.iter().find(|(ch, _)| *ch == 'a').unwrap().1.len();
// f (freq 45) must have a strictly shorter code than a (freq 5)
assert!(
f_len < a_len,
"f code len {f_len} should be < a code len {a_len}"
);
}
#[test]
fn test_codes_are_unique_and_correct_count() {
let tree = build_tree(&sample_freqs()).unwrap();
let c = codes(&tree);
assert_eq!(c.len(), 6);
let unique: HashSet<&str> = c.iter().map(|(_, s)| s.as_str()).collect();
assert_eq!(unique.len(), 6, "all codes must be distinct");
}
#[test]
fn test_root_frequency_equals_total() {
let freqs = sample_freqs();
let tree = build_tree(&freqs).unwrap();
let total: u32 = freqs.iter().map(|(_, f)| f).sum();
assert_eq!(tree.freq(), total);
}
#[test]
fn test_recursive_and_heap_produce_same_total_frequency() {
let freqs = sample_freqs();
let t1 = build_tree(&freqs).unwrap();
let t2 = build_tree_recursive(&freqs).unwrap();
assert_eq!(t1.freq(), t2.freq());
assert_eq!(codes(&t1).len(), codes(&t2).len());
}
#[test]
fn test_empty_input_returns_none() {
assert!(build_tree(&[]).is_none());
assert!(build_tree_recursive(&[]).is_none());
}
#[test]
fn test_all_codes_are_binary_strings() {
let tree = build_tree(&sample_freqs()).unwrap();
for (_, code) in codes(&tree) {
assert!(
code.chars().all(|c| c == '0' || c == '1'),
"code {code:?} contains non-binary characters"
);
}
}
}
Deep Comparison
OCaml vs Rust: Huffman Encoding
Side-by-Side Code
OCaml
type htree = Leaf of char * int | Node of htree * htree * int
let freq = function Leaf (_,f) -> f | Node (_,_,f) -> f
let build_tree freqs =
let trees = List.map (fun (c,f) -> Leaf (c,f)) freqs
|> List.sort (fun a b -> compare (freq a) (freq b)) in
let rec go = function
| [t] -> t
| a :: b :: rest ->
let merged = Node (a, b, freq a + freq b) in
let trees = List.sort (fun a b -> compare (freq a) (freq b)) (merged :: rest) in
go trees
| [] -> failwith "empty"
in go trees
let rec codes prefix = function
| Leaf (c, _) -> [(c, prefix)]
| Node (l, r, _) -> codes (prefix ^ "0") l @ codes (prefix ^ "1") r
Rust (idiomatic ā BinaryHeap min-heap)
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;
pub enum HTree {
Leaf(char, u32),
Node(Box<HTree>, Box<HTree>, u32),
}
// Newtype that orders by frequency only, enabling min-heap via Reverse<FreqOrd>
struct FreqOrd(HTree);
impl Ord for FreqOrd {
fn cmp(&self, other: &Self) -> Ordering {
self.0.freq().cmp(&other.0.freq())
}
}
pub fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> {
let mut heap: BinaryHeap<Reverse<FreqOrd>> = freqs
.iter()
.map(|&(c, f)| Reverse(FreqOrd(HTree::Leaf(c, f))))
.collect();
while heap.len() > 1 {
let Reverse(FreqOrd(a)) = heap.pop()?;
let Reverse(FreqOrd(b)) = heap.pop()?;
let freq = a.freq() + b.freq();
heap.push(Reverse(FreqOrd(HTree::Node(Box::new(a), Box::new(b), freq))));
}
heap.pop().map(|Reverse(FreqOrd(t))| t)
}
Rust (functional/recursive ā mirrors OCaml list-sort)
pub fn build_tree_recursive(freqs: &[(char, u32)]) -> Option<HTree> {
let mut trees: Vec<HTree> = freqs.iter()
.map(|&(c, f)| HTree::Leaf(c, f))
.collect();
trees.sort_by_key(|t| t.freq());
fn go(mut trees: Vec<HTree>) -> Option<HTree> {
match trees.len() {
0 => None,
1 => trees.into_iter().next(),
_ => {
let a = trees.remove(0);
let b = trees.remove(0);
let freq = a.freq() + b.freq();
let merged = HTree::Node(Box::new(a), Box::new(b), freq);
let mut next: Vec<HTree> = std::iter::once(merged).chain(trees).collect();
next.sort_by_key(|t| t.freq());
go(next)
}
}
}
go(trees)
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| Tree type | type htree = Leaf of char * int \| Node of htree * htree * int | enum HTree { Leaf(char, u32), Node(Box<HTree>, Box<HTree>, u32) } |
| Frequency accessor | let freq = function Leaf (_,f) -> f \| Node (_,_,f) -> f | fn freq(&self) -> u32 { match self { Leaf(_, f) \| Node(_, _, f) => *f } } |
| Build function | val build_tree : (char * int) list -> htree | fn build_tree(freqs: &[(char, u32)]) -> Option<HTree> |
| Code generator | val codes : string -> htree -> (char * string) list | fn codes(tree: &HTree) -> Vec<(char, String)> |
| Priority queue | List.sort (re-sort each step) | BinaryHeap<Reverse<FreqOrd>> |
Key Insights
htree is natively recursive because the runtime uses heap-allocated values uniformly. Rust's enum must know its
size at compile time, so child nodes must be Box<HTree>. This is not
boilerplate ā it is an explicit ownership declaration.
idiomatic choice is List.sort. Rust's std::collections::BinaryHeap is a
max-heap; wrapping values in Reverse<T> flips the ordering to give an
efficient O(log n) min-heap, replacing the O(n log n) re-sort in the
recursive version.
function (fun a b -> compare (freq a) (freq b)) directly into List.sort.
Rust encodes ordering in the type system via impl Ord for FreqOrd, keeping
comparison logic declared once and reused implicitly wherever the type is
ordered. The FreqOrd newtype isolates this concern so HTree itself
carries no ordering assumption.
^ operator copies both operands into a new string on every recursive call ā O(depth) allocation per
leaf. The Rust version passes an owned String down the call stack with
format!("{prefix}0"), which is the same cost but makes the allocation
explicit and borrow-checker-safe without needing &str lifetime annotations.
go function matches | a :: b :: rest on a list head. Rust uses trees.remove(0) on a Vec,
which is O(n) but equivalent. A slice-pattern version [a, b, rest @ ..]
is possible with Box<[HTree]> but adds complexity for no algorithmic gain.
When to Use Each Style
Use idiomatic Rust (BinaryHeap) when: building or processing large priority queues where O(log n) insert/remove matters ā real compression pipelines, scheduler implementations, or any greedy algorithm over many elements.
Use recursive Rust (Vec + sort) when: the data set is small (< 256 symbols for standard Huffman), the directness of the OCaml translation aids comprehension, or you are porting OCaml code and want a mechanical first-pass before optimising.