073: Word Count with Map
Difficulty: 3 Level: Intermediate Build a word-frequency map from text โ tokenize, normalize, and fold into a `HashMap` in one pipeline.The Problem This Solves
Word frequency is a foundational text-processing task: search engines, spell checkers, content analysis, and compression algorithms all start here. You need to split text into words, normalize them (lowercase, strip punctuation), and count occurrences. Doing this naively requires multiple loops and mutable bookkeeping. The functional approach chains tokenization and accumulation into a clean pipeline โ and choosing between `HashMap` (fast) and `BTreeMap` (sorted) is just a type swap.The Intuition
The OCaml pattern uses `StringMap.add` with functional update: each word addition returns a new map. Rust's `HashMap` uses `.entry().or_insert(0)` to mutate in place โ more efficient but slightly more verbose. The `entry` API is the idiomatic Rust idiom for "insert if missing, then update." It's one atomic operation: find the slot, create it with a default if empty, return a mutable reference to the value. Python equivalent: `counts[word] = counts.get(word, 0) + 1`. Rust: `*map.entry(word).or_insert(0) += 1`.How It Works in Rust
use std::collections::HashMap;
// Tokenize: lowercase everything, extract alphanumeric words
pub fn tokenize(s: &str) -> Vec<String> {
let s = s.to_lowercase();
let mut words = Vec::new();
let mut buf = String::new();
for c in s.chars() {
if c.is_alphanumeric() {
buf.push(c);
} else if !buf.is_empty() {
words.push(buf.clone());
buf.clear();
}
}
if !buf.is_empty() { words.push(buf); } // flush last word
words
}
// Imperative style: entry API for in-place mutation
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
let mut map = HashMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
// ^^^^ find or create slot
// ^^^^^^^^^ default value if new
// ^^^ dereference to increment
}
map
}
// Functional style: fold over tokens
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
tokenize(sentence).into_iter().fold(
HashMap::new(),
|mut map, word| {
*map.entry(word).or_insert(0) += 1;
map
},
)
}
// Top N words by frequency โ sort by count descending
pub fn top_n(map: &HashMap<String, usize>, n: usize) -> Vec<(&str, usize)> {
let mut pairs: Vec<_> = map.iter().map(|(k, &v)| (k.as_str(), v)).collect();
pairs.sort_by(|a, b| b.1.cmp(&a.1));
pairs.truncate(n);
pairs
}
Use `BTreeMap` instead of `HashMap` if you need deterministic ordering (e.g., for tests or sorted output). Same API, O(log n) per operation instead of O(1) amortized.
What This Unlocks
- Text analysis: frequency analysis, n-gram counts, top-k words.
- Histogram building: any key-counting pattern uses the same `entry().or_insert(0) += 1` idiom.
- Caching / memoization: `entry().or_insert_with(|| expensive_computation())` for lazy population.
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| Map type | `Map.Make(String)` (immutable) | `HashMap<String, V>` (mutable) |
| Insert / update | `Map.add k v m` (returns new map) | `map.entry(k).or_insert(v)` (mutates) |
| Complexity | O(log n) tree | O(1) amortized hash |
| Sorted output | Natural (BST) | `BTreeMap` or sort manually |
| Fold to map | `List.fold_left` | `.fold(HashMap::new(), ...)` |
/// Word Count with Map
///
/// Building a word-frequency map from text. Demonstrates string
/// normalization, splitting, and folding into a map.
use std::collections::HashMap;
use std::collections::BTreeMap;
/// Tokenize: lowercase and extract alphanumeric words.
pub fn tokenize(s: &str) -> Vec<String> {
let s = s.to_lowercase();
let mut words = Vec::new();
let mut buf = String::new();
for c in s.chars() {
if c.is_alphanumeric() {
buf.push(c);
} else if !buf.is_empty() {
words.push(buf.clone());
buf.clear();
}
}
if !buf.is_empty() {
words.push(buf);
}
words
}
/// Word count using HashMap โ O(1) average lookup.
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
let mut map = HashMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
/// Word count using iterator fold โ more functional style.
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
tokenize(sentence)
.into_iter()
.fold(HashMap::new(), |mut map, word| {
*map.entry(word).or_insert(0) += 1;
map
})
}
/// Ordered word count using BTreeMap (like OCaml's Map).
pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
let mut map = BTreeMap::new();
for word in tokenize(sentence) {
*map.entry(word).or_insert(0) += 1;
}
map
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_basic() {
let m = word_count("the cat sat on the mat");
assert_eq!(m.get("the"), Some(&2));
assert_eq!(m.get("cat"), Some(&1));
}
#[test]
fn test_punctuation() {
let m = word_count("the cat sat on the mat, the cat sat");
assert_eq!(m.get("the"), Some(&3));
assert_eq!(m.get("cat"), Some(&2));
assert_eq!(m.get("sat"), Some(&2));
}
#[test]
fn test_case_insensitive() {
let m = word_count("The THE the");
assert_eq!(m.get("the"), Some(&3));
}
#[test]
fn test_empty() {
let m = word_count("");
assert!(m.is_empty());
}
#[test]
fn test_single_word() {
let m = word_count("hello");
assert_eq!(m.get("hello"), Some(&1));
assert_eq!(m.len(), 1);
}
#[test]
fn test_fold_matches() {
let s = "the cat sat on the mat";
assert_eq!(word_count(s), word_count_fold(s));
}
#[test]
fn test_ordered() {
let m = word_count_ordered("banana apple cherry apple");
let keys: Vec<_> = m.keys().collect();
assert_eq!(keys, vec!["apple", "banana", "cherry"]);
}
}
fn main() {
println!("{:?}", m.get("the"), Some(&2));
println!("{:?}", m.get("cat"), Some(&1));
println!("{:?}", m.get("the"), Some(&3));
}
module StringMap = Map.Make(String)
let tokenize s =
let s = String.lowercase_ascii s in
let words = ref [] and buf = Buffer.create 16 in
let flush () =
if Buffer.length buf > 0 then begin
words := Buffer.contents buf :: !words;
Buffer.clear buf
end
in
String.iter (fun c ->
if (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') then
Buffer.add_char buf c
else flush ()
) s;
flush ();
List.rev !words
let word_count sentence =
tokenize sentence
|> List.fold_left (fun m w ->
let n = Option.value ~default:0 (StringMap.find_opt w m) in
StringMap.add w (n + 1) m)
StringMap.empty
let () =
let m = word_count "the cat sat on the mat, the cat sat" in
assert (StringMap.find "the" m = 3);
assert (StringMap.find "cat" m = 2);
assert (StringMap.find "sat" m = 2);
print_endline "All assertions passed."
๐ Detailed Comparison
Word Count with Map: OCaml vs Rust
The Core Insight
Word counting combines string processing with map operations โ a practical pattern that reveals how each language handles mutable vs immutable data structures. OCaml builds a new map on every insert; Rust mutates a HashMap in place. Both are correct; the performance characteristics differ significantly.OCaml Approach
OCaml uses `StringMap` (from `Map.Make(String)`) โ an immutable balanced BST. Each `StringMap.add` creates a new map sharing most of its structure with the old one. The fold accumulates by threading the map through each word. String tokenization uses a mutable `Buffer` and `ref` โ one of the few places where mutation is pragmatic in OCaml. `Option.value ~default:0` handles the missing-key case.Rust Approach
Rust's `HashMap` is a mutable hash table with O(1) amortized insert/lookup. The `entry` API (`map.entry(word).or_insert(0)`) is a Rust-specific ergonomic pattern that handles both insert and update in one call. The fold variant passes `mut map` through the closure. `BTreeMap` is available when ordered output is needed (like OCaml's `Map`). Tokenization uses iterators and `String::push` for building words.Side-by-Side
| Concept | OCaml | Rust |
|---|---|---|
| Map type | `StringMap` (immutable BST) | `HashMap` (mutable hash table) |
| Insert | `StringMap.add k v m` โ new map | `map.entry(k).or_insert(0) += 1` |
| Lookup | `StringMap.find_opt k m` | `map.get(&k)` |
| Ordered | Yes (BST) | `BTreeMap` for ordered, `HashMap` for fast |
| Tokenization | `Buffer` + `ref` (mutable) | `String::push` + `Vec` |
| Complexity | O(n log n) total | O(n) amortized total |
What Rust Learners Should Notice
- The `entry` API is uniquely powerful: it returns a mutable reference to the value (inserting a default if missing), avoiding double-lookup
- `*map.entry(word).or_insert(0) += 1` is the idiomatic one-liner for frequency counting in Rust
- `HashMap` is unordered; use `BTreeMap` if you need sorted keys (like OCaml's `Map`)
- OCaml's immutable map naturally supports persistent snapshots; Rust's mutable map is faster but requires explicit cloning for snapshots
- String normalization (`to_lowercase()`) returns a new `String` in Rust โ strings are always UTF-8, never mutated in place
Further Reading
- [The Rust Book โ Hash Maps](https://doc.rust-lang.org/book/ch08-03-hash-maps.html)
- [Exercism Word Count](https://exercism.org/tracks/ocaml/exercises/word-count)