๐Ÿฆ€ Functional Rust

073: Word Count with Map

Difficulty: 3 Level: Intermediate Build a word-frequency map from text โ€” tokenize, normalize, and fold into a `HashMap` in one pipeline.

The Problem This Solves

Word frequency is a foundational text-processing task: search engines, spell checkers, content analysis, and compression algorithms all start here. You need to split text into words, normalize them (lowercase, strip punctuation), and count occurrences. Doing this naively requires multiple loops and mutable bookkeeping. The functional approach chains tokenization and accumulation into a clean pipeline โ€” and choosing between `HashMap` (fast) and `BTreeMap` (sorted) is just a type swap.

The Intuition

The OCaml pattern uses `StringMap.add` with functional update: each word addition returns a new map. Rust's `HashMap` uses `.entry().or_insert(0)` to mutate in place โ€” more efficient but slightly more verbose. The `entry` API is the idiomatic Rust idiom for "insert if missing, then update." It's one atomic operation: find the slot, create it with a default if empty, return a mutable reference to the value. Python equivalent: `counts[word] = counts.get(word, 0) + 1`. Rust: `*map.entry(word).or_insert(0) += 1`.

How It Works in Rust

use std::collections::HashMap;

// Tokenize: lowercase everything, extract alphanumeric words
pub fn tokenize(s: &str) -> Vec<String> {
 let s = s.to_lowercase();
 let mut words = Vec::new();
 let mut buf = String::new();

 for c in s.chars() {
     if c.is_alphanumeric() {
         buf.push(c);
     } else if !buf.is_empty() {
         words.push(buf.clone());
         buf.clear();
     }
 }
 if !buf.is_empty() { words.push(buf); } // flush last word
 words
}

// Imperative style: entry API for in-place mutation
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
 let mut map = HashMap::new();
 for word in tokenize(sentence) {
     *map.entry(word).or_insert(0) += 1;
     //   ^^^^ find or create slot
     //                ^^^^^^^^^ default value if new
     //                           ^^^ dereference to increment
 }
 map
}

// Functional style: fold over tokens
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
 tokenize(sentence).into_iter().fold(
     HashMap::new(),
     |mut map, word| {
         *map.entry(word).or_insert(0) += 1;
         map
     },
 )
}

// Top N words by frequency โ€” sort by count descending
pub fn top_n(map: &HashMap<String, usize>, n: usize) -> Vec<(&str, usize)> {
 let mut pairs: Vec<_> = map.iter().map(|(k, &v)| (k.as_str(), v)).collect();
 pairs.sort_by(|a, b| b.1.cmp(&a.1));
 pairs.truncate(n);
 pairs
}
Use `BTreeMap` instead of `HashMap` if you need deterministic ordering (e.g., for tests or sorted output). Same API, O(log n) per operation instead of O(1) amortized.

What This Unlocks

Key Differences

ConceptOCamlRust
Map type`Map.Make(String)` (immutable)`HashMap<String, V>` (mutable)
Insert / update`Map.add k v m` (returns new map)`map.entry(k).or_insert(v)` (mutates)
ComplexityO(log n) treeO(1) amortized hash
Sorted outputNatural (BST)`BTreeMap` or sort manually
Fold to map`List.fold_left``.fold(HashMap::new(), ...)`
/// Word Count with Map
///
/// Building a word-frequency map from text. Demonstrates string
/// normalization, splitting, and folding into a map.

use std::collections::HashMap;
use std::collections::BTreeMap;

/// Tokenize: lowercase and extract alphanumeric words.
pub fn tokenize(s: &str) -> Vec<String> {
    let s = s.to_lowercase();
    let mut words = Vec::new();
    let mut buf = String::new();

    for c in s.chars() {
        if c.is_alphanumeric() {
            buf.push(c);
        } else if !buf.is_empty() {
            words.push(buf.clone());
            buf.clear();
        }
    }
    if !buf.is_empty() {
        words.push(buf);
    }
    words
}

/// Word count using HashMap โ€” O(1) average lookup.
pub fn word_count(sentence: &str) -> HashMap<String, usize> {
    let mut map = HashMap::new();
    for word in tokenize(sentence) {
        *map.entry(word).or_insert(0) += 1;
    }
    map
}

/// Word count using iterator fold โ€” more functional style.
pub fn word_count_fold(sentence: &str) -> HashMap<String, usize> {
    tokenize(sentence)
        .into_iter()
        .fold(HashMap::new(), |mut map, word| {
            *map.entry(word).or_insert(0) += 1;
            map
        })
}

/// Ordered word count using BTreeMap (like OCaml's Map).
pub fn word_count_ordered(sentence: &str) -> BTreeMap<String, usize> {
    let mut map = BTreeMap::new();
    for word in tokenize(sentence) {
        *map.entry(word).or_insert(0) += 1;
    }
    map
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_basic() {
        let m = word_count("the cat sat on the mat");
        assert_eq!(m.get("the"), Some(&2));
        assert_eq!(m.get("cat"), Some(&1));
    }

    #[test]
    fn test_punctuation() {
        let m = word_count("the cat sat on the mat, the cat sat");
        assert_eq!(m.get("the"), Some(&3));
        assert_eq!(m.get("cat"), Some(&2));
        assert_eq!(m.get("sat"), Some(&2));
    }

    #[test]
    fn test_case_insensitive() {
        let m = word_count("The THE the");
        assert_eq!(m.get("the"), Some(&3));
    }

    #[test]
    fn test_empty() {
        let m = word_count("");
        assert!(m.is_empty());
    }

    #[test]
    fn test_single_word() {
        let m = word_count("hello");
        assert_eq!(m.get("hello"), Some(&1));
        assert_eq!(m.len(), 1);
    }

    #[test]
    fn test_fold_matches() {
        let s = "the cat sat on the mat";
        assert_eq!(word_count(s), word_count_fold(s));
    }

    #[test]
    fn test_ordered() {
        let m = word_count_ordered("banana apple cherry apple");
        let keys: Vec<_> = m.keys().collect();
        assert_eq!(keys, vec!["apple", "banana", "cherry"]);
    }
}

fn main() {
    println!("{:?}", m.get("the"), Some(&2));
    println!("{:?}", m.get("cat"), Some(&1));
    println!("{:?}", m.get("the"), Some(&3));
}
module StringMap = Map.Make(String)

let tokenize s =
  let s = String.lowercase_ascii s in
  let words = ref [] and buf = Buffer.create 16 in
  let flush () =
    if Buffer.length buf > 0 then begin
      words := Buffer.contents buf :: !words;
      Buffer.clear buf
    end
  in
  String.iter (fun c ->
    if (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') then
      Buffer.add_char buf c
    else flush ()
  ) s;
  flush ();
  List.rev !words

let word_count sentence =
  tokenize sentence
  |> List.fold_left (fun m w ->
       let n = Option.value ~default:0 (StringMap.find_opt w m) in
       StringMap.add w (n + 1) m)
     StringMap.empty

let () =
  let m = word_count "the cat sat on the mat, the cat sat" in
  assert (StringMap.find "the" m = 3);
  assert (StringMap.find "cat" m = 2);
  assert (StringMap.find "sat" m = 2);
  print_endline "All assertions passed."

๐Ÿ“Š Detailed Comparison

Word Count with Map: OCaml vs Rust

The Core Insight

Word counting combines string processing with map operations โ€” a practical pattern that reveals how each language handles mutable vs immutable data structures. OCaml builds a new map on every insert; Rust mutates a HashMap in place. Both are correct; the performance characteristics differ significantly.

OCaml Approach

OCaml uses `StringMap` (from `Map.Make(String)`) โ€” an immutable balanced BST. Each `StringMap.add` creates a new map sharing most of its structure with the old one. The fold accumulates by threading the map through each word. String tokenization uses a mutable `Buffer` and `ref` โ€” one of the few places where mutation is pragmatic in OCaml. `Option.value ~default:0` handles the missing-key case.

Rust Approach

Rust's `HashMap` is a mutable hash table with O(1) amortized insert/lookup. The `entry` API (`map.entry(word).or_insert(0)`) is a Rust-specific ergonomic pattern that handles both insert and update in one call. The fold variant passes `mut map` through the closure. `BTreeMap` is available when ordered output is needed (like OCaml's `Map`). Tokenization uses iterators and `String::push` for building words.

Side-by-Side

ConceptOCamlRust
Map type`StringMap` (immutable BST)`HashMap` (mutable hash table)
Insert`StringMap.add k v m` โ†’ new map`map.entry(k).or_insert(0) += 1`
Lookup`StringMap.find_opt k m``map.get(&k)`
OrderedYes (BST)`BTreeMap` for ordered, `HashMap` for fast
Tokenization`Buffer` + `ref` (mutable)`String::push` + `Vec`
ComplexityO(n log n) totalO(n) amortized total

What Rust Learners Should Notice

  • The `entry` API is uniquely powerful: it returns a mutable reference to the value (inserting a default if missing), avoiding double-lookup
  • `*map.entry(word).or_insert(0) += 1` is the idiomatic one-liner for frequency counting in Rust
  • `HashMap` is unordered; use `BTreeMap` if you need sorted keys (like OCaml's `Map`)
  • OCaml's immutable map naturally supports persistent snapshots; Rust's mutable map is faster but requires explicit cloning for snapshots
  • String normalization (`to_lowercase()`) returns a new `String` in Rust โ€” strings are always UTF-8, never mutated in place

Further Reading

  • [The Rust Book โ€” Hash Maps](https://doc.rust-lang.org/book/ch08-03-hash-maps.html)
  • [Exercism Word Count](https://exercism.org/tracks/ocaml/exercises/word-count)