487: String Interning
Difficulty: 2 Level: Intermediate Cache identical strings so you store each unique value once and compare by pointer, not content.The Problem This Solves
In compilers, parsers, and data pipelines you often encounter the same string thousands or millions of times: variable names, field names, log levels, city names. Allocating a fresh `String` for each occurrence wastes memory and makes equality checks O(n) in string length. String interning solves both problems. The intern table maps each unique string to a shared reference. The second time you see `"identifier"`, you get back the same pointer you got the first time. Equality check becomes pointer comparison โ O(1). Memory usage drops to one copy per unique string. This is how language runtimes store symbol tables, how databases store column names, and how many serialization formats handle repeated keys.The Intuition
A coin minting machine. Every request for a "quarter" returns the same physical coin from the vault, not a freshly minted one. Anyone holding a quarter knows it's the same value as any other quarter โ they don't need to inspect the details, just compare the reference. The vault only grows when a truly new denomination is requested.How It Works in Rust
1. Simple interner with `Arc<str>`:use std::collections::HashMap;
use std::sync::{Arc, Mutex};
struct Interner(Mutex<HashMap<String, Arc<str>>>);
impl Interner {
fn intern(&self, s: &str) -> Arc<str> {
let mut map = self.0.lock().unwrap();
if let Some(interned) = map.get(s) {
return interned.clone();
}
let arc: Arc<str> = s.into();
map.insert(s.to_string(), arc.clone());
arc
}
}
2. Compare interned strings by pointer:
let a = interner.intern("hello");
let b = interner.intern("hello");
assert!(Arc::ptr_eq(&a, &b)); // same allocation
3. Single-threaded โ swap `Mutex` for `RefCell` in non-concurrent contexts.
4. Typed symbols โ wrap `Arc<str>` in a newtype for type safety:
#[derive(Clone, PartialEq, Eq, Hash)]
struct Symbol(Arc<str>);
What This Unlocks
- O(1) equality โ pointer comparison replaces string scan; hash maps keyed on interned strings run faster.
- Memory efficiency โ a log processor that sees `"ERROR"` a million times stores it once.
- Symbol tables โ the foundation for variable binding in interpreters and compilers.
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| Intern mechanism | `String.intern` (stdlib!) | Manual `HashMap<String, Arc<str>>` |
| Reference type | `string` (already internable) | `Arc<str>` or newtype wrapper |
| Thread safety | GC-managed | `Mutex` around intern table |
| Equality check | `==` (may intern automatically) | `Arc::ptr_eq` for pointer comparison |
// 487. String interning pattern
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
pub struct Interner {
table: HashMap<String, Arc<str>>,
}
impl Interner {
pub fn new() -> Self { Interner { table: HashMap::new() } }
pub fn intern(&mut self, s: &str) -> Arc<str> {
if let Some(v) = self.table.get(s) { return Arc::clone(v); }
let arc: Arc<str> = Arc::from(s);
self.table.insert(s.to_string(), Arc::clone(&arc));
arc
}
pub fn len(&self) -> usize { self.table.len() }
}
// Thread-safe interner
pub struct SyncInterner(Mutex<Interner>);
impl SyncInterner {
pub fn new() -> Self { SyncInterner(Mutex::new(Interner::new())) }
pub fn intern(&self, s: &str) -> Arc<str> { self.0.lock().unwrap().intern(s) }
}
fn main() {
let mut interner = Interner::new();
let a = interner.intern("hello");
let b = interner.intern("world");
let c = interner.intern("hello"); // same as a
println!("a='{}' b='{}' c='{}'", a, b, c);
// O(1) equality: compare Arc pointers
println!("a==c: {}", Arc::ptr_eq(&a, &c)); // true
println!("a==b: {}", Arc::ptr_eq(&a, &b)); // false
println!("table size: {}", interner.len()); // 2
// Simulate many repeated strings
let words = vec!["foo","bar","baz","foo","bar","foo","qux","baz"];
let mut interner2 = Interner::new();
let interned: Vec<Arc<str>> = words.iter().map(|&w| interner2.intern(w)).collect();
println!("interned {} strings, {} unique", interned.len(), interner2.len());
// O(1) equality check
println!("words[0]==words[3]: {}", Arc::ptr_eq(&interned[0], &interned[3]));
}
#[cfg(test)]
mod tests {
use super::*;
#[test] fn test_intern_same() { let mut i=Interner::new(); let a=i.intern("hi"); let b=i.intern("hi"); assert!(Arc::ptr_eq(&a,&b)); }
#[test] fn test_intern_diff() { let mut i=Interner::new(); let a=i.intern("hi"); let b=i.intern("ho"); assert!(!Arc::ptr_eq(&a,&b)); }
#[test] fn test_intern_size() { let mut i=Interner::new(); i.intern("a"); i.intern("b"); i.intern("a"); assert_eq!(i.len(),2); }
}
(* 487. String interning โ OCaml *)
module StringSet = Set.Make(String)
(* Simple interner: returns the canonical string *)
let table : (string, string) Hashtbl.t = Hashtbl.create 64
let intern s =
match Hashtbl.find_opt table s with
| Some v -> v
| None -> Hashtbl.add table s s; s
let () =
let a = intern "hello" in
let b = intern "world" in
let c = intern "hello" in
Printf.printf "a=%s b=%s c=%s\n" a b c;
(* In OCaml, physical equality (==) checks pointer *)
Printf.printf "a==c: %b\n" (a == c); (* true: same string *)
Printf.printf "a==b: %b\n" (a == b); (* false *)
Printf.printf "intern table size: %d\n" (Hashtbl.length table)