817: Aho-Corasick Multi-Pattern Matching Automaton
Difficulty: 5 Level: Master Search a text for all occurrences of multiple patterns simultaneously in O(|text| + Ξ£|patterns| + matches).The Problem This Solves
NaΓ―ve multi-pattern search runs each pattern separately: O(n Γ m) for n text characters and m total pattern length. Aho-Corasick builds a finite automaton from all patterns at once, then scans the text in a single pass β every character processed exactly once, regardless of how many patterns you're searching for. Use Aho-Corasick for network intrusion detection (matching thousands of attack signatures against a packet stream), antivirus scanning (matching malware signatures against file content), log analysis (finding any of a set of error patterns in a log), and DNA sequence search (matching multiple gene markers simultaneously). The classic grep -F (fixed-string multi-pattern) uses exactly this algorithm. The output is a list of `(pattern_id, end_position)` pairs for every match found. Overlapping matches across patterns are reported naturally.The Intuition
Three phases: 1. Build a trie from all patterns. Each pattern path ends at a terminal node marked with the pattern's id. 2. Add failure links (like KMP's failure function, but for a trie). The failure link of node `v` points to the longest proper suffix of `v`'s string that is also a prefix of some pattern. Computed via BFS from the root. 3. Scan the text: follow trie edges on match, follow failure links on mismatch. At each position, also follow output links to report all patterns that end here (including patterns that are suffixes of other patterns). The result is a deterministic finite automaton where every state transition is O(1). Total construction: O(Ξ£|patterns| Γ |alphabet|). Scan: O(|text| + matches). OCaml would represent this with a recursive trie node type and hash maps for children. Rust uses arena-allocated nodes with `HashMap<char, usize>` children β indices into a `Vec<Node>` instead of boxing.How It Works in Rust
use std::collections::{HashMap, VecDeque};
struct AhoCorasick {
goto: Vec<HashMap<char, usize>>, // goto[state][char] = next_state
fail: Vec<usize>, // failure link
output: Vec<Vec<usize>>, // output[state] = list of pattern ids ending here
}
impl AhoCorasick {
fn build(patterns: &[&str]) -> Self {
let mut goto: Vec<HashMap<char, usize>> = vec![HashMap::new()];
let mut output: Vec<Vec<usize>> = vec![vec![]];
// Phase 1: build trie
for (pid, pat) in patterns.iter().enumerate() {
let mut state = 0;
for ch in pat.chars() {
let next = goto[state].get(&ch).copied();
state = next.unwrap_or_else(|| {
let new = goto.len();
goto.push(HashMap::new());
output.push(vec![]);
goto[state].insert(ch, new);
new
});
}
output[state].push(pid); // pattern pid ends at this state
}
// Phase 2: BFS to set failure links
let n = goto.len();
let mut fail = vec![0usize; n];
let mut queue = VecDeque::new();
// Root's children fail back to root
for (&ch, &s) in &goto[0] {
fail[s] = 0;
queue.push_back(s);
}
while let Some(r) = queue.pop_front() {
for (&ch, &s) in goto[r].clone().iter() {
// Failure link of s: follow r's failure chain until we find
// a state with a goto on ch, or fall back to root
let mut f = fail[r];
while f != 0 && !goto[f].contains_key(&ch) { f = fail[f]; }
fail[s] = if goto[f].contains_key(&ch) && goto[f][&ch] != s {
goto[f][&ch]
} else { 0 };
// Merge output: patterns at fail[s] also match here
let inherited = output[fail[s]].clone();
output[s].extend(inherited);
queue.push_back(s);
}
}
AhoCorasick { goto, fail, output }
}
fn search(&self, text: &str) -> Vec<(usize, usize)> {
let mut state = 0;
let mut matches = vec![];
for (i, ch) in text.chars().enumerate() {
// Follow failure links until we find a goto or reach root
while state != 0 && !self.goto[state].contains_key(&ch) {
state = self.fail[state];
}
if let Some(&next) = self.goto[state].get(&ch) { state = next; }
// Report all patterns ending at current position
for &pid in &self.output[state] { matches.push((pid, i)); }
}
matches
}
}
The arena `Vec<HashMap<char, usize>>` pattern avoids pointer chasing through boxed nodes. `goto[r].clone().iter()` during BFS is needed because we borrow `goto` mutably elsewhere β in a production implementation, separate the construction phase to avoid this clone.
What This Unlocks
- Network intrusion detection: Snort/Suricata use Aho-Corasick variants to match thousands of attack signatures against each packet.
- Antivirus / malware scanning: match a database of binary signatures in a single pass over file content.
- Bioinformatics: simultaneous search for multiple DNA motifs or protein domains in genome sequences.
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| Trie node | Recursive variant type with `Hashtbl` children | `Vec<HashMap<char, usize>>` β arena, no boxing |
| Node allocation | `ref` cells or GC-managed heap | Index into `Vec` β explicit, cache-local |
| BFS queue | `Queue.t` | `VecDeque<usize>` |
| Output merging | List concatenation | `Vec::extend` β in-place |
| Clone during BFS | Not needed (GC copies) | `goto[r].clone()` to avoid borrow conflict during mutation |
// Aho-Corasick Multi-Pattern Matching β O(Ξ£|pats| + n + matches)
use std::collections::{HashMap, VecDeque};
#[derive(Default, Clone)]
struct AcNode {
children: HashMap<u8, usize>,
fail: usize,
output: Vec<usize>, // pattern indices ending here
}
struct AhoCorasick {
nodes: Vec<AcNode>,
patterns: Vec<String>,
}
impl AhoCorasick {
fn build(patterns: Vec<&str>) -> Self {
let pats: Vec<String> = patterns.iter().map(|s| s.to_string()).collect();
let mut nodes = vec![AcNode::default()];
// Build trie
for (pid, pat) in pats.iter().enumerate() {
let mut cur = 0;
for &c in pat.as_bytes() {
if !nodes[cur].children.contains_key(&c) {
let nid = nodes.len();
nodes.push(AcNode::default());
nodes[cur].children.insert(c, nid);
}
cur = nodes[cur].children[&c];
}
nodes[cur].output.push(pid);
}
// BFS for failure links
let mut queue = VecDeque::new();
let root_children: Vec<(u8, usize)> = nodes[0].children.iter().map(|(&c,&n)|(c,n)).collect();
for (_, child) in root_children { nodes[child].fail = 0; queue.push_back(child); }
while let Some(u) = queue.pop_front() {
// Collect children first to avoid borrow issues
let children: Vec<(u8, usize)> = nodes[u].children.iter().map(|(&c,&n)|(c,n)).collect();
for (c, v) in children {
// failure link for v
let mut f = nodes[u].fail;
loop {
if let Some(&nf) = nodes[f].children.get(&c) {
if nf != v { nodes[v].fail = nf; break; }
}
if f == 0 { nodes[v].fail = 0; break; }
f = nodes[f].fail;
}
// Merge output
let fail_output = nodes[nodes[v].fail].output.clone();
nodes[v].output.extend(fail_output);
queue.push_back(v);
}
}
AhoCorasick { nodes, patterns: pats }
}
fn search(&self, text: &str) -> Vec<(usize, usize)> {
let mut state = 0;
let mut matches = Vec::new();
for (i, &c) in text.as_bytes().iter().enumerate() {
loop {
if let Some(&next) = self.nodes[state].children.get(&c) {
state = next; break;
}
if state == 0 { break; }
state = self.nodes[state].fail;
}
for &pid in &self.nodes[state].output {
let start = i + 1 - self.patterns[pid].len();
matches.push((start, pid));
}
}
matches
}
}
fn main() {
let patterns = vec!["he", "she", "his", "hers"];
let text = "ushers";
let ac = AhoCorasick::build(patterns.clone());
let matches = ac.search(text);
println!("Text: {text:?}");
println!("Patterns: {:?}", patterns);
for (pos, pid) in &matches {
println!(" Found {:?} at position {pos}", patterns[*pid]);
}
}
(* Aho-Corasick Multi-Pattern Matching β O(Ξ£|patterns| + n + matches) *)
let alpha = 256
type node = {
children : int array;
mutable fail : int;
mutable output : int list; (* pattern indices ending at this node *)
}
let make_node () = { children = Array.make alpha (-1); fail = 0; output = [] }
let build_automaton patterns =
let nodes = ref [| make_node () |] in (* root at index 0 *)
(* Insert all patterns into trie *)
List.iteri (fun pid pat ->
let cur = ref 0 in
String.iter (fun c ->
let ci = Char.code c in
if !nodes.(!cur).children.(ci) = -1 then begin
!nodes.(!cur).children.(ci) <- Array.length !nodes;
nodes := Array.append !nodes [| make_node () |]
end;
cur := !nodes.(!cur).children.(ci)
) pat;
!nodes.(!cur).output <- pid :: !nodes.(!cur).output
) patterns;
(* BFS to compute failure links *)
let q = Queue.create () in
for c = 0 to alpha - 1 do
let ch = !nodes.(0).children.(c) in
if ch = -1 then !nodes.(0).children.(c) <- 0
else begin !nodes.(ch).fail <- 0; Queue.push ch q end
done;
while not (Queue.is_empty q) do
let u = Queue.pop q in
!nodes.(u).output <- !nodes.(u).output @
!nodes.(!nodes.(u).fail).output;
for c = 0 to alpha - 1 do
let v = !nodes.(u).children.(c) in
if v = -1 then
!nodes.(u).children.(c) <- !nodes.(!nodes.(u).fail).children.(c)
else begin
!nodes.(v).fail <- !nodes.(!nodes.(u).fail).children.(c);
Queue.push v q
end
done
done;
!nodes
let search_ac nodes text patterns =
let state = ref 0 in
let matches = ref [] in
String.iteri (fun i c ->
let ci = Char.code c in
state := nodes.(!state).children.(ci);
List.iter (fun pid ->
let pat = List.nth patterns pid in
matches := (i - String.length pat + 1, pid) :: !matches
) nodes.(!state).output
) text;
List.rev !matches
let () =
let patterns = ["he"; "she"; "his"; "hers"] in
let text = "ushers" in
let nodes = build_automaton patterns in
let matches = search_ac nodes text patterns in
Printf.printf "Text: %S\n" text;
Printf.printf "Patterns: [%s]\n" (String.concat "; " patterns);
List.iter (fun (pos, pid) ->
Printf.printf " Found %S at position %d\n" (List.nth patterns pid) pos
) matches