πŸ¦€ Functional Rust

817: Aho-Corasick Multi-Pattern Matching Automaton

Difficulty: 5 Level: Master Search a text for all occurrences of multiple patterns simultaneously in O(|text| + Ξ£|patterns| + matches).

The Problem This Solves

NaΓ―ve multi-pattern search runs each pattern separately: O(n Γ— m) for n text characters and m total pattern length. Aho-Corasick builds a finite automaton from all patterns at once, then scans the text in a single pass β€” every character processed exactly once, regardless of how many patterns you're searching for. Use Aho-Corasick for network intrusion detection (matching thousands of attack signatures against a packet stream), antivirus scanning (matching malware signatures against file content), log analysis (finding any of a set of error patterns in a log), and DNA sequence search (matching multiple gene markers simultaneously). The classic grep -F (fixed-string multi-pattern) uses exactly this algorithm. The output is a list of `(pattern_id, end_position)` pairs for every match found. Overlapping matches across patterns are reported naturally.

The Intuition

Three phases: 1. Build a trie from all patterns. Each pattern path ends at a terminal node marked with the pattern's id. 2. Add failure links (like KMP's failure function, but for a trie). The failure link of node `v` points to the longest proper suffix of `v`'s string that is also a prefix of some pattern. Computed via BFS from the root. 3. Scan the text: follow trie edges on match, follow failure links on mismatch. At each position, also follow output links to report all patterns that end here (including patterns that are suffixes of other patterns). The result is a deterministic finite automaton where every state transition is O(1). Total construction: O(Ξ£|patterns| Γ— |alphabet|). Scan: O(|text| + matches). OCaml would represent this with a recursive trie node type and hash maps for children. Rust uses arena-allocated nodes with `HashMap<char, usize>` children β€” indices into a `Vec<Node>` instead of boxing.

How It Works in Rust

use std::collections::{HashMap, VecDeque};

struct AhoCorasick {
 goto: Vec<HashMap<char, usize>>, // goto[state][char] = next_state
 fail: Vec<usize>,                // failure link
 output: Vec<Vec<usize>>,         // output[state] = list of pattern ids ending here
}

impl AhoCorasick {
 fn build(patterns: &[&str]) -> Self {
     let mut goto: Vec<HashMap<char, usize>> = vec![HashMap::new()];
     let mut output: Vec<Vec<usize>> = vec![vec![]];

     // Phase 1: build trie
     for (pid, pat) in patterns.iter().enumerate() {
         let mut state = 0;
         for ch in pat.chars() {
             let next = goto[state].get(&ch).copied();
             state = next.unwrap_or_else(|| {
                 let new = goto.len();
                 goto.push(HashMap::new());
                 output.push(vec![]);
                 goto[state].insert(ch, new);
                 new
             });
         }
         output[state].push(pid); // pattern pid ends at this state
     }

     // Phase 2: BFS to set failure links
     let n = goto.len();
     let mut fail = vec![0usize; n];
     let mut queue = VecDeque::new();

     // Root's children fail back to root
     for (&ch, &s) in &goto[0] {
         fail[s] = 0;
         queue.push_back(s);
     }

     while let Some(r) = queue.pop_front() {
         for (&ch, &s) in goto[r].clone().iter() {
             // Failure link of s: follow r's failure chain until we find
             // a state with a goto on ch, or fall back to root
             let mut f = fail[r];
             while f != 0 && !goto[f].contains_key(&ch) { f = fail[f]; }
             fail[s] = if goto[f].contains_key(&ch) && goto[f][&ch] != s {
                 goto[f][&ch]
             } else { 0 };
             // Merge output: patterns at fail[s] also match here
             let inherited = output[fail[s]].clone();
             output[s].extend(inherited);
             queue.push_back(s);
         }
     }
     AhoCorasick { goto, fail, output }
 }

 fn search(&self, text: &str) -> Vec<(usize, usize)> {
     let mut state = 0;
     let mut matches = vec![];
     for (i, ch) in text.chars().enumerate() {
         // Follow failure links until we find a goto or reach root
         while state != 0 && !self.goto[state].contains_key(&ch) {
             state = self.fail[state];
         }
         if let Some(&next) = self.goto[state].get(&ch) { state = next; }
         // Report all patterns ending at current position
         for &pid in &self.output[state] { matches.push((pid, i)); }
     }
     matches
 }
}
The arena `Vec<HashMap<char, usize>>` pattern avoids pointer chasing through boxed nodes. `goto[r].clone().iter()` during BFS is needed because we borrow `goto` mutably elsewhere β€” in a production implementation, separate the construction phase to avoid this clone.

What This Unlocks

Key Differences

ConceptOCamlRust
Trie nodeRecursive variant type with `Hashtbl` children`Vec<HashMap<char, usize>>` β€” arena, no boxing
Node allocation`ref` cells or GC-managed heapIndex into `Vec` β€” explicit, cache-local
BFS queue`Queue.t``VecDeque<usize>`
Output mergingList concatenation`Vec::extend` β€” in-place
Clone during BFSNot needed (GC copies)`goto[r].clone()` to avoid borrow conflict during mutation
// Aho-Corasick Multi-Pattern Matching β€” O(Ξ£|pats| + n + matches)
use std::collections::{HashMap, VecDeque};

#[derive(Default, Clone)]
struct AcNode {
    children: HashMap<u8, usize>,
    fail:     usize,
    output:   Vec<usize>, // pattern indices ending here
}

struct AhoCorasick {
    nodes:    Vec<AcNode>,
    patterns: Vec<String>,
}

impl AhoCorasick {
    fn build(patterns: Vec<&str>) -> Self {
        let pats: Vec<String> = patterns.iter().map(|s| s.to_string()).collect();
        let mut nodes = vec![AcNode::default()];

        // Build trie
        for (pid, pat) in pats.iter().enumerate() {
            let mut cur = 0;
            for &c in pat.as_bytes() {
                if !nodes[cur].children.contains_key(&c) {
                    let nid = nodes.len();
                    nodes.push(AcNode::default());
                    nodes[cur].children.insert(c, nid);
                }
                cur = nodes[cur].children[&c];
            }
            nodes[cur].output.push(pid);
        }

        // BFS for failure links
        let mut queue = VecDeque::new();
        let root_children: Vec<(u8, usize)> = nodes[0].children.iter().map(|(&c,&n)|(c,n)).collect();
        for (_, child) in root_children { nodes[child].fail = 0; queue.push_back(child); }

        while let Some(u) = queue.pop_front() {
            // Collect children first to avoid borrow issues
            let children: Vec<(u8, usize)> = nodes[u].children.iter().map(|(&c,&n)|(c,n)).collect();
            for (c, v) in children {
                // failure link for v
                let mut f = nodes[u].fail;
                loop {
                    if let Some(&nf) = nodes[f].children.get(&c) {
                        if nf != v { nodes[v].fail = nf; break; }
                    }
                    if f == 0 { nodes[v].fail = 0; break; }
                    f = nodes[f].fail;
                }
                // Merge output
                let fail_output = nodes[nodes[v].fail].output.clone();
                nodes[v].output.extend(fail_output);
                queue.push_back(v);
            }
        }

        AhoCorasick { nodes, patterns: pats }
    }

    fn search(&self, text: &str) -> Vec<(usize, usize)> {
        let mut state   = 0;
        let mut matches = Vec::new();
        for (i, &c) in text.as_bytes().iter().enumerate() {
            loop {
                if let Some(&next) = self.nodes[state].children.get(&c) {
                    state = next; break;
                }
                if state == 0 { break; }
                state = self.nodes[state].fail;
            }
            for &pid in &self.nodes[state].output {
                let start = i + 1 - self.patterns[pid].len();
                matches.push((start, pid));
            }
        }
        matches
    }
}

fn main() {
    let patterns = vec!["he", "she", "his", "hers"];
    let text     = "ushers";
    let ac       = AhoCorasick::build(patterns.clone());
    let matches  = ac.search(text);

    println!("Text: {text:?}");
    println!("Patterns: {:?}", patterns);
    for (pos, pid) in &matches {
        println!("  Found {:?} at position {pos}", patterns[*pid]);
    }
}
(* Aho-Corasick Multi-Pattern Matching β€” O(Ξ£|patterns| + n + matches) *)

let alpha = 256

type node = {
  children : int array;
  mutable fail : int;
  mutable output : int list;  (* pattern indices ending at this node *)
}

let make_node () = { children = Array.make alpha (-1); fail = 0; output = [] }

let build_automaton patterns =
  let nodes = ref [| make_node () |] in  (* root at index 0 *)

  (* Insert all patterns into trie *)
  List.iteri (fun pid pat ->
    let cur = ref 0 in
    String.iter (fun c ->
      let ci = Char.code c in
      if !nodes.(!cur).children.(ci) = -1 then begin
        !nodes.(!cur).children.(ci) <- Array.length !nodes;
        nodes := Array.append !nodes [| make_node () |]
      end;
      cur := !nodes.(!cur).children.(ci)
    ) pat;
    !nodes.(!cur).output <- pid :: !nodes.(!cur).output
  ) patterns;

  (* BFS to compute failure links *)
  let q = Queue.create () in
  for c = 0 to alpha - 1 do
    let ch = !nodes.(0).children.(c) in
    if ch = -1 then !nodes.(0).children.(c) <- 0
    else begin !nodes.(ch).fail <- 0; Queue.push ch q end
  done;
  while not (Queue.is_empty q) do
    let u = Queue.pop q in
    !nodes.(u).output <- !nodes.(u).output @
                         !nodes.(!nodes.(u).fail).output;
    for c = 0 to alpha - 1 do
      let v = !nodes.(u).children.(c) in
      if v = -1 then
        !nodes.(u).children.(c) <- !nodes.(!nodes.(u).fail).children.(c)
      else begin
        !nodes.(v).fail <- !nodes.(!nodes.(u).fail).children.(c);
        Queue.push v q
      end
    done
  done;
  !nodes

let search_ac nodes text patterns =
  let state   = ref 0 in
  let matches = ref [] in
  String.iteri (fun i c ->
    let ci = Char.code c in
    state := nodes.(!state).children.(ci);
    List.iter (fun pid ->
      let pat = List.nth patterns pid in
      matches := (i - String.length pat + 1, pid) :: !matches
    ) nodes.(!state).output
  ) text;
  List.rev !matches

let () =
  let patterns = ["he"; "she"; "his"; "hers"] in
  let text     = "ushers" in
  let nodes    = build_automaton patterns in
  let matches  = search_ac nodes text patterns in
  Printf.printf "Text: %S\n" text;
  Printf.printf "Patterns: [%s]\n" (String.concat "; " patterns);
  List.iter (fun (pos, pid) ->
    Printf.printf "  Found %S at position %d\n" (List.nth patterns pid) pos
  ) matches