๐Ÿฆ€ Functional Rust

815: Boyer-Moore-Horspool String Search

Difficulty: 4 Level: Advanced Sublinear average-case string search: skip whole pattern-lengths at a time using a precomputed bad-character table.

The Problem This Solves

KMP achieves O(n + m) by never backing up โ€” but it still examines every character of the text at least once. Boyer-Moore goes further: it can skip entire sections of text without examining them, achieving O(n/m) average case. For long patterns on large texts (e.g., searching source code for a 20-character identifier), this is a dramatic practical speedup. Boyer-Moore-Horspool is the simplified variant that keeps only the bad-character heuristic. It's easier to implement correctly and fast enough for the overwhelming majority of real use cases. The full Boyer-Moore adds the good-suffix rule for additional worst-case guarantees, but BMH is what you reach for in practice: file search tools like `grep` use Boyer-Moore variants internally. The trade-off: BMH has O(nร—m) worst case (repeated single-character patterns in repeated text), while KMP is always O(n + m). Choose BMH when patterns are long and the alphabet is large (English text, source code), choose KMP when patterns may be repetitive or worst-case guarantees matter.

The Intuition

Compare right-to-left within the window. When a mismatch occurs at the rightmost position, look at the text character there and shift the pattern so that character aligns with its last occurrence in the pattern. The shift table `skip[c]` is built once in O(m): for each pattern character at position `i` (not the last), `skip[c] = m - 1 - i`. The default shift for characters not in the pattern is `m` โ€” skip the entire pattern width. O(n/m) average because each shift advances by roughly `m/2` on random text. OCaml builds the same table with `Array.make 256 m`; Rust uses `[usize; 256]` on the stack for cache-friendly constant-time access.

How It Works in Rust

// Build the shift (bad-character) table in O(m)
fn build_shift(pattern: &[u8]) -> [usize; 256] {
 let m = pattern.len();
 let mut shift = [m; 256];  // Default: shift by full pattern length
 // All characters except the last get a shift based on their rightmost position
 for i in 0..m.saturating_sub(1) {
     shift[pattern[i] as usize] = m - 1 - i;
 }
 shift
}

fn bmh_search(text: &str, pattern: &str) -> Vec<usize> {
 let (t, p) = (text.as_bytes(), pattern.as_bytes());
 let (n, m) = (t.len(), p.len());
 if m == 0 || m > n { return vec![]; }

 let shift = build_shift(p);
 let mut matches = Vec::new();
 let mut pos = 0;

 while pos + m <= n {
     let mut j = m;
     // Compare right-to-left: first mismatch triggers the shift
     while j > 0 && p[j - 1] == t[pos + j - 1] { j -= 1; }
     if j == 0 { matches.push(pos); }
     // Always shift by the bad-character value of the rightmost text char
     pos += shift[t[pos + m - 1] as usize];
 }
 matches
}
The `[usize; 256]` stack array is a key Rust performance idiom: fixed-size, no heap allocation, fits in L1 cache.

What This Unlocks

Key Differences

ConceptOCamlRust
Shift table`Array.make 256 m``[usize; 256]` โ€” stack-allocated, no heap
Char to index`Char.code c``c as usize`
Right-to-left compare`for j = m-1 downto 0``while j > 0 { j -= 1; }`
Shift on mismatch`shift.(Char.code text.[pos+m-1])``shift[t[pos + m - 1] as usize]`
Pattern not foundReturns empty listReturns empty `Vec`
// Boyer-Moore-Horspool string search โ€” sublinear average case

fn build_shift(pattern: &[u8]) -> [usize; 256] {
    let m = pattern.len();
    let mut shift = [m; 256];
    for i in 0..m.saturating_sub(1) {
        shift[pattern[i] as usize] = m - 1 - i;
    }
    shift
}

fn bmh_search(text: &str, pattern: &str) -> Vec<usize> {
    let t = text.as_bytes();
    let p = pattern.as_bytes();
    let (n, m) = (t.len(), p.len());
    if m == 0 || m > n { return vec![]; }

    let shift   = build_shift(p);
    let mut matches = Vec::new();
    let mut pos = 0;

    while pos + m <= n {
        // Compare right-to-left
        let mut j = m;
        while j > 0 && p[j - 1] == t[pos + j - 1] { j -= 1; }
        if j == 0 { matches.push(pos); }
        pos += shift[t[pos + m - 1] as usize];
    }
    matches
}

fn main() {
    let text = "ABAAABCDABABCABAB";
    let pat  = "ABAB";
    println!("Text:    {text:?}");
    println!("Pattern: {pat:?}");
    println!("Matches: {:?}", bmh_search(text, pat));

    println!("\n\"aaa\" in \"aaaaaaaaaa\": {:?}", bmh_search("aaaaaaaaaa", "aaa"));
    println!("No match: {:?}", bmh_search("abcdef", "xyz"));
}
(* Boyer-Moore-Horspool string search *)

let build_shift pattern =
  let m     = String.length pattern in
  let shift = Array.make 256 m in  (* default: shift by full pattern length *)
  for i = 0 to m - 2 do
    shift.(Char.code pattern.[i]) <- m - 1 - i
  done;
  shift

let bmh_search text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if m = 0 then []
  else if m > n then []
  else begin
    let shift   = build_shift pattern in
    let matches = ref [] in
    let pos     = ref 0 in
    while !pos <= n - m do
      (* Compare right to left *)
      let j = ref (m - 1) in
      while !j >= 0 && pattern.[!j] = text.[!pos + !j] do
        decr j
      done;
      if !j < 0 then
        matches := !pos :: !matches;
      pos := !pos + shift.(Char.code text.[!pos + m - 1])
    done;
    List.rev !matches
  end

let () =
  let text = "ABAAABCDABABCABAB" in
  let pat  = "ABAB" in
  Printf.printf "Text:    %S\n" text;
  Printf.printf "Pattern: %S\n" pat;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text pat)));

  let text2 = "aaaaaaaaaa" in
  let pat2  = "aaa" in
  Printf.printf "\nText: %S  Pattern: %S\n" text2 pat2;
  Printf.printf "Matches: [%s]\n"
    (String.concat "; " (List.map string_of_int (bmh_search text2 pat2)))