154 Advanced

String Parser

Functional Programming

Tutorial

The Problem

Many grammars require matching fixed string literals: keywords like "true" and "false", operators like "+=", or delimiters like "<!--". Matching character by character would work but is verbose and error-prone for multi-character strings. The tag combinator matches an entire expected string at once, using str::starts_with for efficiency. Case-insensitive variants handle HTTP methods, SQL keywords, and HTML tag names where casing is not significant.

🎯 Learning Outcomes

• Implement tag (exact string match) and tag_no_case (case-insensitive match)

• Understand why string parsers return &'a str (a slice of the input) rather than String (an allocation)

• See how string parsers combine with other combinators to parse complex patterns

• Learn the lifetime relationships between the parser, input, and returned slice

Code Example

fn tag<'a>(expected: &str) -> Parser<'a, &'a str> {
    let expected_owned = expected.to_string();
    Box::new(move |input: &'a str| {
        if input.starts_with(&expected_owned) {
            let rest = &input[expected_owned.len()..];
            Ok((&input[..expected_owned.len()], rest))
        } else {
            Err(format!("Expected \"{}\"", expected_owned))
        }
    })
}

let tag (expected : string) : string parser = fun input ->
  let len = String.length expected in
  if String.length input >= len && String.sub input 0 len = expected then
    Ok (expected, String.sub input len (String.length input - len))
  else
    Error (Printf.sprintf "Expected \"%s\"" expected)

Key Differences

Zero-copy returns: Rust's tag returns a &'a str slice — no allocation; OCaml typically returns a new string (allocation under GC).

Lifetime tracking: Rust's 'a lifetime annotation enforces the validity relationship between input and returned slice; OCaml has no lifetime concept.

Case folding: tag_no_case in Rust must be careful with Unicode case folding (.to_lowercase() allocates); OCaml's String.lowercase_ascii handles only ASCII.

Error messages: Both should include the expected string in the error; convention differs between libraries.

OCaml Approach

OCaml's angstrom provides string : string -> string t and string_ci : string -> string t. Because OCaml strings are immutable byte sequences, string_ci may allocate for the lowercased comparison. The returned value is a newly allocated string, not a slice of the input — allocation is less of a concern under GC.

Full Source

#![allow(clippy::all)]
// Example 154: String Parser
// Parse exact string literals

type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

// ============================================================
// Approach 1: Parse exact string (tag)
// ============================================================

fn tag<'a>(expected: &str) -> Parser<'a, &'a str> {
    let expected_owned = expected.to_string();
    Box::new(move |input: &'a str| {
        if input.starts_with(&expected_owned) {
            let rest = &input[expected_owned.len()..];
            Ok((&input[..expected_owned.len()], rest))
        } else {
            Err(format!("Expected \"{}\"", expected_owned))
        }
    })
}

// ============================================================
// Approach 2: Case-insensitive string match
// ============================================================

fn tag_no_case<'a>(expected: &str) -> Parser<'a, &'a str> {
    let expected_lower = expected.to_lowercase();
    let len = expected.len();
    Box::new(move |input: &'a str| {
        if input.len() >= len && input[..len].to_lowercase() == expected_lower {
            Ok((&input[..len], &input[len..]))
        } else {
            Err(format!(
                "Expected \"{}\" (case insensitive)",
                expected_lower
            ))
        }
    })
}

// ============================================================
// Approach 3: Build from character-by-character matching
// ============================================================

fn string_from_chars<'a>(expected: &str) -> Parser<'a, String> {
    let expected = expected.to_string();
    Box::new(move |input: &'a str| {
        let mut remaining = input;
        for expected_char in expected.chars() {
            match remaining.chars().next() {
                Some(c) if c == expected_char => {
                    remaining = &remaining[c.len_utf8()..];
                }
                Some(c) => return Err(format!("Expected '{}', got '{}'", expected_char, c)),
                None => return Err(format!("Expected '{}', got EOF", expected_char)),
            }
        }
        Ok((expected.clone(), remaining))
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_tag_match() {
        let p = tag("hello");
        assert_eq!(p("hello world"), Ok(("hello", " world")));
    }

    #[test]
    fn test_tag_exact() {
        let p = tag("hello");
        assert_eq!(p("hello"), Ok(("hello", "")));
    }

    #[test]
    fn test_tag_no_match() {
        let p = tag("hello");
        assert!(p("world").is_err());
    }

    #[test]
    fn test_tag_too_short() {
        let p = tag("hello");
        assert!(p("hel").is_err());
    }

    #[test]
    fn test_tag_no_case_upper() {
        let p = tag_no_case("Hello");
        assert_eq!(p("HELLO world"), Ok(("HELLO", " world")));
    }

    #[test]
    fn test_tag_no_case_mixed() {
        let p = tag_no_case("hello");
        assert_eq!(p("HeLLo!"), Ok(("HeLLo", "!")));
    }

    #[test]
    fn test_string_from_chars() {
        let p = string_from_chars("abc");
        assert_eq!(p("abcdef"), Ok(("abc".to_string(), "def")));
    }

    #[test]
    fn test_string_from_chars_fail() {
        let p = string_from_chars("abc");
        assert!(p("axc").is_err());
    }

    #[test]
    fn test_tag_empty_string() {
        let p = tag("");
        assert_eq!(p("anything"), Ok(("", "anything")));
    }
}

(* Example 154: String Parser *)
(* Parse exact string literals *)

type 'a parse_result = ('a * string, string) result
type 'a parser = string -> 'a parse_result

(* Approach 1: Parse exact string *)
let tag (expected : string) : string parser = fun input ->
  let len = String.length expected in
  if String.length input >= len && String.sub input 0 len = expected then
    Ok (expected, String.sub input len (String.length input - len))
  else
    Error (Printf.sprintf "Expected \"%s\"" expected)

(* Approach 2: Case-insensitive string match *)
let tag_no_case (expected : string) : string parser = fun input ->
  let len = String.length expected in
  if String.length input >= len &&
     String.lowercase_ascii (String.sub input 0 len) = String.lowercase_ascii expected then
    Ok (String.sub input 0 len, String.sub input len (String.length input - len))
  else
    Error (Printf.sprintf "Expected \"%s\" (case insensitive)" expected)

(* Approach 3: Build string parser from char parsers *)
let char_parser (c : char) : char parser = fun input ->
  if String.length input > 0 && input.[0] = c then
    Ok (c, String.sub input 1 (String.length input - 1))
  else
    Error (Printf.sprintf "Expected '%c'" c)

let string_from_chars (s : string) : string parser = fun input ->
  let len = String.length s in
  let rec go i remaining =
    if i >= len then Ok (s, remaining)
    else
      match char_parser s.[i] remaining with
      | Ok (_, rest) -> go (i + 1) rest
      | Error e -> Error e
  in
  go 0 input

(* Tests *)
let () =
  assert (tag "hello" "hello world" = Ok ("hello", " world"));
  assert (tag "hello" "hello" = Ok ("hello", ""));
  assert (Result.is_error (tag "hello" "world"));
  assert (Result.is_error (tag "hello" "hel"));

  assert (tag_no_case "Hello" "HELLO world" = Ok ("HELLO", " world"));
  assert (tag_no_case "hello" "HeLLo!" = Ok ("HeLLo", "!"));

  assert (string_from_chars "abc" "abcdef" = Ok ("abc", "def"));
  assert (Result.is_error (string_from_chars "abc" "axc"));

  print_endline "✓ All tests passed"

✓ Tests Rust test suite

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_tag_match() {
        let p = tag("hello");
        assert_eq!(p("hello world"), Ok(("hello", " world")));
    }

    #[test]
    fn test_tag_exact() {
        let p = tag("hello");
        assert_eq!(p("hello"), Ok(("hello", "")));
    }

    #[test]
    fn test_tag_no_match() {
        let p = tag("hello");
        assert!(p("world").is_err());
    }

    #[test]
    fn test_tag_too_short() {
        let p = tag("hello");
        assert!(p("hel").is_err());
    }

    #[test]
    fn test_tag_no_case_upper() {
        let p = tag_no_case("Hello");
        assert_eq!(p("HELLO world"), Ok(("HELLO", " world")));
    }

    #[test]
    fn test_tag_no_case_mixed() {
        let p = tag_no_case("hello");
        assert_eq!(p("HeLLo!"), Ok(("HeLLo", "!")));
    }

    #[test]
    fn test_string_from_chars() {
        let p = string_from_chars("abc");
        assert_eq!(p("abcdef"), Ok(("abc".to_string(), "def")));
    }

    #[test]
    fn test_string_from_chars_fail() {
        let p = string_from_chars("abc");
        assert!(p("axc").is_err());
    }

    #[test]
    fn test_tag_empty_string() {
        let p = tag("");
        assert_eq!(p("anything"), Ok(("", "anything")));
    }
}

Deep Comparison

Comparison: Example 154 — String Parser

tag (exact match)

OCaml:

let tag (expected : string) : string parser = fun input ->
  let len = String.length expected in
  if String.length input >= len && String.sub input 0 len = expected then
    Ok (expected, String.sub input len (String.length input - len))
  else
    Error (Printf.sprintf "Expected \"%s\"" expected)

Rust:

fn tag<'a>(expected: &str) -> Parser<'a, &'a str> {
    let expected_owned = expected.to_string();
    Box::new(move |input: &'a str| {
        if input.starts_with(&expected_owned) {
            let rest = &input[expected_owned.len()..];
            Ok((&input[..expected_owned.len()], rest))
        } else {
            Err(format!("Expected \"{}\"", expected_owned))
        }
    })
}

tag_no_case

OCaml:

let tag_no_case (expected : string) : string parser = fun input ->
  let len = String.length expected in
  if String.length input >= len &&
     String.lowercase_ascii (String.sub input 0 len) = String.lowercase_ascii expected then
    Ok (String.sub input 0 len, String.sub input len (String.length input - len))
  else
    Error (Printf.sprintf "Expected \"%s\" (case insensitive)" expected)

Rust:

fn tag_no_case<'a>(expected: &str) -> Parser<'a, &'a str> {
    let expected_lower = expected.to_lowercase();
    let len = expected.len();
    Box::new(move |input: &'a str| {
        if input.len() >= len && input[..len].to_lowercase() == expected_lower {
            Ok((&input[..len], &input[len..]))
        } else {
            Err(format!("Expected \"{}\" (case insensitive)", expected_lower))
        }
    })
}

Exercises

Implement tag_byte_slice(expected: &[u8]) -> Parser<&[u8]> that works on raw bytes instead of UTF-8 strings.

Write a bool_parser() -> Parser<bool> using tag("true") and tag("false") combined with choice.

Implement keyword(kw: &str) -> Parser<&str> that matches a string followed by a non-identifier character (word boundary).

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.

Rust