๐Ÿฆ€ Functional Rust

162: Identifier Parser

Difficulty: โญโญโญ Level: Foundations Parse programming language identifiers โ€” and discover Rust's unique zero-copy `&str` trick along the way.

The Problem This Solves

Parsing identifiers is the first thing any language parser does. Before you can parse `let x = 42`, you need to recognize that `x` is an identifier (starts with a letter or `_`, continues with letters, digits, or `_`) and that `let` is a keyword (same syntax, but reserved). This example brings together everything from the series โ€” `satisfy`, `many0`, `map`, and a new Rust-specific technique: returning `&str` directly instead of a `String`. In most parsers, the identifier is returned as an owned `String`. But Rust lets you return a slice of the original input โ€” a `&'a str` that borrows from the input without any allocation. This is genuinely unique to Rust's ownership model.

The Intuition

An identifier rule is: start with `[a-zA-Z_]`, continue with zero or more `[a-zA-Z0-9_]`. That's `satisfy(start_pred)` followed by `many0(satisfy(continue_pred))`. Collect the results into a string, and you have an identifier parser. The zero-copy version is more interesting. Instead of building a `String` character by character, it scans the input and asks: "How many bytes from the start of this `&str` are valid identifier characters?" Then it returns `&input[..end]` โ€” a slice into the original string that shares its memory. Zero allocations. Reserved word filtering is a semantic layer on top. Syntactically, `let` is a valid identifier. Semantically, it's not โ€” your language reserves it. You parse the identifier first, then check: if it's in the reserved list, reject it with an error.

How It Works in Rust

Approach 1 โ€” allocating with `String`:
fn identifier<'a>() -> Parser<'a, String> {
 Box::new(|input: &'a str| {
     // First char: letter or underscore
     let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
     let (first, rest) = start(input)?;

     // Remaining chars: letter, digit, or underscore
     let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
     let (chars, rem) = cont(rest)?;

     // Build the String by collecting
     let mut s = String::with_capacity(1 + chars.len());
     s.push(first);
     for c in chars { s.push(c); }
     Ok((s, rem))
 })
}
// identifier()("hello world") = Ok(("hello", " world"))
// identifier()("_foo bar")    = Ok(("_foo", " bar"))
// identifier()("x1y2z3!")     = Ok(("x1y2z3", "!"))
// identifier()("123")         = Err โ€” can't start with digit
Approach 2 โ€” zero-copy with `&str` slice:
fn identifier_slice<'a>() -> Parser<'a, &'a str> {
 Box::new(|input: &'a str| {
     let mut chars = input.chars();
     match chars.next() {
         Some(c) if c.is_ascii_alphabetic() || c == '_' => {
             let mut end = c.len_utf8();  // start after first char
             for ch in chars {
                 if ch.is_ascii_alphanumeric() || ch == '_' {
                     end += ch.len_utf8();  // accumulate byte length
                 } else {
                     break;
                 }
             }
             // Return a slice of the original input โ€” zero allocation
             Ok((&input[..end], &input[end..]))
         }
         _ => Err("Expected identifier".to_string()),
     }
 })
}
// identifier_slice()("myVar = 5") = Ok(("myVar", " = 5"))
// The returned "myVar" is NOT a new String โ€” it's a pointer into the original input
The `end` variable accumulates the byte length of the identifier. `&input[..end]` is a zero-copy reference to that prefix. This is idiomatic Rust and not possible in garbage-collected languages. Approach 3 โ€” reserved word rejection:
fn identifier_not_reserved<'a>(reserved: &[&str]) -> Parser<'a, String> {
 let reserved: Vec<String> = reserved.iter().map(|s| s.to_string()).collect();
 Box::new(move |input: &'a str| {
     let (name, rest) = identifier()(input)?;  // parse syntactically valid identifier
     if reserved.iter().any(|r| r == &name) {
         Err(format!("'{}' is a reserved word", name))  // semantic rejection
     } else {
         Ok((name, rest))
     }
 })
}
// p("myVar") = Ok(("myVar", ""))
// p("let")   = Err("'let' is a reserved word")

What This Unlocks

Key Differences

ConceptOCamlRust
String building`String.make 1 c ^ String.init n (fun i -> ...)``String::new()` + `push`
Zero-copy outputNot idiomatic (strings are GC values)`&'a str` โ€” a slice into the original input, no allocation
Reserved word check`List.mem name reserved``reserved.iter().any(\r\r == &name)`
Char classificationManual: `c >= 'a' && c <= 'z' \\...`Built-in: `is_ascii_alphabetic()`, `is_ascii_alphanumeric()`
Byte advancement`String.length c` (always 1 in OCaml bytes)`c.len_utf8()` โ€” correct for multi-byte Unicode
// Example 162: Identifier Parser
// Parse identifiers: letter followed by alphanumeric/underscore

type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;

fn satisfy<'a, F>(pred: F, desc: &str) -> Parser<'a, char>
where F: Fn(char) -> bool + 'a {
    let desc = desc.to_string();
    Box::new(move |input: &'a str| match input.chars().next() {
        Some(c) if pred(c) => Ok((c, &input[c.len_utf8()..])),
        _ => Err(format!("Expected {}", desc)),
    })
}

fn many0<'a, T: 'a>(p: Parser<'a, T>) -> Parser<'a, Vec<T>> {
    Box::new(move |mut input: &'a str| {
        let mut v = Vec::new();
        while let Ok((val, r)) = p(input) { v.push(val); input = r; }
        Ok((v, input))
    })
}

// ============================================================
// Approach 1: Direct identifier parser
// ============================================================

fn identifier<'a>() -> Parser<'a, String> {
    Box::new(|input: &'a str| {
        let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
        let (first, rest) = start(input)?;
        let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
        let (chars, rem) = cont(rest)?;
        let mut s = String::with_capacity(1 + chars.len());
        s.push(first);
        for c in chars { s.push(c); }
        Ok((s, rem))
    })
}

// ============================================================
// Approach 2: Zero-copy identifier (returns &str slice)
// ============================================================

fn identifier_slice<'a>() -> Parser<'a, &'a str> {
    Box::new(|input: &'a str| {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) if c.is_ascii_alphabetic() || c == '_' => {
                let mut end = c.len_utf8();
                for ch in chars {
                    if ch.is_ascii_alphanumeric() || ch == '_' {
                        end += ch.len_utf8();
                    } else {
                        break;
                    }
                }
                Ok((&input[..end], &input[end..]))
            }
            _ => Err("Expected identifier".to_string()),
        }
    })
}

// ============================================================
// Approach 3: With reserved word checking
// ============================================================

fn identifier_not_reserved<'a>(reserved: &[&str]) -> Parser<'a, String> {
    let reserved: Vec<String> = reserved.iter().map(|s| s.to_string()).collect();
    Box::new(move |input: &'a str| {
        let (name, rest) = identifier()(input)?;
        if reserved.iter().any(|r| r == &name) {
            Err(format!("'{}' is a reserved word", name))
        } else {
            Ok((name, rest))
        }
    })
}

fn main() {
    println!("=== identifier ===");
    let p = identifier();
    println!("{:?}", p("hello world"));  // Ok(("hello", " world"))
    println!("{:?}", p("_foo bar"));     // Ok(("_foo", " bar"))
    println!("{:?}", p("x1y2z3!"));      // Ok(("x1y2z3", "!"))
    println!("{:?}", p("123"));          // Err

    println!("\n=== identifier_slice (zero-copy) ===");
    let p = identifier_slice();
    println!("{:?}", p("myVar = 5"));    // Ok(("myVar", " = 5"))

    println!("\n=== not reserved ===");
    let p = identifier_not_reserved(&["let", "if", "else", "fn"]);
    println!("{:?}", p("myVar"));  // Ok
    println!("{:?}", p("let"));    // Err

    println!("\nโœ“ All examples completed");
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_identifier_simple() {
        assert_eq!(identifier()("hello world"), Ok(("hello".into(), " world")));
    }

    #[test]
    fn test_identifier_underscore_start() {
        assert_eq!(identifier()("_foo bar"), Ok(("_foo".into(), " bar")));
    }

    #[test]
    fn test_identifier_mixed() {
        assert_eq!(identifier()("x1y2z3!"), Ok(("x1y2z3".into(), "!")));
    }

    #[test]
    fn test_identifier_single_underscore() {
        assert_eq!(identifier()("_"), Ok(("_".into(), "")));
    }

    #[test]
    fn test_identifier_starts_with_digit() {
        assert!(identifier()("123").is_err());
    }

    #[test]
    fn test_identifier_slice() {
        assert_eq!(identifier_slice()("myVar = 5"), Ok(("myVar", " = 5")));
    }

    #[test]
    fn test_not_reserved_ok() {
        let p = identifier_not_reserved(&["let", "if"]);
        assert_eq!(p("myVar"), Ok(("myVar".into(), "")));
    }

    #[test]
    fn test_not_reserved_blocked() {
        let p = identifier_not_reserved(&["let", "if"]);
        assert!(p("let").is_err());
    }

    #[test]
    fn test_identifier_empty() {
        assert!(identifier()("").is_err());
    }
}
(* Example 162: Identifier Parser *)
(* Parse identifiers: letter followed by alphanumeric/underscore *)

type 'a parse_result = ('a * string, string) result
type 'a parser = string -> 'a parse_result

let satisfy pred desc : char parser = fun input ->
  if String.length input > 0 && pred input.[0] then
    Ok (input.[0], String.sub input 1 (String.length input - 1))
  else Error (Printf.sprintf "Expected %s" desc)

let many0 p : 'a list parser = fun input ->
  let rec go acc r = match p r with Ok (v, r') -> go (v::acc) r' | Error _ -> Ok (List.rev acc, r)
  in go [] input

let map f p : 'b parser = fun input ->
  match p input with Ok (v, r) -> Ok (f v, r) | Error e -> Error e

let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'
let is_ident_start c = is_letter c || c = '_'
let is_ident_char c = is_letter c || is_digit c || c = '_'

(* Approach 1: Direct implementation *)
let identifier : string parser = fun input ->
  match satisfy is_ident_start "letter or _" input with
  | Error e -> Error e
  | Ok (first, rest) ->
    match many0 (satisfy is_ident_char "alphanumeric or _") rest with
    | Ok (chars, rem) ->
      let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
      Ok (s, rem)
    | Error e -> Error e

(* Approach 2: Using combinators *)
let identifier2 : string parser =
  let start = satisfy is_ident_start "letter or _" in
  let cont = many0 (satisfy is_ident_char "alphanumeric or _") in
  fun input ->
    match start input with
    | Error e -> Error e
    | Ok (first, rest) ->
      match cont rest with
      | Ok (chars, rem) ->
        let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
        Ok (s, rem)
      | Error e -> Error e

(* Approach 3: With reserved word checking *)
let reserved_words = ["let"; "in"; "if"; "then"; "else"; "match"; "with"; "fun"]

let identifier_not_reserved : string parser = fun input ->
  match identifier input with
  | Ok (name, rest) when not (List.mem name reserved_words) -> Ok (name, rest)
  | Ok (name, _) -> Error (Printf.sprintf "'%s' is a reserved word" name)
  | Error e -> Error e

(* Tests *)
let () =
  assert (identifier "hello world" = Ok ("hello", " world"));
  assert (identifier "_foo bar" = Ok ("_foo", " bar"));
  assert (identifier "x1y2z3!" = Ok ("x1y2z3", "!"));
  assert (identifier "_" = Ok ("_", ""));
  assert (Result.is_error (identifier "123"));
  assert (Result.is_error (identifier ""));

  assert (identifier_not_reserved "myVar" = Ok ("myVar", ""));
  assert (Result.is_error (identifier_not_reserved "let"));
  assert (Result.is_error (identifier_not_reserved "if"));

  print_endline "โœ“ All tests passed"

๐Ÿ“Š Detailed Comparison

Comparison: Example 162 โ€” Identifier Parser

Allocating identifier

OCaml:

๐Ÿช Show OCaml equivalent
let identifier : string parser = fun input ->
match satisfy is_ident_start "letter or _" input with
| Error e -> Error e
| Ok (first, rest) ->
 match many0 (satisfy is_ident_char "alphanumeric or _") rest with
 | Ok (chars, rem) ->
   let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
   Ok (s, rem)
 | Error e -> Error e

Rust:

fn identifier<'a>() -> Parser<'a, String> {
 Box::new(|input: &'a str| {
     let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
     let (first, rest) = start(input)?;
     let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
     let (chars, rem) = cont(rest)?;
     let mut s = String::with_capacity(1 + chars.len());
     s.push(first);
     for c in chars { s.push(c); }
     Ok((s, rem))
 })
}

Zero-copy (Rust only)

fn identifier_slice<'a>() -> Parser<'a, &'a str> {
 Box::new(|input: &'a str| {
     let mut chars = input.chars();
     match chars.next() {
         Some(c) if c.is_ascii_alphabetic() || c == '_' => {
             let mut end = c.len_utf8();
             for ch in chars {
                 if ch.is_ascii_alphanumeric() || ch == '_' { end += ch.len_utf8(); }
                 else { break; }
             }
             Ok((&input[..end], &input[end..]))
         }
         _ => Err("Expected identifier".to_string()),
     }
 })
}