162: Identifier Parser
Difficulty: โญโญโญ Level: Foundations Parse programming language identifiers โ and discover Rust's unique zero-copy `&str` trick along the way.The Problem This Solves
Parsing identifiers is the first thing any language parser does. Before you can parse `let x = 42`, you need to recognize that `x` is an identifier (starts with a letter or `_`, continues with letters, digits, or `_`) and that `let` is a keyword (same syntax, but reserved). This example brings together everything from the series โ `satisfy`, `many0`, `map`, and a new Rust-specific technique: returning `&str` directly instead of a `String`. In most parsers, the identifier is returned as an owned `String`. But Rust lets you return a slice of the original input โ a `&'a str` that borrows from the input without any allocation. This is genuinely unique to Rust's ownership model.The Intuition
An identifier rule is: start with `[a-zA-Z_]`, continue with zero or more `[a-zA-Z0-9_]`. That's `satisfy(start_pred)` followed by `many0(satisfy(continue_pred))`. Collect the results into a string, and you have an identifier parser. The zero-copy version is more interesting. Instead of building a `String` character by character, it scans the input and asks: "How many bytes from the start of this `&str` are valid identifier characters?" Then it returns `&input[..end]` โ a slice into the original string that shares its memory. Zero allocations. Reserved word filtering is a semantic layer on top. Syntactically, `let` is a valid identifier. Semantically, it's not โ your language reserves it. You parse the identifier first, then check: if it's in the reserved list, reject it with an error.How It Works in Rust
Approach 1 โ allocating with `String`:fn identifier<'a>() -> Parser<'a, String> {
Box::new(|input: &'a str| {
// First char: letter or underscore
let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
let (first, rest) = start(input)?;
// Remaining chars: letter, digit, or underscore
let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
let (chars, rem) = cont(rest)?;
// Build the String by collecting
let mut s = String::with_capacity(1 + chars.len());
s.push(first);
for c in chars { s.push(c); }
Ok((s, rem))
})
}
// identifier()("hello world") = Ok(("hello", " world"))
// identifier()("_foo bar") = Ok(("_foo", " bar"))
// identifier()("x1y2z3!") = Ok(("x1y2z3", "!"))
// identifier()("123") = Err โ can't start with digit
Approach 2 โ zero-copy with `&str` slice:
fn identifier_slice<'a>() -> Parser<'a, &'a str> {
Box::new(|input: &'a str| {
let mut chars = input.chars();
match chars.next() {
Some(c) if c.is_ascii_alphabetic() || c == '_' => {
let mut end = c.len_utf8(); // start after first char
for ch in chars {
if ch.is_ascii_alphanumeric() || ch == '_' {
end += ch.len_utf8(); // accumulate byte length
} else {
break;
}
}
// Return a slice of the original input โ zero allocation
Ok((&input[..end], &input[end..]))
}
_ => Err("Expected identifier".to_string()),
}
})
}
// identifier_slice()("myVar = 5") = Ok(("myVar", " = 5"))
// The returned "myVar" is NOT a new String โ it's a pointer into the original input
The `end` variable accumulates the byte length of the identifier. `&input[..end]` is a zero-copy reference to that prefix. This is idiomatic Rust and not possible in garbage-collected languages.
Approach 3 โ reserved word rejection:
fn identifier_not_reserved<'a>(reserved: &[&str]) -> Parser<'a, String> {
let reserved: Vec<String> = reserved.iter().map(|s| s.to_string()).collect();
Box::new(move |input: &'a str| {
let (name, rest) = identifier()(input)?; // parse syntactically valid identifier
if reserved.iter().any(|r| r == &name) {
Err(format!("'{}' is a reserved word", name)) // semantic rejection
} else {
Ok((name, rest))
}
})
}
// p("myVar") = Ok(("myVar", ""))
// p("let") = Err("'let' is a reserved word")
What This Unlocks
- Language frontend parsing โ identifiers are the most common token in any language parser; this is the production-ready version.
- Zero-copy tokenization โ `identifier_slice` returns views into the original source, making it practical to tokenize large files without heap pressure.
- Keyword discrimination โ combining syntactic parsing with reserved-word filtering separates the lexer's job (what can be an identifier) from the parser's job (what should be one).
Key Differences
| Concept | OCaml | Rust | ||
|---|---|---|---|---|
| String building | `String.make 1 c ^ String.init n (fun i -> ...)` | `String::new()` + `push` | ||
| Zero-copy output | Not idiomatic (strings are GC values) | `&'a str` โ a slice into the original input, no allocation | ||
| Reserved word check | `List.mem name reserved` | `reserved.iter().any(\ | r\ | r == &name)` |
| Char classification | Manual: `c >= 'a' && c <= 'z' \ | \ | ...` | Built-in: `is_ascii_alphabetic()`, `is_ascii_alphanumeric()` |
| Byte advancement | `String.length c` (always 1 in OCaml bytes) | `c.len_utf8()` โ correct for multi-byte Unicode |
// Example 162: Identifier Parser
// Parse identifiers: letter followed by alphanumeric/underscore
type ParseResult<'a, T> = Result<(T, &'a str), String>;
type Parser<'a, T> = Box<dyn Fn(&'a str) -> ParseResult<'a, T> + 'a>;
fn satisfy<'a, F>(pred: F, desc: &str) -> Parser<'a, char>
where F: Fn(char) -> bool + 'a {
let desc = desc.to_string();
Box::new(move |input: &'a str| match input.chars().next() {
Some(c) if pred(c) => Ok((c, &input[c.len_utf8()..])),
_ => Err(format!("Expected {}", desc)),
})
}
fn many0<'a, T: 'a>(p: Parser<'a, T>) -> Parser<'a, Vec<T>> {
Box::new(move |mut input: &'a str| {
let mut v = Vec::new();
while let Ok((val, r)) = p(input) { v.push(val); input = r; }
Ok((v, input))
})
}
// ============================================================
// Approach 1: Direct identifier parser
// ============================================================
fn identifier<'a>() -> Parser<'a, String> {
Box::new(|input: &'a str| {
let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
let (first, rest) = start(input)?;
let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
let (chars, rem) = cont(rest)?;
let mut s = String::with_capacity(1 + chars.len());
s.push(first);
for c in chars { s.push(c); }
Ok((s, rem))
})
}
// ============================================================
// Approach 2: Zero-copy identifier (returns &str slice)
// ============================================================
fn identifier_slice<'a>() -> Parser<'a, &'a str> {
Box::new(|input: &'a str| {
let mut chars = input.chars();
match chars.next() {
Some(c) if c.is_ascii_alphabetic() || c == '_' => {
let mut end = c.len_utf8();
for ch in chars {
if ch.is_ascii_alphanumeric() || ch == '_' {
end += ch.len_utf8();
} else {
break;
}
}
Ok((&input[..end], &input[end..]))
}
_ => Err("Expected identifier".to_string()),
}
})
}
// ============================================================
// Approach 3: With reserved word checking
// ============================================================
fn identifier_not_reserved<'a>(reserved: &[&str]) -> Parser<'a, String> {
let reserved: Vec<String> = reserved.iter().map(|s| s.to_string()).collect();
Box::new(move |input: &'a str| {
let (name, rest) = identifier()(input)?;
if reserved.iter().any(|r| r == &name) {
Err(format!("'{}' is a reserved word", name))
} else {
Ok((name, rest))
}
})
}
fn main() {
println!("=== identifier ===");
let p = identifier();
println!("{:?}", p("hello world")); // Ok(("hello", " world"))
println!("{:?}", p("_foo bar")); // Ok(("_foo", " bar"))
println!("{:?}", p("x1y2z3!")); // Ok(("x1y2z3", "!"))
println!("{:?}", p("123")); // Err
println!("\n=== identifier_slice (zero-copy) ===");
let p = identifier_slice();
println!("{:?}", p("myVar = 5")); // Ok(("myVar", " = 5"))
println!("\n=== not reserved ===");
let p = identifier_not_reserved(&["let", "if", "else", "fn"]);
println!("{:?}", p("myVar")); // Ok
println!("{:?}", p("let")); // Err
println!("\nโ All examples completed");
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_identifier_simple() {
assert_eq!(identifier()("hello world"), Ok(("hello".into(), " world")));
}
#[test]
fn test_identifier_underscore_start() {
assert_eq!(identifier()("_foo bar"), Ok(("_foo".into(), " bar")));
}
#[test]
fn test_identifier_mixed() {
assert_eq!(identifier()("x1y2z3!"), Ok(("x1y2z3".into(), "!")));
}
#[test]
fn test_identifier_single_underscore() {
assert_eq!(identifier()("_"), Ok(("_".into(), "")));
}
#[test]
fn test_identifier_starts_with_digit() {
assert!(identifier()("123").is_err());
}
#[test]
fn test_identifier_slice() {
assert_eq!(identifier_slice()("myVar = 5"), Ok(("myVar", " = 5")));
}
#[test]
fn test_not_reserved_ok() {
let p = identifier_not_reserved(&["let", "if"]);
assert_eq!(p("myVar"), Ok(("myVar".into(), "")));
}
#[test]
fn test_not_reserved_blocked() {
let p = identifier_not_reserved(&["let", "if"]);
assert!(p("let").is_err());
}
#[test]
fn test_identifier_empty() {
assert!(identifier()("").is_err());
}
}
(* Example 162: Identifier Parser *)
(* Parse identifiers: letter followed by alphanumeric/underscore *)
type 'a parse_result = ('a * string, string) result
type 'a parser = string -> 'a parse_result
let satisfy pred desc : char parser = fun input ->
if String.length input > 0 && pred input.[0] then
Ok (input.[0], String.sub input 1 (String.length input - 1))
else Error (Printf.sprintf "Expected %s" desc)
let many0 p : 'a list parser = fun input ->
let rec go acc r = match p r with Ok (v, r') -> go (v::acc) r' | Error _ -> Ok (List.rev acc, r)
in go [] input
let map f p : 'b parser = fun input ->
match p input with Ok (v, r) -> Ok (f v, r) | Error e -> Error e
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'
let is_ident_start c = is_letter c || c = '_'
let is_ident_char c = is_letter c || is_digit c || c = '_'
(* Approach 1: Direct implementation *)
let identifier : string parser = fun input ->
match satisfy is_ident_start "letter or _" input with
| Error e -> Error e
| Ok (first, rest) ->
match many0 (satisfy is_ident_char "alphanumeric or _") rest with
| Ok (chars, rem) ->
let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
Ok (s, rem)
| Error e -> Error e
(* Approach 2: Using combinators *)
let identifier2 : string parser =
let start = satisfy is_ident_start "letter or _" in
let cont = many0 (satisfy is_ident_char "alphanumeric or _") in
fun input ->
match start input with
| Error e -> Error e
| Ok (first, rest) ->
match cont rest with
| Ok (chars, rem) ->
let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
Ok (s, rem)
| Error e -> Error e
(* Approach 3: With reserved word checking *)
let reserved_words = ["let"; "in"; "if"; "then"; "else"; "match"; "with"; "fun"]
let identifier_not_reserved : string parser = fun input ->
match identifier input with
| Ok (name, rest) when not (List.mem name reserved_words) -> Ok (name, rest)
| Ok (name, _) -> Error (Printf.sprintf "'%s' is a reserved word" name)
| Error e -> Error e
(* Tests *)
let () =
assert (identifier "hello world" = Ok ("hello", " world"));
assert (identifier "_foo bar" = Ok ("_foo", " bar"));
assert (identifier "x1y2z3!" = Ok ("x1y2z3", "!"));
assert (identifier "_" = Ok ("_", ""));
assert (Result.is_error (identifier "123"));
assert (Result.is_error (identifier ""));
assert (identifier_not_reserved "myVar" = Ok ("myVar", ""));
assert (Result.is_error (identifier_not_reserved "let"));
assert (Result.is_error (identifier_not_reserved "if"));
print_endline "โ All tests passed"
๐ Detailed Comparison
Comparison: Example 162 โ Identifier Parser
Allocating identifier
OCaml:
๐ช Show OCaml equivalent
let identifier : string parser = fun input ->
match satisfy is_ident_start "letter or _" input with
| Error e -> Error e
| Ok (first, rest) ->
match many0 (satisfy is_ident_char "alphanumeric or _") rest with
| Ok (chars, rem) ->
let s = String.make 1 first ^ String.init (List.length chars) (List.nth chars) in
Ok (s, rem)
| Error e -> Error e
Rust:
fn identifier<'a>() -> Parser<'a, String> {
Box::new(|input: &'a str| {
let start = satisfy(|c| c.is_ascii_alphabetic() || c == '_', "letter or _");
let (first, rest) = start(input)?;
let cont = many0(satisfy(|c| c.is_ascii_alphanumeric() || c == '_', "ident char"));
let (chars, rem) = cont(rest)?;
let mut s = String::with_capacity(1 + chars.len());
s.push(first);
for c in chars { s.push(c); }
Ok((s, rem))
})
}Zero-copy (Rust only)
fn identifier_slice<'a>() -> Parser<'a, &'a str> {
Box::new(|input: &'a str| {
let mut chars = input.chars();
match chars.next() {
Some(c) if c.is_ascii_alphabetic() || c == '_' => {
let mut end = c.len_utf8();
for ch in chars {
if ch.is_ascii_alphanumeric() || ch == '_' { end += ch.len_utf8(); }
else { break; }
}
Ok((&input[..end], &input[end..]))
}
_ => Err("Expected identifier".to_string()),
}
})
}