String.split_on_char — Tokenize a String
Tutorial
The Problem
Split a string on a delimiter character to produce a list of tokens, then filter out empty tokens that arise from consecutive delimiters or leading and trailing whitespace. This operation is fundamental to parsing CSV lines, tokenizing user input, and processing any character-delimited data format. The example covers the primary string splitting primitive in each language and shows how filtering empty strings integrates naturally into the functional pipeline.
🎯 Learning Outcomes
- How String.split_on_char ',' s maps to Rust's s.split(',') — both split on a single character delimiter and preserve empty strings between consecutive delimiters
- How List.filter (fun s -> s <> "") tokens maps to Rust's .filter(|s| !s.is_empty()) — the same concept expressed as a list operation vs. an iterator adapter
- How Rust's split returns a lazy Split iterator rather than an allocated Vec, and how to materialize it with .collect::<Vec<_>>() when a concrete collection is needed
- When to use str::split(char) for single-character delimiters and str::split_whitespace() for collapsing any run of whitespace — both useful, with different semantics
- How List.iteri maps to .iter().enumerate() in Rust for index-aware iteration, a pattern that appears repeatedly when processing tokenized data

Code Example
let csv_line = "Alice,30,Engineer,Amsterdam";
let fields: Vec<&str> = csv_line.split(',').collect();
for (i, f) in fields.iter().enumerate() {
    println!("Field {}: {}", i, f);
}
let words: Vec<&str> = " hello world "
    .split_whitespace()
    .collect();

Key Differences
- Allocation: String.split_on_char immediately allocates and returns a string list; Rust's str::split returns a lazy iterator — allocation happens only when .collect() is called, and if you only need to iterate, you can avoid allocation entirely.
- Argument order: OCaml's String.split_on_char delimiter string takes the delimiter first and the string second (pipe-friendly); Rust's string.split(delimiter) is a method on the string with the delimiter as argument — the string is the receiver in Rust's method call syntax.
- Filtering empties: OCaml uses List.filter (fun s -> s <> ""); Rust uses .filter(|s| !s.is_empty()) as an iterator adapter, or the separate split_whitespace() method, which collapses runs automatically.
- Ownership: OCaml's list holds owned string values; Rust's split returns &str slices that borrow from the original string, which is more memory-efficient but means the resulting slices cannot outlive the source string without cloning.

OCaml Approach
OCaml's String.split_on_char : char -> string -> string list (added in OCaml 4.04) takes the delimiter character first and the string second, returning a string list. It preserves empty strings between consecutive delimiters, so String.split_on_char ',' "a,,b" returns ["a"; ""; "b"]. Empty tokens from surrounding whitespace are removed with List.filter (fun s -> s <> "") tokens. Index-aware printing uses List.iteri (fun i f -> ...) fields, which passes the zero-based index alongside each element. All operations produce new values; no mutation occurs.
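Rust's split, by contrast, yields &str slices that borrow from the source. When the tokens must outlive the source string — the situation OCaml's owned string list handles automatically — each slice can be cloned into an owned String. A minimal sketch (the helper name split_owned is ours, not part of the example's API):

```rust
/// Split on `delim`, drop empty tokens, and clone each borrowed `&str`
/// slice into an owned `String` so the result can outlive the source.
pub fn split_owned(s: &str, delim: char) -> Vec<String> {
    s.split(delim)
        .filter(|t| !t.is_empty())
        .map(str::to_owned) // one allocation per token
        .collect()
}
```

This trades OCaml-style ownership for an explicit per-token allocation; when borrowing suffices, prefer Vec<&str>.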
Full Source
#![allow(dead_code)]
//! String.split_on_char — Tokenize a String
//! See example.ml for OCaml reference
//!
//! OCaml's `String.split_on_char delim s` splits a string on a single character delimiter.
//! Rust's `str::split(delim)` is the direct equivalent — both preserve empty strings between
//! consecutive delimiters.
/// Idiomatic Rust: split a string on a delimiter character, preserving empty tokens.
/// Mirrors OCaml: `String.split_on_char delim s`
pub fn split_on_char(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).collect()
}

/// Split and filter out empty tokens.
/// Mirrors OCaml: `List.filter (fun s -> s <> "") (String.split_on_char delim s)`
pub fn split_nonempty(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).filter(|t| !t.is_empty()).collect()
}

/// Split on whitespace, dropping empty tokens (equivalent to OCaml's `String.split_on_char ' '`
/// followed by filtering, but handles any run of whitespace in a single step).
pub fn tokenize(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Parse a CSV record: split on commas and trim each field.
pub fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').map(str::trim).collect()
}

/// Split only on the first occurrence of `delim`.
/// Returns `(before, after)` or `None` if delimiter not found.
/// Uses `str::split_once` — the idiomatic Rust approach.
pub fn split_first_occurrence(s: &str, delim: char) -> Option<(&str, &str)> {
    s.split_once(delim)
}
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_split_empty_string() {
        // Splitting an empty string on any delimiter gives one empty token.
        assert_eq!(split_on_char("", ','), vec![""]);
    }

    #[test]
    fn test_split_single_field() {
        assert_eq!(split_on_char("hello", ','), vec!["hello"]);
    }

    #[test]
    fn test_split_csv_line() {
        let fields = split_on_char("Alice,30,Engineer,Amsterdam", ',');
        assert_eq!(fields, vec!["Alice", "30", "Engineer", "Amsterdam"]);
    }

    #[test]
    fn test_split_preserves_empty_tokens() {
        // Consecutive delimiters produce an empty string between them.
        let result = split_on_char("a,,b", ',');
        assert_eq!(result, vec!["a", "", "b"]);
    }

    #[test]
    fn test_split_nonempty_removes_empty_tokens() {
        let result = split_nonempty(" hello world ", ' ');
        assert_eq!(result, vec!["hello", "world"]);
    }

    #[test]
    fn test_tokenize_whitespace() {
        assert_eq!(tokenize(" hello world "), vec!["hello", "world"]);
    }

    #[test]
    fn test_parse_csv_trims_whitespace() {
        let result = parse_csv(" Alice , 30 , Engineer ");
        assert_eq!(result, vec!["Alice", "30", "Engineer"]);
    }

    #[test]
    fn test_split_first_occurrence_found() {
        assert_eq!(
            split_first_occurrence("key=value=extra", '='),
            Some(("key", "value=extra"))
        );
    }

    #[test]
    fn test_split_first_occurrence_not_found() {
        assert_eq!(split_first_occurrence("no-delimiter-here", '='), None);
    }
}
Deep Comparison
OCaml vs Rust: String.split_on_char — Tokenize a String
Side-by-Side Code
OCaml
let csv_line = "Alice,30,Engineer,Amsterdam"
let fields = String.split_on_char ',' csv_line
let () = List.iteri (fun i f -> Printf.printf "Field %d: %s\n" i f) fields
let words = String.split_on_char ' ' " hello world "
let nonempty = List.filter (fun s -> s <> "") words
Rust (idiomatic)
let csv_line = "Alice,30,Engineer,Amsterdam";
let fields: Vec<&str> = csv_line.split(',').collect();
for (i, f) in fields.iter().enumerate() {
    println!("Field {}: {}", i, f);
}
let words: Vec<&str> = " hello world "
    .split_whitespace()
    .collect();
Rust (functional pipeline)
pub fn split_nonempty(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).filter(|t| !t.is_empty()).collect()
}

pub fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').map(str::trim).collect()
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| split | String.split_on_char : char -> string -> string list | str::split(pattern) -> Split<'_, char> (lazy iterator) |
| result type | string list (owned, allocated) | Vec<&str> (borrows from source) |
| filter empty | List.filter (fun s -> s <> "") tokens | .filter(\|s\| !s.is_empty()) (iterator adapter) |
| index iteration | List.iteri (fun i f -> ...) fields | fields.iter().enumerate() |
| whitespace split | split ' ' + filter | str::split_whitespace() (built-in) |
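The last table row is worth making concrete: split(' ') keeps the empty tokens produced by leading, trailing, and doubled spaces, while split_whitespace collapses them. A small illustrative helper (count_space_tokens is a hypothetical name, not from the source):

```rust
/// Compare naive single-space splitting with whitespace collapsing.
/// Returns (token count from `split(' ')`, token count from `split_whitespace()`).
pub fn count_space_tokens(s: &str) -> (usize, usize) {
    let naive = s.split(' ').count(); // keeps empty tokens at edges and in runs
    let collapsed = s.split_whitespace().count(); // drops them
    (naive, collapsed)
}
```

For " hello  world " (one leading, two inner, one trailing space) the naive count is 5 — three of the tokens are empty strings — while the collapsed count is 2.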
Key Insights
- String.split_on_char immediately allocates and returns a string list; Rust's str::split returns a lazy Split<'_, char> iterator — no allocation until .collect() is called, and you can chain further adapters without intermediate collections.
- OCaml's String.split_on_char delimiter string takes the delimiter first (pipe-friendly); Rust's string.split(delimiter) is a method on the string, with the delimiter as argument.
- OCaml's list holds owned string values; Rust's split returns &str slices that borrow from the original string — zero-copy, but the slices cannot outlive the source without cloning.
- Empty tokens are filtered with List.filter (fun s -> s <> "") in OCaml or .filter(|s| !s.is_empty()) in Rust.
- Rust also provides str::split_once(delim), which returns Option<(&str, &str)> for the first occurrence — a common pattern with no direct OCaml stdlib equivalent.

When to Use Each Style
**Use .split().collect() when:** you need a Vec<&str> to index into or pass around.
**Use .split().filter()... as a lazy chain when:** you only need to iterate — avoid materializing a Vec if you process the tokens in a single pass.
**Use split_whitespace() when:** splitting on any whitespace and ignoring runs of spaces — it's shorter and clearer than split(' ').filter(|s| !s.is_empty()).
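The lazy-chain advice can be sketched concretely: summing the numeric fields of a CSV line in one pass with no intermediate Vec. (sum_fields is a hypothetical example, not part of the source listing; non-numeric tokens are simply skipped here.)

```rust
/// Sum the numeric fields of a comma-separated line in a single lazy pass.
/// No `Vec` is ever materialized: each token is trimmed and parsed as the
/// iterator produces it, and tokens that fail to parse are skipped.
pub fn sum_fields(line: &str) -> i64 {
    line.split(',')
        .filter_map(|t| t.trim().parse::<i64>().ok())
        .sum()
}
```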
Exercises
1. Write parse_csv_record(line: &str) -> Vec<&str> that splits on commas and trims leading and trailing whitespace from each field using .map(str::trim) in the iterator chain. Handle quoted fields containing commas as a stretch goal.
2. Write word_count(text: &str) -> std::collections::HashMap<&str, usize> that splits on whitespace, filters empty tokens, and counts the occurrences of each word using HashMap::entry(...).and_modify(...).or_insert(1).
3. Write split_first(s: &str, delim: char) -> Option<(&str, &str)> that splits on the first occurrence of delim and returns Some((before, after)), or None if the delimiter is not present. Use str::splitn(2, delim) and pattern-match on the resulting iterator to extract both parts.