String.split_on_char — Tokenize a String
Tutorial
The Problem
Split a string on a delimiter character to produce a list of tokens, then filter out empty tokens that arise from consecutive delimiters or leading and trailing whitespace. This operation is fundamental to parsing CSV lines, tokenizing user input, and processing any character-delimited data format. The example covers the primary string splitting primitive in each language and shows how filtering empty strings integrates naturally into the functional pipeline.
🎯 Learning Outcomes
- How String.split_on_char ',' s maps to Rust's s.split(',') — both split on a single character delimiter and preserve empty strings between consecutive delimiters
- How List.filter (fun s -> s <> "") tokens maps to Rust's .filter(|s| !s.is_empty()) — the same concept expressed as a list operation vs. an iterator adapter
- How Rust's split returns a lazy Split iterator rather than an allocated Vec, and how to materialize it with .collect::<Vec<_>>() when a concrete collection is needed
- When to use str::split(char) for single-character delimiters and str::split_whitespace() for collapsing any run of whitespace — both useful, with different semantics
- How List.iteri maps to .iter().enumerate() in Rust for index-aware iteration, a pattern that appears repeatedly when processing tokenized data

Code Example
let csv_line = "Alice,30,Engineer,Amsterdam";
let fields: Vec<&str> = csv_line.split(',').collect();
for (i, f) in fields.iter().enumerate() {
    println!("Field {}: {}", i, f);
}
let words: Vec<&str> = " hello world "
    .split_whitespace()
    .collect();

Key Differences
- Allocation: String.split_on_char immediately allocates and returns a string list; Rust's str::split returns a lazy iterator — allocation happens only when .collect() is called, and if you only need to iterate, you can avoid allocation entirely.
- Argument order: OCaml's String.split_on_char delimiter string takes the delimiter first and the string second (pipe-friendly); Rust's string.split(delimiter) is a method on the string with the delimiter as argument — the string is the receiver in Rust's method call syntax.
- Filtering empties: OCaml uses List.filter (fun s -> s <> ""); Rust uses .filter(|s| !s.is_empty()) as an iterator adapter, or the separate split_whitespace() method, which collapses runs automatically.
- Ownership: OCaml's list holds owned string values; Rust's split returns &str slices that borrow from the original string, which is more memory-efficient but means the resulting slices cannot outlive the source string without cloning.

OCaml Approach
OCaml's String.split_on_char : char -> string -> string list (added in OCaml 4.04) takes the delimiter character first and the string second, returning a string list. It preserves empty strings between consecutive delimiters, so String.split_on_char ',' "a,,b" returns ["a"; ""; "b"]. Empty tokens from surrounding whitespace are removed with List.filter (fun s -> s <> "") tokens. Index-aware printing uses List.iteri (fun i f -> ...) fields, which passes the zero-based index alongside each element. All operations produce new values; no mutation occurs.
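Rust's split, by contrast, yields &str slices that borrow from the source. When the tokens must outlive the source string — the situation OCaml's owned string list handles automatically — each slice can be cloned into an owned String. A minimal sketch (the helper name split_owned is ours, not part of the example's API):

```rust
/// Split on `delim`, drop empty tokens, and clone each borrowed `&str`
/// slice into an owned `String` so the result can outlive the source.
pub fn split_owned(s: &str, delim: char) -> Vec<String> {
    s.split(delim)
        .filter(|t| !t.is_empty())
        .map(str::to_owned) // one allocation per token
        .collect()
}
```

This trades OCaml-style ownership for an explicit per-token allocation; when borrowing suffices, prefer Vec<&str>.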
Full Source
#![allow(dead_code)]
//! String.split_on_char — Tokenize a String
//! See example.ml for OCaml reference
//!
//! OCaml's `String.split_on_char delim s` splits a string on a single character delimiter.
//! Rust's `str::split(delim)` is the direct equivalent — both preserve empty strings between
//! consecutive delimiters.
/// Idiomatic Rust: split a string on a delimiter character, preserving empty tokens.
/// Mirrors OCaml: `String.split_on_char delim s`
pub fn split_on_char(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).collect()
}

/// Split and filter out empty tokens.
/// Mirrors OCaml: `List.filter (fun s -> s <> "") (String.split_on_char delim s)`
pub fn split_nonempty(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).filter(|t| !t.is_empty()).collect()
}

/// Split on whitespace, dropping empty tokens (equivalent to OCaml's `String.split_on_char ' '`
/// followed by filtering, but handles any run of whitespace in a single step).
pub fn tokenize(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Parse a CSV record: split on commas and trim each field.
pub fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').map(str::trim).collect()
}

/// Split only on the first occurrence of `delim`.
/// Returns `(before, after)` or `None` if delimiter not found.
/// Uses `str::split_once` — the idiomatic Rust approach.
pub fn split_first_occurrence(s: &str, delim: char) -> Option<(&str, &str)> {
    s.split_once(delim)
}
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_split_empty_string() {
        // Splitting an empty string on any delimiter gives one empty token.
        assert_eq!(split_on_char("", ','), vec![""]);
    }

    #[test]
    fn test_split_single_field() {
        assert_eq!(split_on_char("hello", ','), vec!["hello"]);
    }

    #[test]
    fn test_split_csv_line() {
        let fields = split_on_char("Alice,30,Engineer,Amsterdam", ',');
        assert_eq!(fields, vec!["Alice", "30", "Engineer", "Amsterdam"]);
    }

    #[test]
    fn test_split_preserves_empty_tokens() {
        // Consecutive delimiters produce an empty string between them.
        let result = split_on_char("a,,b", ',');
        assert_eq!(result, vec!["a", "", "b"]);
    }

    #[test]
    fn test_split_nonempty_removes_empty_tokens() {
        let result = split_nonempty(" hello world ", ' ');
        assert_eq!(result, vec!["hello", "world"]);
    }

    #[test]
    fn test_tokenize_whitespace() {
        assert_eq!(tokenize(" hello world "), vec!["hello", "world"]);
    }

    #[test]
    fn test_parse_csv_trims_whitespace() {
        let result = parse_csv(" Alice , 30 , Engineer ");
        assert_eq!(result, vec!["Alice", "30", "Engineer"]);
    }

    #[test]
    fn test_split_first_occurrence_found() {
        assert_eq!(
            split_first_occurrence("key=value=extra", '='),
            Some(("key", "value=extra"))
        );
    }

    #[test]
    fn test_split_first_occurrence_not_found() {
        assert_eq!(split_first_occurrence("no-delimiter-here", '='), None);
    }
}
Deep Comparison
OCaml vs Rust: String.split_on_char — Tokenize a String
Side-by-Side Code
OCaml
let csv_line = "Alice,30,Engineer,Amsterdam"
let fields = String.split_on_char ',' csv_line
let () = List.iteri (fun i f -> Printf.printf "Field %d: %s\n" i f) fields
let words = String.split_on_char ' ' " hello world "
let nonempty = List.filter (fun s -> s <> "") words
Rust (idiomatic)
let csv_line = "Alice,30,Engineer,Amsterdam";
let fields: Vec<&str> = csv_line.split(',').collect();
for (i, f) in fields.iter().enumerate() {
    println!("Field {}: {}", i, f);
}
let words: Vec<&str> = " hello world "
    .split_whitespace()
    .collect();
Rust (functional pipeline)
pub fn split_nonempty(s: &str, delim: char) -> Vec<&str> {
    s.split(delim).filter(|t| !t.is_empty()).collect()
}

pub fn parse_csv(line: &str) -> Vec<&str> {
    line.split(',').map(str::trim).collect()
}
Type Signatures
| Concept | OCaml | Rust |
|---|---|---|
| split | String.split_on_char : char -> string -> string list | str::split(pattern) -> Split<'_, char> (lazy iterator) |
| result type | string list (owned, allocated) | Vec<&str> (borrows from source) |
| filter empty | List.filter (fun s -> s <> "") tokens | .filter(\|s\| !s.is_empty()) (iterator adapter) |
| index iteration | List.iteri (fun i f -> ...) fields | fields.iter().enumerate() |
| whitespace split | split ' ' + filter | str::split_whitespace() (built-in) |
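The last table row is worth making concrete: split(' ') keeps the empty tokens produced by leading, trailing, and doubled spaces, while split_whitespace collapses them. A small illustrative helper (count_space_tokens is a hypothetical name, not from the source):

```rust
/// Compare naive single-space splitting with whitespace collapsing.
/// Returns (token count from `split(' ')`, token count from `split_whitespace()`).
pub fn count_space_tokens(s: &str) -> (usize, usize) {
    let naive = s.split(' ').count(); // keeps empty tokens at edges and in runs
    let collapsed = s.split_whitespace().count(); // drops them
    (naive, collapsed)
}
```

For " hello  world " (one leading, two inner, one trailing space) the naive count is 5 — three of the tokens are empty strings — while the collapsed count is 2.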
Key Insights
- String.split_on_char immediately allocates and returns a string list; Rust's str::split returns a lazy Split<'_, char> iterator — no allocation until .collect() is called, and you can chain further adapters without intermediate collections.
- OCaml's String.split_on_char delimiter string takes the delimiter first (pipe-friendly); Rust's string.split(delimiter) is a method on the string, with the delimiter as argument.
- OCaml's list holds owned string values; Rust's split returns &str slices that borrow from the original string — zero-copy, but the slices cannot outlive the source without cloning.
- Empty tokens are filtered with List.filter (fun s -> s <> "") in OCaml or .filter(|s| !s.is_empty()) in Rust.
- Rust also provides str::split_once(delim), which returns Option<(&str, &str)> for the first occurrence — a common pattern with no direct OCaml stdlib equivalent.

When to Use Each Style
**Use .split().collect() when:** you need a Vec<&str> to index into or pass around.
**Use .split().filter()... as a lazy chain when:** you only need to iterate — avoid materializing a Vec if you process the tokens in a single pass.
**Use split_whitespace() when:** splitting on any whitespace and ignoring runs of spaces — it's shorter and clearer than split(' ').filter(|s| !s.is_empty()).
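The lazy-chain advice can be sketched concretely: summing the numeric fields of a CSV line in one pass with no intermediate Vec. (sum_fields is a hypothetical example, not part of the source listing; non-numeric tokens are simply skipped here.)

```rust
/// Sum the numeric fields of a comma-separated line in a single lazy pass.
/// No `Vec` is ever materialized: each token is trimmed and parsed as the
/// iterator produces it, and tokens that fail to parse are skipped.
pub fn sum_fields(line: &str) -> i64 {
    line.split(',')
        .filter_map(|t| t.trim().parse::<i64>().ok())
        .sum()
}
```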
Exercises
1. Write parse_csv_record(line: &str) -> Vec<&str> that splits on commas and trims leading and trailing whitespace from each field using .map(str::trim) in the iterator chain. Handle quoted fields containing commas as a stretch goal.
2. Write word_count(text: &str) -> std::collections::HashMap<&str, usize> that splits on whitespace, filters empty tokens, and counts the occurrences of each word using HashMap::entry(...).and_modify(...).or_insert(1).
3. Write split_first(s: &str, delim: char) -> Option<(&str, &str)> that splits on the first occurrence of delim and returns Some((before, after)), or None if the delimiter is not present. Use str::splitn(2, delim) and pattern-match on the resulting iterator to extract both parts.