๐Ÿฆ€ Functional Rust

480: chars() and Char-Level Operations

Difficulty: 1 Level: Intermediate Iterate Unicode scalar values correctly โ€” and why you can't index a Rust string with `[i]`.

The Problem This Solves

In Python, `"cafรฉ"[3]` gives you `'รฉ'`. In JavaScript, `"cafรฉ"[3]` gives you `'รฉ'` too (at least for Basic Multilingual Plane characters). It just works. In Rust, `"cafรฉ"[3]` is a compile error. You can't index a string with `[]`. Why? Because Rust strings are UTF-8 bytes. `"cafรฉ"` is 5 bytes, not 4. The `รฉ` takes 2 bytes. If Rust let you write `s[3]`, what would you get โ€” a byte? Half a character? A panic? The designers decided: if the operation is ambiguous or potentially unsound, it shouldn't compile. The correct approach: `.chars()` gives you an iterator of Unicode scalar values (Rust's `char` โ€” a 4-byte value covering all of Unicode). Use `.chars().nth(3)` instead of `s[3]`. Use `.chars().count()` instead of `s.len()` when you want character count. Use `.chars().filter()`, `.chars().map()`, `.collect()` for character-level transformations.

The Intuition

`chars()` is "iterate this string as a sequence of Unicode code points." Each element is a `char` โ€” a 32-bit value representing one Unicode scalar value (emoji ๐ŸŒ included). Python's `for c in "cafรฉ"` does the same thing. The difference: Rust makes it explicit that you're iterating chars, not bytes. You can't accidentally iterate bytes when you meant characters. The workflow for character transformations: iterate with `.chars()`, transform with iterator adapters (`map`, `filter`, `rev`, `enumerate`), then collect back to a `String` with `.collect()`. This is idiomatic, composable, and clear. Key limitation: `.chars()` doesn't give you grapheme clusters. `"e\u{0301}"` (e + combining accent) is two chars but one user-perceived character. For grapheme clusters, use the `unicode-segmentation` crate.

How It Works in Rust

let s = "Hello, World! ๐ŸŒ";

// len() = byte count, chars().count() = Unicode scalar count
println!("{} bytes, {} chars", s.len(), s.chars().count());
// 18 bytes, 15 chars  (๐ŸŒ is 4 bytes, 1 char)

// chars() with enumerate โ€” gives index + char
for (i, c) in s.chars().enumerate().take(5) {
 println!("[{}] '{}' U+{:04X}", i, c, c as u32);
}

// Map โ€” transform each char, collect back to String
let upper: String = s.chars()
 .map(|c| c.to_uppercase().next().unwrap())
 .collect();

// Filter โ€” keep only alphabetic chars
let alpha: String = s.chars()
 .filter(|c| c.is_alphabetic())
 .collect();  // "HelloWorld"

// Reverse โ€” works correctly for multi-byte chars!
let rev: String = s.chars().rev().collect();
// String reversal by byte index would break UTF-8

// nth โ€” O(n) but correct for Unicode
s.chars().nth(2)  // Some('l')

// Can't do: s[2]  โ† compile error
// Can't do: s[2..3]  โ† panics if not on char boundary

What This Unlocks

Key Differences

ConceptOCamlRust
Iterate characters`String.iter s` (bytes in practice)`s.chars()` (Unicode scalar values)
Index by char`s.[i]` (byte, not char)`s.chars().nth(i)` โ€” `Option<char>`
Character countManual UTF-8 decode`s.chars().count()`
Map over chars`String.map f s``s.chars().map(f).collect::<String>()`
ReverseCustom loop`s.chars().rev().collect()`
Filter chars`String.concat "" (List.filter ...)``s.chars().filter(\c\...).collect()`
Direct indexing`s.[i]` (unsafe for UTF-8)Compile error โ€” intentionally disallowed
// 480. chars() and char-level operations
fn main() {
    let s = "Hello, World! ๐ŸŒ";
    println!("bytes={} chars={}", s.len(), s.chars().count());

    // Enumerate first 5 chars
    for (i,c) in s.chars().enumerate().take(5) {
        println!("  [{}] '{}' U+{:04X}", i, c, c as u32);
    }

    // Map โ€” uppercase
    let upper: String = s.chars().map(|c| c.to_uppercase().next().unwrap()).collect();
    println!("{}", upper);

    // Filter โ€” only alphabetic
    let alpha: String = s.chars().filter(|c| c.is_alphabetic()).collect();
    println!("alpha: {}", alpha);

    // Reverse (correct for multi-byte chars!)
    let rev: String = s.chars().rev().collect();
    println!("rev: {}", rev);

    // ROT13
    let rot: String = s.chars().map(|c| {
        if c.is_ascii_alphabetic() {
            let base = if c.is_uppercase() { b'A' } else { b'a' };
            ((c as u8 - base + 13) % 26 + base) as char
        } else { c }
    }).collect();
    println!("rot13: {}", rot);

    println!("nth(2): {:?}", s.chars().nth(2));
}

#[cfg(test)]
mod tests {
    #[test] fn test_count()   { assert_eq!("cafรฉ".chars().count(),4); assert_eq!("cafรฉ".len(),5); }
    #[test] fn test_filter()  { let s:String="Hello123".chars().filter(|c|c.is_ascii_digit()).collect(); assert_eq!(s,"123"); }
    #[test] fn test_rev()     { let s:String="abcde".chars().rev().collect(); assert_eq!(s,"edcba"); }
    #[test] fn test_nth()     { assert_eq!("hello".chars().nth(1),Some('e')); }
}
(* 480. chars() โ€“ OCaml *)
let () =
  let s = "Hello, World! ๐ŸŒ" in
  Printf.printf "byte_len=%d\n" (String.length s);
  String.iter (fun c -> Printf.printf "%c " c) (String.sub s 0 7); print_newline ();
  let upper = String.map Char.uppercase_ascii s in
  Printf.printf "%s\n" upper;
  let alpha = String.concat "" (
    String.to_seq s |> Seq.filter (fun c -> Char.code c < 128 && (c>='a'&&c<='z'||c>='A'&&c<='Z'))
    |> Seq.map (String.make 1) |> List.of_seq) in
  Printf.printf "alpha: %s\n" alpha