480: chars() and Char-Level Operations
Difficulty: 1 Level: Intermediate
Iterate Unicode scalar values correctly, and see why you can't index a Rust string with `[i]`.

The Problem This Solves
In Python, `"café"[3]` gives you `'é'`. In JavaScript, `"café"[3]` gives you `'é'` too (at least for Basic Multilingual Plane characters). It just works.

In Rust, `"café"[3]` is a compile error: you can't index a string with a single integer. Why? Because Rust strings are UTF-8 bytes. `"café"` is 5 bytes, not 4, since the `é` takes 2 bytes. If Rust let you write `s[3]`, what would you get: a byte? Half a character? A panic? The designers decided that if the operation is ambiguous or potentially unsound, it shouldn't compile.

The correct approach: `.chars()` gives you an iterator of Unicode scalar values (Rust's `char`, a 4-byte value covering all of Unicode). Use `.chars().nth(3)` instead of `s[3]`. Use `.chars().count()` instead of `s.len()` when you want the character count. Use `.chars().filter()`, `.chars().map()`, and `.collect()` for character-level transformations.

The Intuition
`chars()` means "iterate this string as a sequence of Unicode scalar values." Each element is a `char`: a 32-bit value representing one Unicode scalar value (emoji such as 🌍 included). Python's `for c in "café"` does the same thing. The difference: Rust makes it explicit that you're iterating chars, not bytes. You can't accidentally iterate bytes when you meant characters.

The workflow for character transformations: iterate with `.chars()`, transform with iterator adapters (`map`, `filter`, `rev`, `enumerate`), then collect back to a `String` with `.collect()`. This is idiomatic, composable, and clear.

Key limitation: `.chars()` doesn't give you grapheme clusters. `"e\u{0301}"` (e + combining accent) is two chars but one user-perceived character. For grapheme clusters, use the `unicode-segmentation` crate.

How It Works in Rust
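The grapheme-cluster limitation can be checked with the standard library alone. The following is a minimal sketch, not part of the original listing; the helper names `scalar_count` and `reverse_scalars` are hypothetical and exist only for illustration. It compares a precomposed `é` (U+00E9) with the decomposed form (`e` followed by U+0301).

```rust
/// Hypothetical helper: number of Unicode scalar values in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical helper: reverses `s` one scalar value at a time.
fn reverse_scalars(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // Precomposed: 'é' is the single scalar value U+00E9.
    let precomposed = "caf\u{00E9}";
    // Decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "cafe\u{0301}";

    // Both render the same, but the scalar counts differ.
    assert_eq!(scalar_count(precomposed), 4);
    assert_eq!(scalar_count(decomposed), 5);

    // Byte lengths: U+00E9 is 2 bytes; 'e' (1 byte) + U+0301 (2 bytes) is 3.
    assert_eq!(precomposed.len(), 5);
    assert_eq!(decomposed.len(), 6);

    // Reversing by char moves the combining accent away from its base letter.
    assert_eq!(reverse_scalars(decomposed), "\u{0301}efac");
}
```

If text may contain combining marks, segmenting by grapheme cluster first (for example with the `unicode-segmentation` crate mentioned above) avoids detaching the accent.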
let s = "Hello, World! 🌍";
// len() = byte count, chars().count() = Unicode scalar count
println!("{} bytes, {} chars", s.len(), s.chars().count());
// 18 bytes, 15 chars (🌍 is 4 bytes, 1 char)
// chars() with enumerate: gives char index + char
for (i, c) in s.chars().enumerate().take(5) {
    println!("[{}] '{}' U+{:04X}", i, c, c as u32);
}
// Map: transform each char, collect back to String
// (flat_map, because to_uppercase() can yield more than one char)
let upper: String = s.chars()
    .flat_map(|c| c.to_uppercase())
    .collect();
// Filter: keep only alphabetic chars
let alpha: String = s.chars()
    .filter(|c| c.is_alphabetic())
    .collect(); // "HelloWorld"
// Reverse: works correctly for multi-byte chars!
// (Reversing the raw bytes instead would break UTF-8.)
let rev: String = s.chars().rev().collect();
// nth: O(n) but correct for Unicode
let third = s.chars().nth(2); // Some('l')
// Can't do: s[2] (compile error)
// Risky: &s[2..3] slices by byte offset and panics if either end is not on a char boundary
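Byte-range slicing can be made safe by deriving the byte offsets from `char_indices()`. The sketch below is one possible approach under that assumption; `char_slice` is a hypothetical helper, not a standard-library function. It also shows `str::is_char_boundary` and `str::get`, which check a byte range without panicking.

```rust
/// Hypothetical helper: the substring of `count` chars starting at char
/// position `start`, or None if the request runs past the end of `s`.
fn char_slice(s: &str, start: usize, count: usize) -> Option<&str> {
    // Byte offset where each char begins, followed by the end of the string.
    let mut boundaries = s
        .char_indices()
        .map(|(byte_offset, _)| byte_offset)
        .chain(std::iter::once(s.len()));
    let begin = boundaries.nth(start)?;
    let end = if count == 0 {
        begin
    } else {
        // nth() continues from where the previous call stopped.
        boundaries.nth(count - 1)?
    };
    Some(&s[begin..end])
}

fn main() {
    let s = "café";
    // 'é' occupies bytes 3 and 4, so byte 4 is not a char boundary.
    assert!(s.is_char_boundary(3));
    assert!(!s.is_char_boundary(4));
    // get() returns None where the indexing form &s[3..4] would panic.
    assert_eq!(s.get(3..4), None);
    assert_eq!(s.get(3..5), Some("é"));
    // Char positions instead of byte positions.
    assert_eq!(char_slice(s, 1, 3), Some("afé"));
    assert_eq!(char_slice(s, 3, 2), None);
    println!("{:?}", char_slice(s, 1, 3));
}
```

Like `nth`, this walks the string from the start, so each call is O(n) in the length of the string.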
What This Unlocks
- Safe string reversal: `.chars().rev().collect()` handles multi-byte characters correctly.
- Unicode-aware transformations: filter emoji, count letters, apply ROT13, all via iterator chains.
- Character-level validation: `s.chars().all(|c| c.is_ascii())` without importing anything.
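The bullets above can be sketched as small helpers built on iterator chains. All three function names are hypothetical, chosen for illustration. The standard library has no emoji predicate, so `ascii_only` drops every non-ASCII char as a crude stand-in for "filter emoji"; it would also drop letters such as `é`.

```rust
/// Counts alphabetic scalar values (Unicode-aware, so 'é' counts as a letter).
fn letter_count(s: &str) -> usize {
    s.chars().filter(|c| c.is_alphabetic()).count()
}

/// True if `s` is non-empty and every char is an ASCII letter, digit, or '_'.
fn is_simple_identifier(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Keeps only ASCII chars. This removes emoji, but also any other
/// non-ASCII char, so it is an approximation rather than an emoji filter.
fn ascii_only(s: &str) -> String {
    s.chars().filter(|c| c.is_ascii()).collect()
}

fn main() {
    let s = "Hello, World! 🌍";
    assert_eq!(letter_count(s), 10);
    assert_eq!(letter_count("café"), 4);
    assert!(is_simple_identifier("snake_case1"));
    assert!(!is_simple_identifier("café"));
    // The space before the emoji is ASCII, so it stays.
    assert_eq!(ascii_only(s), "Hello, World! ");
}
```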
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| Iterate characters | `String.iter s` (bytes in practice) | `s.chars()` (Unicode scalar values) |
| Index by char | `s.[i]` (byte, not char) | `s.chars().nth(i)`, returning `Option<char>` |
| Character count | Manual UTF-8 decode | `s.chars().count()` |
| Map over chars | `String.map f s` | `s.chars().map(f).collect::<String>()` |
| Reverse | Custom loop | `s.chars().rev().collect()` |
| Filter chars | `String.concat "" (List.filter ...)` | `s.chars().filter(\|c\| ...).collect()` |
| Direct indexing | `s.[i]` (unsafe for UTF-8) | Compile error, intentionally disallowed |
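For the "Index by char" row: `nth(i)` decodes the string from the start on every call. When many lookups are needed, one option is to decode once into a `Vec<char>` and index that instead. The sketch below assumes the extra memory (4 bytes per char) is acceptable; `is_palindrome` is a hypothetical helper used only to show repeated indexing.

```rust
/// Hypothetical helper: true if `s` reads the same in both directions,
/// compared one scalar value at a time.
fn is_palindrome(s: &str) -> bool {
    // One O(n) decode; every lookup afterwards is O(1).
    let chars: Vec<char> = s.chars().collect();
    let n = chars.len();
    (0..n / 2).all(|i| chars[i] == chars[n - 1 - i])
}

fn main() {
    let s = "café";
    let chars: Vec<char> = s.chars().collect();
    // Vec<char> supports direct indexing, unlike &str.
    assert_eq!(chars[3], 'é');
    // get() returns None instead of panicking when out of range.
    assert_eq!(chars.get(10), None);
    // Same answer as nth(), which re-walks the string each time.
    assert_eq!(s.chars().nth(3), Some('é'));
    assert!(is_palindrome("été"));
    assert!(!is_palindrome("café"));
}
```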
// 480. chars() and char-level operations
fn main() {
    let s = "Hello, World! 🌍";
    println!("bytes={} chars={}", s.len(), s.chars().count());
    // Enumerate first 5 chars
    for (i, c) in s.chars().enumerate().take(5) {
        println!(" [{}] '{}' U+{:04X}", i, c, c as u32);
    }
    // Map: uppercase (flat_map, since to_uppercase() may yield several chars)
    let upper: String = s.chars().flat_map(|c| c.to_uppercase()).collect();
    println!("{}", upper);
    // Filter: only alphabetic
    let alpha: String = s.chars().filter(|c| c.is_alphabetic()).collect();
    println!("alpha: {}", alpha);
    // Reverse (correct for multi-byte chars!)
    let rev: String = s.chars().rev().collect();
    println!("rev: {}", rev);
    // ROT13: rotate ASCII letters by 13 places, leave everything else as-is
    let rot: String = s.chars().map(|c| {
        if c.is_ascii_alphabetic() {
            let base = if c.is_ascii_uppercase() { b'A' } else { b'a' };
            ((c as u8 - base + 13) % 26 + base) as char
        } else { c }
    }).collect();
    println!("rot13: {}", rot);
    println!("nth(2): {:?}", s.chars().nth(2));
}
#[cfg(test)]
mod tests {
    #[test] fn test_count() { assert_eq!("café".chars().count(), 4); assert_eq!("café".len(), 5); }
    #[test] fn test_filter() { let s: String = "Hello123".chars().filter(|c| c.is_ascii_digit()).collect(); assert_eq!(s, "123"); }
    #[test] fn test_rev() { let s: String = "abcde".chars().rev().collect(); assert_eq!(s, "edcba"); }
    #[test] fn test_nth() { assert_eq!("hello".chars().nth(1), Some('e')); }
}
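One detail of uppercasing by char is worth a separate look: `char::to_uppercase` returns an iterator rather than a single `char`, because some characters expand when uppercased. The sketch below contrasts taking only the first yielded char with flattening the whole iterator; both helper names are hypothetical.

```rust
/// Hypothetical helper: keeps only the first char of each uppercase mapping.
/// The unwrap cannot fail, because to_uppercase() always yields at least one char.
fn upper_first_only(s: &str) -> String {
    s.chars().map(|c| c.to_uppercase().next().unwrap()).collect()
}

/// Hypothetical helper: keeps every char of each uppercase mapping.
fn upper_full(s: &str) -> String {
    s.chars().flat_map(|c| c.to_uppercase()).collect()
}

fn main() {
    // 'ß' uppercases to the two chars "SS".
    assert_eq!(upper_first_only("straße"), "STRASE");
    assert_eq!(upper_full("straße"), "STRASSE");
    // For this input the flattened form agrees with str::to_uppercase.
    assert_eq!(upper_full("straße"), "straße".to_uppercase());
}
```

For plain uppercasing of a whole string, `str::to_uppercase` is the shorter route; the `flat_map` form is useful when the uppercase step sits inside a longer char-level chain.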
(* 480. chars(): OCaml *)
let () =
  let s = "Hello, World! 🌍" in
  Printf.printf "byte_len=%d\n" (String.length s);
  String.iter (fun c -> Printf.printf "%c " c) (String.sub s 0 7); print_newline ();
  let upper = String.map Char.uppercase_ascii s in
  Printf.printf "%s\n" upper;
  let alpha = String.concat "" (
    String.to_seq s |> Seq.filter (fun c -> Char.code c < 128 && (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z'))
    |> Seq.map (String.make 1) |> List.of_seq) in
  Printf.printf "alpha: %s\n" alpha