🦀 Functional Rust

482: Unicode Normalization and Graphemes

Difficulty: 2 Level: Intermediate

Why `"café" != "café"` can be true, and how to handle Unicode correctly in Rust.

The Problem This Solves

Unicode is harder than it looks. The character `é` can be encoded two ways: as the single codepoint `U+00E9` (the composed form that NFC normalization produces), or as the letter `e` followed by the combining acute accent `U+0301` (the decomposed form that NFD produces). Both look identical on screen. Both are valid UTF-8. But they are different byte sequences, and Rust's `==` on strings compares bytes, so `"café" == "café"` returns `false` if one side is composed and the other decomposed.

This matters in practice because text from different sources can arrive in different forms: file names coming from macOS are often decomposed, while text typed on Windows is usually composed. If you compare usernames, file paths, or search terms without normalizing, you get false negatives. The `unicode-normalization` crate handles this; the standard library does not, a deliberate decision to keep it small.

There are surprises within std as well: the flag emoji `🇳🇱` is two Unicode scalar values (regional indicator N + regional indicator L) but one user-visible character. `.chars().count()` returns 2 for the flag, not 1. For true grapheme cluster counting, you need the `unicode-segmentation` crate.
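When two strings look identical but compare unequal, it helps to print what is actually stored. The sketch below uses only std (`str::escape_unicode` and `as_bytes`); the helper name `show_scalars` is made up for this illustration.

```rust
/// Renders every scalar value of `s` as a `\u{..}` escape so that
/// look-alike strings can be told apart. (Hypothetical helper name.)
fn show_scalars(s: &str) -> String {
    s.escape_unicode().to_string()
}

fn main() {
    let nfc = "caf\u{00E9}";  // composed é
    let nfd = "cafe\u{0301}"; // e + combining acute accent

    // Same appearance on screen, different scalar values...
    println!("{}", show_scalars(nfc)); // \u{63}\u{61}\u{66}\u{e9}
    println!("{}", show_scalars(nfd)); // \u{63}\u{61}\u{66}\u{65}\u{301}

    // ...and therefore different UTF-8 bytes, which is what `==` compares.
    println!("{:?}", nfc.as_bytes()); // [99, 97, 102, 195, 169]
    println!("{:?}", nfd.as_bytes()); // [99, 97, 102, 101, 204, 129]
}
```

Dumping the escapes like this is a quick way to check whether a failing comparison is a normalization problem or something else entirely.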

The Intuition

Think of Unicode as having three levels:

1. Bytes: raw UTF-8 storage (what `s.len()` measures)
2. Scalar values: what `s.chars()` iterates (Rust's `char` type, U+0000 to U+10FFFF excluding the surrogate range U+D800 to U+DFFF)
3. Grapheme clusters: what users see as characters (requires an external crate)

NFC vs NFD is a level 2 difference: the same user-visible character, different scalar value sequences. Different scalar values encode to different bytes, and Rust's `==` operates at level 1 (bytes), so it distinguishes NFC from NFD.

For case-insensitive ASCII comparison, std has `.eq_ignore_ascii_case()`. For full Unicode case folding, you need an external crate such as `unicode-casefold`.

The practical rule: pick one form (usually NFC), normalize text once on input if you cannot trust its source, and compare bytes from then on. A normalization crate earns its place when you are building search engines, identifier systems, or any comparison of user-supplied text.
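The gap between level 1 and level 2 shows up as soon as you index or slice a string, because `&str` indices are byte offsets. The following is a minimal std-only sketch; `levels` and `take_chars` are hypothetical helper names, and nothing here addresses level 3 (a decomposed `é` would still be split between its base letter and its accent).

```rust
/// Byte length (level 1) and scalar-value count (level 2) of `s`.
fn levels(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Returns the first `n` scalar values of `s` without cutting a UTF-8
/// sequence in half. Works at level 2 only, not on grapheme clusters.
fn take_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // `s` has at most `n` chars: return all of it
    }
}

fn main() {
    let s = "caf\u{00E9}!"; // é occupies bytes 3 and 4
    println!("{:?}", levels(s)); // (6, 5)

    // Byte offset 4 falls inside é, so it is not a valid slice boundary.
    println!("{}", s.is_char_boundary(4)); // false
    println!("{:?}", s.get(0..4));         // None (`&s[0..4]` would panic)
    println!("{:?}", s.get(0..5));         // Some("café")

    println!("{}", take_chars(s, 4)); // café
}
```

Using `get` or `char_indices` instead of raw byte ranges keeps slicing panic-free on non-ASCII input.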

How It Works in Rust

// Two ways to write "café" — look the same, different bytes
let nfc = "caf\u{00E9}";          // U+00E9: single codepoint é
let nfd = "caf\u{0065}\u{0301}"; // U+0065 + U+0301: e + combining accent

println!("NFC: {} bytes, {} chars", nfc.len(), nfc.chars().count()); // 5, 4
println!("NFD: {} bytes, {} chars", nfd.len(), nfd.chars().count()); // 6, 5

// Byte comparison — these are NOT equal!
println!("{}", nfc == nfd);  // false

// Case-insensitive comparison (ASCII only, in std)
assert!("Hello".eq_ignore_ascii_case("HELLO"));
assert!(!"café".eq_ignore_ascii_case("CAFÉ")); // only ASCII letters are case-folded

// Check if all ASCII (no multi-byte chars)
assert!("hello".is_ascii());
assert!(!"café".is_ascii());

// Emoji: chars().count() counts scalar values, not graphemes
let flag = "\u{1F1F3}\u{1F1F1}";  // 🇳🇱 — two scalar values, one grapheme
println!("{} bytes, {} chars", flag.len(), flag.chars().count()); // 8, 2

// Unicode category predicates on char
for c in "Hello 42 !".chars() {
    if c.is_alphabetic() { /* letter in any script */ }
    if c.is_numeric()    { /* Unicode numeric: digits, but also e.g. '¾' */ }
    if c.is_whitespace() { /* Unicode whitespace: space, tab, newline, ... */ }
    if c.is_uppercase()  { /* A-Z + Unicode uppercase */ }
}

// For normalization: use unicode-normalization crate
// use unicode_normalization::UnicodeNormalization;
// let normalized = nfd.nfc().collect::<String>();
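To make the idea of "normalize, then compare" concrete without an external dependency, here is a deliberately tiny std-only sketch. It is not an implementation of NFC: the four-entry table and the names `compose_pair` and `toy_compose` are assumptions made up for this illustration, whereas real normalization covers all of Unicode and also reorders combining marks. For real input, use the `unicode-normalization` crate as in the commented lines above.

```rust
/// Toy composition table: (base letter, combining mark) -> precomposed char.
/// Only four hand-picked pairs; real NFC is driven by the Unicode data tables.
fn compose_pair(base: char, mark: char) -> Option<char> {
    match (base, mark) {
        ('e', '\u{0301}') => Some('\u{00E9}'), // é
        ('E', '\u{0301}') => Some('\u{00C9}'), // É
        ('a', '\u{0300}') => Some('\u{00E0}'), // à
        ('u', '\u{0308}') => Some('\u{00FC}'), // ü
        _ => None,
    }
}

/// Replaces each known base+mark pair with its precomposed form.
/// Everything else is copied through unchanged.
fn toy_compose(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        // Look one char ahead: does it combine with `c` according to the table?
        let composed = chars.peek().and_then(|&m| compose_pair(c, m));
        match composed {
            Some(p) => {
                chars.next(); // consume the combining mark
                out.push(p);
            }
            None => out.push(c),
        }
    }
    out
}

fn main() {
    let nfc = "caf\u{00E9}";
    let nfd = "cafe\u{0301}";

    println!("{}", nfc == nfd);              // false: raw bytes differ
    println!("{}", toy_compose(nfd) == nfc); // true: same form after composing
    println!("{}", toy_compose(nfc) == nfc); // true: composed input is unchanged
}
```

The point of the sketch is the shape of the fix: bring both sides to one form first, then let the ordinary byte comparison do its job.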

What This Unlocks

Key Differences

| Concept | OCaml | Rust |
|---|---|---|
| String equality | Byte comparison | Byte comparison (NFC ≠ NFD) |
| Character count | Manual UTF-8 decode loop | `.chars().count()` (scalar values) |
| Grapheme count | Uuseg library | `unicode-segmentation` crate |
| Case-insensitive eq | Manual | `.eq_ignore_ascii_case()` (ASCII only) |
| Normalization | Uunf library | `unicode-normalization` crate |
| Is ASCII? | Manual check | `s.is_ascii()` |
| Char predicates | `Char.code c < 128` etc. | `c.is_alphabetic()`, `.is_numeric()`, etc. |

// 482. Unicode normalization and graphemes
fn main() {
    let s = "café";
    println!("bytes={} chars={}", s.len(), s.chars().count());

    // Two ways to write café:
    let nfc = "caf\u{00E9}";          // NFC: single codepoint é
    let nfd = "caf\u{0065}\u{0301}"; // NFD: e + combining accent
    println!("NFC bytes={} chars={}", nfc.len(), nfc.chars().count());
    println!("NFD bytes={} chars={}", nfd.len(), nfd.chars().count());
    println!("NFC==NFD as &str: {}", nfc == nfd); // false! different bytes
    println!("NFC chars: {:?}", nfc.chars().collect::<Vec<_>>());
    println!("NFD chars: {:?}", nfd.chars().collect::<Vec<_>>());

    // Case-insensitive comparison (ASCII only in std)
    let a = "Hello"; let b = "HELLO";
    println!("eq_ignore_ascii: {}", a.eq_ignore_ascii_case(b));

    // Check if string is ASCII
    println!("is_ascii: {}", "hello".is_ascii());
    println!("is_ascii: {}", "café".is_ascii());

    // Emoji: one grapheme cluster = 2 scalar values (flag emoji)
    let flag = "\u{1F1F3}\u{1F1F1}"; // NL flag
    println!("flag bytes={} chars={}", flag.len(), flag.chars().count());
    // For grapheme cluster count: use unicode-segmentation crate
    // (not available in std; would give count=1 for the flag)

    // Unicode category checks
    for c in "Hello 42 !".chars() {
        let kind = if c.is_alphabetic() {
            "alpha"
        } else if c.is_numeric() {
            "num"
        } else {
            "other"
        };
        print!("{}:{} ", c, kind);
    }
    println!();
}

#[cfg(test)]
mod tests {
    #[test] fn test_nfc_nfd()   { assert_ne!("caf\u{00E9}", "caf\u{0065}\u{0301}"); }
    #[test] fn test_ascii_eq()  { assert!("hello".eq_ignore_ascii_case("HELLO")); }
    #[test] fn test_is_ascii()  { assert!("hello".is_ascii()); assert!(!"café".is_ascii()); }
    #[test] fn test_emoji()     { let e="\u{1F600}"; assert_eq!(e.len(),4); assert_eq!(e.chars().count(),1); }
}
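The Rust listing above only folds ASCII letters when comparing case-insensitively. std does also offer Unicode-aware `to_lowercase()`, which gets closer but is still not full case folding and does no normalization. The sketch below shows both the gain and the limits; `eq_ignore_case_simple` is a hypothetical helper name, not a std function.

```rust
/// Case-insensitive comparison by lowercasing both sides with std.
/// An approximation of case folding: it allocates two Strings and
/// does not normalize. (Hypothetical helper name.)
fn eq_ignore_case_simple(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // ASCII-only folding cannot match é with É...
    println!("{}", "caf\u{00E9}".eq_ignore_ascii_case("CAF\u{00C9}")); // false
    // ...but Unicode-aware lowercasing can.
    println!("{}", eq_ignore_case_simple("caf\u{00E9}", "CAF\u{00C9}")); // true

    // Limit 1: ß lowercases to itself, "SS" lowercases to "ss".
    // Full case folding would treat these as equal.
    println!("{}", eq_ignore_case_simple("stra\u{00DF}e", "STRASSE")); // false

    // Limit 2: composed vs decomposed input still differs after lowercasing.
    println!("{}", eq_ignore_case_simple("caf\u{00E9}", "CAFE\u{0301}")); // false
}
```

For identifiers or search, the usual approach is to combine a case-folding crate with normalization rather than rely on lowercasing alone.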
(* 482. Unicode: OCaml, standard library only *)
(* Full Unicode support needs external libraries: uutf (decoding), uunf (normalization), uuseg (segmentation) *)
(* This demonstration uses plain strings and manual UTF-8 decoding *)
let () =
  let s = "caf\xc3\xa9" in  (* café in UTF-8 bytes *)
  Printf.printf "byte_len=%d\n" (String.length s);

  (* Count Unicode code points manually by reading UTF-8 lead bytes;
     assumes the input is well-formed UTF-8 *)
  let count_codepoints s =
    let n = String.length s in
    let count = ref 0 and i = ref 0 in
    while !i < n do
      let b = Char.code s.[!i] in
      let len = if b land 0x80 = 0 then 1
                else if b land 0xE0 = 0xC0 then 2
                else if b land 0xF0 = 0xE0 then 3
                else 4 in
      incr count; i := !i + len
    done;
    !count
  in
  Printf.printf "codepoints=%d\n" (count_codepoints s);

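  (* Structural UTF-8 check: every lead byte must be followed by the right
     number of continuation bytes (10xxxxxx). Overlong encodings and
     surrogates are not rejected; use Uutf for strict validation. *)
  let is_valid s =
    let n = String.length s in
    let is_cont j = j < n && Char.code s.[j] land 0xC0 = 0x80 in
    let i = ref 0 and ok = ref true in
    while !i < n && !ok do
      let b = Char.code s.[!i] in
      let len = if b land 0x80 = 0 then 1
                else if b land 0xE0 = 0xC0 then 2
                else if b land 0xF0 = 0xE0 then 3
                else if b land 0xF8 = 0xF0 then 4
                else 0 in
      if len = 0 then ok := false  (* not a valid lead byte *)
      else begin
        for k = 1 to len - 1 do
          if not (is_cont (!i + k)) then ok := false
        done;
        i := !i + len
      end
    done; !ok
  in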
  Printf.printf "valid=%b\n" (is_valid s)