482 Fundamental

String Unicode

The Problem

Unicode defines multiple ways to represent the same visual character: é can be a single precomposed codepoint (U+00E9) or a base letter e (U+0065) followed by a combining acute accent (U+0301). The two sequences render identically but compare unequal as byte strings, so web forms, databases, and search engines must normalise Unicode before comparison. Length is just as slippery: the emoji U+1F600 (written \u{1F600} in Rust) occupies 4 bytes in UTF-8, so a naive len() returns 4, not 1. Correct Unicode handling requires understanding NFC/NFD normalisation, grapheme clusters, and the difference between bytes, codepoints, and user-perceived characters.
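
The gap between bytes and codepoints can be checked with the standard library alone. The following is a minimal sketch; the helper name measure is hypothetical and not part of the original example.

```rust
/// Returns (UTF-8 byte length, number of Unicode scalar values).
/// `measure` is a hypothetical helper introduced for illustration.
fn measure(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{00E9}"; // é as one codepoint
    let decomposed = "e\u{0301}"; // e followed by a combining acute accent
    let emoji = "\u{1F600}";

    println!("{:?}", measure(precomposed)); // (2, 1)
    println!("{:?}", measure(decomposed)); // (3, 2)
    println!("{:?}", measure(emoji)); // (4, 1)

    // Same rendering, different bytes: == is false without normalisation.
    println!("{}", precomposed == decomposed); // false
}
```

Neither number is a count of user-perceived characters; that needs grapheme cluster segmentation, which the standard library does not provide.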

🎯 Learning Outcomes

• Understand that NFC and NFD representations of the same character compare unequal

• Use eq_ignore_ascii_case for case-insensitive ASCII comparison without allocation

• Check ASCII-only strings with str::is_ascii()

• Understand emoji encoding: 4 UTF-8 bytes, 1 char, 1 grapheme cluster

• Recognise when the unicode-normalization crate is needed for correct comparison

Code Example

#![allow(clippy::all)]
// 482. Unicode normalization and graphemes

#[cfg(test)]
mod tests {
    #[test]
    fn test_nfc_nfd() {
        assert_ne!("caf\u{00E9}", "caf\u{0065}\u{0301}");
    }
    #[test]
    fn test_ascii_eq() {
        assert!("hello".eq_ignore_ascii_case("HELLO"));
    }
    #[test]
    fn test_is_ascii() {
        assert!("hello".is_ascii());
        assert!(!"café".is_ascii());
    }
    #[test]
    fn test_emoji() {
        let e = "\u{1F600}";
        assert_eq!(e.len(), 4);
        assert_eq!(e.chars().count(), 1);
    }
}

(* 482. Unicode – OCaml with Uutf *)
(* Requires: ocamlfind + uutf / uunf for full Unicode support *)
(* Using standard string for demonstration *)
let () =
  let s = "caf\xc3\xa9" in  (* café in UTF-8 bytes *)
  Printf.printf "byte_len=%d\n" (String.length s);

  (* Count Unicode code points manually for UTF-8 *)
  let count_codepoints s =
    let n = String.length s in
    let count = ref 0 and i = ref 0 in
    while !i < n do
      let b = Char.code s.[!i] in
      let len = if b land 0x80 = 0 then 1
                else if b land 0xE0 = 0xC0 then 2
                else if b land 0xF0 = 0xE0 then 3
                else 4 in
      incr count; i := !i + len
    done;
    !count
  in
  Printf.printf "codepoints=%d\n" (count_codepoints s);

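  (* Check UTF-8 structure: each lead byte must be followed by the right number
     of continuation bytes (10xxxxxx); overlong forms are not rejected *)
  let is_valid s =
    let n = String.length s in
    let is_cont j = j < n && Char.code s.[j] land 0xC0 = 0x80 in
    let i = ref 0 and ok = ref true in
    while !i < n && !ok do
      let b = Char.code s.[!i] in
      let len = if b land 0x80=0 then 1 else if b land 0xE0=0xC0 then 2
                else if b land 0xF0=0xE0 then 3 else if b land 0xF8=0xF0 then 4
                else 0 in
      if len = 0 then ok := false
      else for j = !i + 1 to !i + len - 1 do
        if not (is_cont j) then ok := false
      done;
      i := !i + max len 1
    done; !ok
  in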
  Printf.printf "valid=%b\n" (is_valid s)
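
For comparison with the hand-written OCaml check above, Rust can validate a byte slice with std::str::from_utf8, which also rejects truncated sequences and overlong encodings. This is a small sketch; is_valid_utf8 is a hypothetical wrapper name.

```rust
/// Hypothetical helper: true if `bytes` is well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    let cafe = b"caf\xc3\xa9"; // "café" encoded as UTF-8
    let truncated = b"caf\xc3"; // lead byte with no continuation byte
    let overlong = b"\xc0\xaf"; // overlong encoding of '/', not valid UTF-8

    println!("valid={}", is_valid_utf8(cafe)); // valid=true
    println!("valid={}", is_valid_utf8(truncated)); // valid=false
    println!("valid={}", is_valid_utf8(overlong)); // valid=false

    // Once the bytes are validated, counting codepoints is chars().count().
    let s = std::str::from_utf8(cafe).expect("valid UTF-8");
    println!("byte_len={} codepoints={}", s.len(), s.chars().count());
}
```

Because &str is guaranteed to hold valid UTF-8, the validation happens once at the boundary and the rest of the program never needs to repeat it.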

Key Differences

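Standard Unicode properties: Rust's char::is_alphabetic() uses the Unicode Alphabetic property. OCaml's standard Char module has no alphabetic predicate; Char.is_alpha comes from Jane Street's Base library and is ASCII-only, so Unicode properties need the uucp library.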

NFC/NFD in stdlib: Rust delegates normalisation to unicode-normalization crate; OCaml delegates to uunf — neither includes it in the standard library.

eq_ignore_ascii_case: Rust has this in the standard library and it does not allocate; OCaml needs String.lowercase_ascii on both sides followed by String.equal, which allocates two new strings.

Emoji byte count: Both languages store U+1F600 as a 4-byte UTF-8 sequence, and both count it as 1 codepoint (.chars().count() in Rust, a Uutf fold in OCaml). Counting grapheme clusters requires an external library in both: unicode-segmentation for Rust, Uuseg for OCaml.
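
Two of the points above can be observed with the Rust standard library alone. The sketch below contrasts Unicode-aware and ASCII-only classification, then shows a flag emoji (chosen here as an illustration, not taken from the original example) whose codepoint count is already larger than its grapheme count.

```rust
fn main() {
    // Unicode-aware versus ASCII-only classification of the same char.
    let e_acute = '\u{00E9}';
    println!("{}", e_acute.is_alphabetic()); // true
    println!("{}", e_acute.is_ascii_alphabetic()); // false

    // A flag emoji is two regional-indicator codepoints drawn as one glyph,
    // so chars().count() reports 2 even though a reader sees one character.
    let flag = "\u{1F1E9}\u{1F1EA}";
    println!("bytes={} chars={}", flag.len(), flag.chars().count()); // bytes=8 chars=2
}
```

Getting 1 for the flag requires grapheme cluster segmentation, which is where unicode-segmentation comes in.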

OCaml Approach

OCaml's standard library has no Unicode normalisation. String.equal is byte equality. Case-insensitive comparison requires String.lowercase_ascii (ASCII-only) or, for full Unicode, per-codepoint case folding with Uucp.Case.Fold.fold:

String.equal
  (String.lowercase_ascii "Hello")
  (String.lowercase_ascii "HELLO")  (* true *)

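(* Unicode normalisation via uunf; decoding and encoding via uutf *)
let nfc s =
  let buf = Buffer.create (String.length s * 3) in
  let norm = Uunf.create `NFC in
  (* feed one item to the normaliser, then drain every codepoint it returns *)
  let rec add v = match Uunf.add norm v with
    | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; add `Await
    | `Await | `End -> () in
  let feed () _ = function
    | `Malformed _ -> add (`Uchar Uutf.u_rep)
    | `Uchar _ as u -> add u in
  Uutf.String.fold_utf_8 feed () s; add `End; Buffer.contents buf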
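
The same ASCII-only limitation shows up on the Rust side. As a rough sketch, the hypothetical helper eq_lowercase below compares after Unicode lowercasing; it is an assumption for illustration, it allocates, and it is not full case folding.

```rust
/// Hypothetical helper: compare after Unicode lowercasing.
/// Allocates two Strings and is not full case folding
/// (it does not equate "ß" with "SS", for instance).
fn eq_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // ASCII letters are folded without allocation.
    println!("{}", "Hello".eq_ignore_ascii_case("HELLO")); // true
    // É (U+00C9) and é (U+00E9) are outside ASCII, so they are left as-is.
    println!("{}", "\u{00C9}".eq_ignore_ascii_case("\u{00E9}")); // false
    // Unicode lowercasing does match them.
    println!("{}", eq_lowercase("\u{00C9}", "\u{00E9}")); // true
}
```

Lowercasing both sides is a reasonable first approximation, but cases like ß versus SS are why a dedicated case-folding library is still needed for robust caseless matching.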

Exercises

NFC normalise and compare: Write unicode_eq(a: &str, b: &str) -> bool that normalises both strings to NFC (using unicode-normalization) before comparing.

Emoji counter: Write count_emoji(s: &str) -> usize that counts characters in Unicode general category So (Other Symbol). Use !c.is_ascii() as a cheap pre-filter, then look up the category with the unicode-properties crate.

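Case folding: Use the caseless crate to implement case_fold_eq(a: &str, b: &str) -> bool that handles Unicode case-folding edge cases such as ß versus SS. Note that Unicode default case folding is locale-independent, so the Turkish dotless-i rules are not applied automatically and need separate handling.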

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.
