🦀 Functional Rust

483: UTF-8 Encoding Patterns

Difficulty: 1 Level: Intermediate

Rust strings are always UTF-8. Understanding that invariant unlocks safe, efficient text handling.

The Problem This Solves

Many languages let you store arbitrary bytes in strings and discover encoding problems only at runtime, such as a `UnicodeDecodeError` in Python or a garbled web page from a Latin-1 mismatch in older Java code. These bugs are subtle: code works fine on English text, then breaks on an é or a Japanese character.

Rust enforces UTF-8 at the type level. `str` and `String` guarantee valid UTF-8. Safe code can't accidentally create an invalid string, because the constructors check at the boundary. If you receive bytes from the outside world and they're not valid UTF-8, you find out immediately through a `Result`, not silently later.

When you genuinely need other encodings (reading Latin-1 files, talking to Windows APIs, parsing network protocols), you convert explicitly at the boundary and work with `String` inside your program.

The Intuition

Think of UTF-8 as the contract your application enforces internally. Anything that comes in from outside (file bytes, network packets, OS strings) must pass through a border check. Valid UTF-8 enters as `String`. Invalid bytes stay as `Vec<u8>` or `[u8]` until you decide how to convert them. An encoding bug can't slip past that check and quietly corrupt data downstream.
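The border check can be sketched as a single function. This is a minimal illustration, not part of the original example: the name `admit` is hypothetical, and the choice to hand the raw bytes back on failure is one assumption among several reasonable designs.

```rust
/// Hypothetical boundary helper: accept bytes from outside and return either
/// a validated `String` or the untouched bytes for the caller to handle.
fn admit(raw: Vec<u8>) -> Result<String, Vec<u8>> {
    // `String::from_utf8` validates in place without copying. On failure the
    // error still owns the original buffer, so nothing is lost.
    String::from_utf8(raw).map_err(|e| e.into_bytes())
}

fn main() {
    // "café" encoded as UTF-8: passes the check.
    let ok = admit(vec![0x63, 0x61, 0x66, 0xC3, 0xA9]);
    // "café" encoded as Latin-1: 0xE9 alone is not valid UTF-8.
    let bad = admit(vec![0x63, 0x61, 0x66, 0xE9]);

    println!("{:?}", ok);
    println!("{:?}", bad);
}
```

Returning the bytes in the `Err` case keeps the decision about fallback decoding (lossy, another encoding, or rejection) with the caller.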

How It Works in Rust

1. Validate bytes as UTF-8:
    let bytes: &[u8] = b"caf\xc3\xa9"; // "café" in UTF-8
    let s = std::str::from_utf8(bytes)?; // returns &str or Utf8Error
2. Lossy conversion (invalid sequences become `U+FFFD`):
    let s = String::from_utf8_lossy(bytes); // Cow<str>
3. Byte access, for when you need the raw bytes back:
    let bytes: &[u8] = s.as_bytes();
4. Character iteration over Unicode scalar values, not bytes:
    for ch in "café".chars() { /* ch is a char (Unicode scalar value) */ }
5. Other encodings, via the `encoding_rs` crate (Latin-1, UTF-16, Shift-JIS, etc.):
    let (decoded, _enc, had_errors) = encoding_rs::WINDOWS_1252.decode(latin1_bytes);
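The steps above can be combined into one runnable sketch that uses only the standard library. The helper names (`validate`, `lossy`, `latin1_to_string`) are illustrative assumptions. The last helper stands in for step 5 only in the narrow case of strict ISO-8859-1, where each byte value equals its Unicode code point; it does not reproduce the Windows-1252 mapping that `encoding_rs` applies to bytes 0x80..=0x9F.

```rust
use std::borrow::Cow;

/// Step 1: strict validation. On failure, report how many leading bytes were valid.
fn validate(bytes: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

/// Step 2: lossy decoding. Borrows when the input is already valid and
/// allocates only when a replacement character has to be inserted.
fn lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Step 5 without a crate, for strict ISO-8859-1 only (an assumption):
/// every byte maps to the code point with the same numeric value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let good = b"caf\xc3\xa9"; // UTF-8
    let bad = b"caf\xe9!"; // Latin-1 bytes, not valid UTF-8

    println!("{:?}", validate(good));
    println!("{:?}", validate(bad));
    println!("{}", lossy(bad));

    // Steps 3 and 4: byte length and scalar-value count differ for non-ASCII text.
    let s = validate(good).unwrap();
    println!("{} bytes, {} chars", s.as_bytes().len(), s.chars().count());

    println!("{}", latin1_to_string(b"caf\xe9"));
}
```

Returning `valid_up_to()` from the strict path is a design choice for the sketch: it lets a caller keep the valid prefix and decide separately what to do with the remaining bytes.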

What This Unlocks

Key Differences

| Concept | OCaml | Rust |
|---|---|---|
| String encoding | Bytes (no guarantee) | Guaranteed UTF-8 |
| Validation | Manual | `str::from_utf8()` at boundary |
| Char type | `char` (one byte, as stored in `Bytes`) | `char` (Unicode scalar value, 4 bytes) |
| Other encodings | `Uutf` library | `encoding_rs` crate |
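The "4 bytes" in the table refers to the in-memory size of a Rust `char`, not to its UTF-8 encoding, which varies from 1 to 4 bytes. A small sketch of the distinction follows; the helper name `utf8_widths` is illustrative.

```rust
use std::mem::size_of;

/// UTF-8 width of every char in `s` (helper name is illustrative).
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // In memory, every `char` has the same fixed size.
    println!("size_of::<char>() = {}", size_of::<char>());

    // Encoded as UTF-8, the same values take between 1 and 4 bytes.
    let s = "Aé中🌍";
    println!("widths = {:?}", utf8_widths(s));
    println!("{} bytes, {} chars", s.len(), s.chars().count());

    // Surrogate code points are not Unicode scalar values, so no `char` exists for them.
    println!("{:?}", char::from_u32(0xD800));
}
```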
// 483. UTF-8 encoding patterns
fn main() {
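    // char → UTF-8 bytes
    let mut buf = [0u8; 4];
    let c = 'é';
    // Keep only the length: holding on to the returned `&mut str` would keep
    // `buf` mutably borrowed while it is read in the println! below.
    let len = c.encode_utf8(&mut buf).len();
    println!("'{}' encodes to {:?} ({} bytes)", c, &buf[..len], len);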

    // Emoji
    let emoji = '🌍';
    let n = emoji.encode_utf8(&mut buf).len();
    println!("'{}' encodes to {} bytes", emoji, n);

    // Each char of a string as UTF-8 bytes
    for c in "café".chars() {
        let n = c.encode_utf8(&mut buf).len();
        println!("  '{}' → {:02x?}", c, &buf[..n]);
    }

    // Validate &[u8] as UTF-8
    let valid_bytes: &[u8] = &[104, 101, 108, 108, 111];
    let invalid_bytes: &[u8] = &[104, 0xFF, 111];
    println!("valid: {:?}", std::str::from_utf8(valid_bytes));
    println!("invalid: {}", std::str::from_utf8(invalid_bytes).is_err());

    // BOM handling
    let with_bom = "\u{FEFF}Hello";
    let stripped = with_bom.strip_prefix('\u{FEFF}').unwrap_or(with_bom);
    println!("stripped BOM: '{}'", stripped);

    // Char length in bytes
    for c in ['A', 'é', '中', '🌍'] {
        println!("  '{}' → {} bytes", c, c.len_utf8());
    }
}

#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }

    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }

    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}
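The BOM handling in the listing works on an already-validated `&str`. When the input is still raw bytes, the same check can be done before validation. The sketch below assumes a hypothetical helper named `decode_without_bom`; the three bytes are the UTF-8 encoding of U+FEFF.

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical helper: drop a leading UTF-8 BOM if present, then validate the rest.
fn decode_without_bom(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    let body = if bytes.starts_with(&UTF8_BOM) {
        &bytes[UTF8_BOM.len()..]
    } else {
        bytes
    };
    std::str::from_utf8(body)
}

fn main() {
    let with_bom = b"\xEF\xBB\xBFHello";
    let without_bom = b"Hello";

    println!("{:?}", decode_without_bom(with_bom));
    println!("{:?}", decode_without_bom(without_bom));
}
```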
(* 483. UTF-8 encoding – OCaml *)
let encode_utf8 codepoint =
  if codepoint < 0x80 then
    Bytes.create 1 |> (fun b -> Bytes.set b 0 (Char.chr codepoint); b)
  else if codepoint < 0x800 then
    let b = Bytes.create 2 in
    Bytes.set b 0 (Char.chr (0xC0 lor (codepoint lsr 6)));
    Bytes.set b 1 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
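  else if codepoint < 0x10000 then
    let b = Bytes.create 3 in
    Bytes.set b 0 (Char.chr (0xE0 lor (codepoint lsr 12)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
  else
    (* 4-byte form for U+10000..U+10FFFF; the 3-byte formula would overflow Char.chr here *)
    let b = Bytes.create 4 in
    Bytes.set b 0 (Char.chr (0xF0 lor (codepoint lsr 18)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 12) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 3 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b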

let () =
  let e = encode_utf8 0xE9 in  (* é *)
  Printf.printf "é bytes: %02x %02x\n" (Char.code (Bytes.get e 0)) (Char.code (Bytes.get e 1));
  let s = "caf\xc3\xa9" in
  Printf.printf "valid UTF-8 test: byte_len=%d\n" (String.length s)
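For comparison, the bit layout that the OCaml version builds by hand can be written in Rust, extended to the four-byte form, and checked against the standard library. This is an illustrative sketch only: `manual_utf8` is a hypothetical name, and real code would normally call `char::encode_utf8` directly.

```rust
/// Hand-rolled UTF-8 encoder, for illustration only.
/// Because the input is a `char`, surrogates and values above U+10FFFF
/// cannot occur, so no range validation is needed.
fn manual_utf8(c: char) -> Vec<u8> {
    let cp = c as u32;
    match cp {
        // 0xxxxxxx
        0x0000..=0x007F => vec![cp as u8],
        // 110xxxxx 10xxxxxx
        0x0080..=0x07FF => vec![
            0xC0 | (cp >> 6) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        // 1110xxxx 10xxxxxx 10xxxxxx
        0x0800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        _ => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    let mut buf = [0u8; 4];
    for c in ['A', 'é', '中', '🌍'] {
        let expected = c.encode_utf8(&mut buf).as_bytes().to_vec();
        let got = manual_utf8(c);
        println!("'{}' manual={:02x?} std={:02x?}", c, got, expected);
        assert_eq!(got, expected);
    }
}
```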