483: UTF-8 Encoding Patterns
Difficulty: 1 Level: Intermediate
Rust strings are always UTF-8. Understanding that invariant unlocks safe, efficient text handling.
The Problem This Solves
Many languages let you store arbitrary bytes in strings and discover encoding problems only at runtime: a `UnicodeDecodeError` in Python, a garbled web page from a Latin-1 mismatch in older Java code. These bugs are subtle: code works fine in English, then breaks on an é or a Japanese character.
Rust enforces UTF-8 at the type level. `str` and `String` guarantee valid UTF-8. In safe code you can't accidentally create an invalid string, because the constructors check at the boundary. If you receive bytes from the outside world and they're not valid UTF-8, you find out immediately with a `Result`, not silently later.
When you genuinely need other encodings (reading Latin-1 files, speaking to Windows APIs, parsing network protocols), you convert explicitly at the boundary and work with `String` inside your program.
The Intuition
Think of UTF-8 as the contract your application enforces internally. Anything that comes in from outside (file bytes, network packets, OS strings) must pass through a border check. Valid UTF-8 enters as `String` or `&str`. Invalid bytes stay as `Vec<u8>` or `[u8]` until you decide how to convert them. Once text has crossed the border, an encoding bug cannot quietly corrupt data downstream.
How It Works in Rust
1. Validate bytes as UTF-8:
let bytes: &[u8] = b"caf\xc3\xa9"; // "café" in UTF-8
let s = std::str::from_utf8(bytes)?; // Ok(&str) or Err(Utf8Error); `?` needs an enclosing function that returns a compatible Result
2. Lossy conversion (replace invalid sequences with `U+FFFD`):
let s = String::from_utf8_lossy(bytes); // Cow<str>: borrowed if the input was already valid
3. Byte access (when you need raw bytes back):
let bytes: &[u8] = s.as_bytes();
4. Character iteration (iterate Unicode scalar values, not bytes):
for ch in "café".chars() { /* ch is a char (Unicode scalar value) */ }
5. Other encodings (use the `encoding_rs` crate for Latin-1, UTF-16, Shift-JIS, etc.):
// encoding_rs follows the WHATWG Encoding Standard, which decodes Latin-1 labels as windows-1252
let (decoded, _enc, had_errors) = encoding_rs::WINDOWS_1252.decode(latin1_bytes);
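The steps above can be combined into one small, std-only sketch. The helper names `decode_strict`, `decode_lossy`, and `latin1_to_string` are hypothetical and exist only for illustration. The Latin-1 helper is a stand-in for step 5 that assumes true ISO-8859-1 (each byte maps to the code point with the same value), so for bytes 0x80 to 0x9F it will not match what `encoding_rs::WINDOWS_1252` produces.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Strict border check: valid UTF-8 comes back as &str, anything else as an error.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lenient border check: each invalid sequence is replaced with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical std-only stand-in for step 5.
/// Assumes ISO-8859-1, where byte N is exactly code point U+00NN.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let good: &[u8] = b"caf\xc3\xa9"; // "café" encoded as UTF-8
    let bad: &[u8] = b"caf\xe9"; // "café" encoded as Latin-1, not valid UTF-8

    // Steps 1, 3 and 4: validate, then look at bytes and chars.
    let s = decode_strict(good).expect("valid UTF-8");
    println!("{}: {} bytes, {} chars", s, s.as_bytes().len(), s.chars().count());

    // The same check rejects the Latin-1 bytes instead of guessing.
    match decode_strict(bad) {
        Ok(s) => println!("unexpectedly valid: {}", s),
        Err(e) => println!("rejected: {}", e),
    }

    // Step 2: lossy conversion keeps going but marks the damage.
    println!("lossy: {}", decode_lossy(bad));

    // Step 5 (stand-in): an explicit conversion from the real source encoding.
    println!("as Latin-1: {}", latin1_to_string(bad));
}
```

The point of keeping three separate functions is that the caller chooses the policy at the boundary: fail, repair with a visible marker, or convert from a known encoding.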
What This Unlocks
- Encoding safety: encoding bugs become compiler/type errors or explicit `Result`s, not silent corruption.
- Zero-cost string views: `&str` is just a pointer + length into UTF-8 bytes; slicing is O(1), with indices given as byte offsets that must fall on character boundaries.
- Interop confidence: knowing exactly where to convert means your UTF-8 core stays clean.
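The O(1) slicing mentioned above works on byte offsets, which is easy to trip over with multi-byte characters. The sketch below shows one way to slice without panicking; `prefix_bytes` and `prefix_chars` are hypothetical helper names, not standard library functions.

```rust
/// Slice the first `n` bytes in O(1).
/// Returns None if `n` is out of range or falls inside a multi-byte character.
fn prefix_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

/// Slice the first `n` characters.
/// This has to walk the string to find the byte offset, so it is O(n), not O(1).
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // fewer than n characters: return everything
    }
}

fn main() {
    let s = "café au lait";

    // 'é' occupies bytes 3 and 4, so offset 4 is not a character boundary.
    println!("boundary at 3: {}", s.is_char_boundary(3));
    println!("boundary at 4: {}", s.is_char_boundary(4));

    println!("{:?}", prefix_bytes(s, 3));
    println!("{:?}", prefix_bytes(s, 4)); // `&s[..4]` would panic here instead
    println!("{:?}", prefix_bytes(s, 5));

    println!("{:?}", prefix_chars(s, 4));
}
```

Using `str::get` instead of indexing trades a panic for an `Option`, which tends to be the safer default when the offset comes from user input.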
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| String encoding | Bytes (no guarantee) | Guaranteed UTF-8 |
| Validation | Manual | `str::from_utf8()` at boundary |
| Char type | `char` (a single byte) | `char` (Unicode scalar, 4 bytes) |
| Other encodings | `Uutf` library | `encoding_rs` crate |
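The `char` row of the table can be checked directly. A Rust `char` always takes 4 bytes in memory, while its UTF-8 encoding inside a `String` takes 1 to 4 bytes. The helper `sizes` below is a hypothetical name used only for this illustration.

```rust
/// Returns (in-memory size of a `char`, UTF-8 encoded length of `c`), both in bytes.
fn sizes(c: char) -> (usize, usize) {
    (std::mem::size_of::<char>(), c.len_utf8())
}

fn main() {
    for c in ['A', 'é', '中', '😀'] {
        let (in_memory, encoded) = sizes(c);
        println!("'{}': {} bytes as char, {} bytes as UTF-8", c, in_memory, encoded);
    }

    // The same text stored two ways: UTF-8 in a String vs. one char per element.
    let s = "Aé中😀";
    let as_chars: Vec<char> = s.chars().collect();
    println!("String holds {} bytes", s.len());
    println!(
        "Vec<char> holds {} bytes of elements",
        as_chars.len() * std::mem::size_of::<char>()
    );
}
```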
// 483. UTF-8 encoding patterns
fn main() {
    // char → UTF-8 bytes
    let mut buf = [0u8; 4];
    let c = 'é';
    // Keep only the length: holding the returned &mut str would keep `buf` mutably borrowed
    let n = c.encode_utf8(&mut buf).len();
    println!("'{}' encodes to {:?} ({} bytes)", c, &buf[..n], n);

    // Emoji (4 bytes in UTF-8)
    let emoji = '😀';
    let n = emoji.encode_utf8(&mut buf).len();
    println!("'{}' encodes to {} bytes", emoji, n);

    // All chars from string as bytes
    for c in "café".chars() {
        let n = c.encode_utf8(&mut buf).len();
        println!(" '{}' → {:02x?}", c, &buf[..n]);
    }

    // Validate &[u8] as UTF-8
    let valid_bytes: &[u8] = &[104, 101, 108, 108, 111];
    let invalid_bytes: &[u8] = &[104, 0xFF, 111];
    println!("valid: {:?}", std::str::from_utf8(valid_bytes));
    println!("invalid: {}", std::str::from_utf8(invalid_bytes).is_err());

    // BOM handling
    let with_bom = "\u{FEFF}Hello";
    let stripped = with_bom.strip_prefix('\u{FEFF}').unwrap_or(with_bom);
    println!("stripped BOM: '{}'", stripped);

    // Char length in bytes
    for c in ['A', 'é', '中', '😀'] {
        println!(" '{}' → {} bytes", c, c.len_utf8());
    }
}
#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }

    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }

    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}
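When validation fails, the returned `Utf8Error` also says where: `valid_up_to()` gives the length of the valid prefix, and `error_len()` gives the length of the invalid sequence, or `None` if the input simply ended mid-character. The following is a sketch of how that information might be used; `first_error` and `split_valid_prefix` are hypothetical helper names.

```rust
/// Location of the first invalid sequence, if any:
/// (length of the valid prefix, length of the bad sequence or None if input was truncated).
fn first_error(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    std::str::from_utf8(bytes)
        .err()
        .map(|e| (e.valid_up_to(), e.error_len()))
}

/// Split input into the longest valid UTF-8 prefix and the remaining bytes.
fn split_valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match std::str::from_utf8(bytes) {
        Ok(s) => (s, &bytes[bytes.len()..]),
        Err(e) => {
            let (good, rest) = bytes.split_at(e.valid_up_to());
            // from_utf8 guarantees that everything before valid_up_to() is valid,
            // so this second check cannot fail.
            (std::str::from_utf8(good).expect("prefix is valid UTF-8"), rest)
        }
    }
}

fn main() {
    let inputs: [&[u8]; 3] = [b"hello", b"h\xFFo", b"caf\xc3"];
    for bytes in inputs {
        let (text, rest) = split_valid_prefix(bytes);
        println!(
            "{:?}: error = {:?}, valid prefix = {:?}, remaining = {:?}",
            bytes,
            first_error(bytes),
            text,
            rest
        );
    }
}
```

The `None` case of `error_len()` is useful when reading from a stream: it suggests waiting for more bytes before treating the input as corrupt.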
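(* 483. UTF-8 encoding - OCaml *)
(* Manual encoder. Does not reject surrogates or values above 0x10FFFF. *)
let encode_utf8 codepoint =
  if codepoint < 0x80 then
    Bytes.create 1 |> (fun b -> Bytes.set b 0 (Char.chr codepoint); b)
  else if codepoint < 0x800 then
    let b = Bytes.create 2 in
    Bytes.set b 0 (Char.chr (0xC0 lor (codepoint lsr 6)));
    Bytes.set b 1 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
  else if codepoint < 0x10000 then
    let b = Bytes.create 3 in
    Bytes.set b 0 (Char.chr (0xE0 lor (codepoint lsr 12)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
  else
    (* 4-byte form for code points at or above 0x10000, e.g. emoji *)
    let b = Bytes.create 4 in
    Bytes.set b 0 (Char.chr (0xF0 lor (codepoint lsr 18)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 12) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 3 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b

let () =
  let e = encode_utf8 0xE9 in (* é *)
  Printf.printf "é bytes: %02x %02x\n" (Char.code (Bytes.get e 0)) (Char.code (Bytes.get e 1));
  (* OCaml strings are plain byte sequences: this literal is 5 bytes, not 4 characters *)
  let s = "caf\xc3\xa9" in
  Printf.printf "valid UTF-8 test: byte_len=%d\n" (String.length s)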
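For comparison with the hand-written OCaml encoder, here is a Rust sketch of the same bit manipulation covering the 1- to 4-byte forms. The function name `encode_utf8_manual` is hypothetical. In real Rust code `char::encode_utf8` already does this, so the manual version is only there to show the bit layout and to be cross-checked against the standard library.

```rust
/// Hand-rolled UTF-8 encoder, for illustration only.
/// Returns None for surrogates and for values above U+10FFFF.
fn encode_utf8_manual(cp: u32) -> Option<Vec<u8>> {
    // char::from_u32 rejects exactly the values that are not Unicode scalar values.
    char::from_u32(cp)?;

    let bytes = if cp < 0x80 {
        // 0xxxxxxx
        vec![cp as u8]
    } else if cp < 0x800 {
        // 110xxxxx 10xxxxxx
        vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8]
    } else if cp < 0x10000 {
        // 1110xxxx 10xxxxxx 10xxxxxx
        vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]
    };
    Some(bytes)
}

fn main() {
    let mut buf = [0u8; 4];
    for c in ['A', 'é', '中', '😀'] {
        let manual = encode_utf8_manual(c as u32).expect("char is always a scalar value");
        let std_bytes = c.encode_utf8(&mut buf).as_bytes().to_vec();
        println!("'{}': manual {:02x?}, std {:02x?}", c, manual, std_bytes);
    }
    // 0xD800 is a surrogate, so it has no UTF-8 encoding.
    println!("surrogate: {:?}", encode_utf8_manual(0xD800));
}
```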