483 Fundamental

String Encoding

The Problem

Software systems communicate using standardised text encodings: HTTP headers are Latin-1 or UTF-8, XML files may start with a byte-order mark (BOM, U+FEFF), JSON must be UTF-8, and legacy databases often use Windows-1252. Rust's str and String are always UTF-8 internally, so interfacing with the outside world requires encoding knowledge. Three questions come up repeatedly: how many bytes does a character occupy, how is a BOM detected, and how are arbitrary bytes validated as UTF-8 before they are accepted as a &str?

🎯 Learning Outcomes

• Encode a single char to its UTF-8 byte representation with encode_utf8(&mut buf)

• Query the UTF-8 byte length of a char with char::len_utf8()

• Validate a byte slice as UTF-8 with std::str::from_utf8 returning Result<&str, Utf8Error>

• Detect and strip a UTF-8 BOM with strip_prefix('\u{FEFF}')

• Understand the 1/2/3/4 byte UTF-8 encoding ranges for Unicode codepoints
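The byte-length ranges in the last outcome can be checked directly against the standard library. The following is a minimal sketch; utf8_bytes and len_from_table are hypothetical helper names introduced here for illustration, not part of the original example.

```rust
/// Returns the UTF-8 bytes of `c` as an owned vector.
/// (`utf8_bytes` is an illustrative helper, not a std function.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // 4 bytes is the maximum UTF-8 length of any char
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Byte length implied by the UTF-8 range table for a code point.
/// Assumes `cp` is a valid Unicode scalar value (no range or surrogate check).
fn len_from_table(cp: u32) -> usize {
    match cp {
        0x0000..=0x007F => 1,
        0x0080..=0x07FF => 2,
        0x0800..=0xFFFF => 3,
        _ => 4, // U+10000..=U+10FFFF
    }
}
```

For any char c, len_from_table(c as u32) should agree with c.len_utf8(). For instance, utf8_bytes('é') yields the two bytes C3 A9, and utf8_bytes('€') yields the three bytes E2 82 AC.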

Code Example

#![allow(clippy::all)]
// 483. UTF-8 encoding patterns

#[cfg(test)]
mod tests {
    #[test]
    fn test_encode() {
        let mut b = [0u8; 4];
        assert_eq!('A'.encode_utf8(&mut b), "A");
        assert_eq!('é'.len_utf8(), 2);
    }
    #[test]
    fn test_validate() {
        assert!(std::str::from_utf8(&[104, 105]).is_ok());
        assert!(std::str::from_utf8(&[0xFF]).is_err());
    }
    #[test]
    fn test_bom() {
        let s = "\u{FEFF}hi";
        assert_eq!(s.strip_prefix('\u{FEFF}'), Some("hi"));
    }
}

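(* 483. UTF-8 encoding – OCaml *)
(* Hand-rolled encoder. Assumes [codepoint] is a valid Unicode scalar value:
   surrogates and values above 0x10FFFF are not rejected. *)
let encode_utf8 codepoint =
  if codepoint < 0x80 then
    (* 1 byte: 0xxxxxxx *)
    Bytes.make 1 (Char.chr codepoint)
  else if codepoint < 0x800 then
    (* 2 bytes: 110xxxxx 10xxxxxx *)
    let b = Bytes.create 2 in
    Bytes.set b 0 (Char.chr (0xC0 lor (codepoint lsr 6)));
    Bytes.set b 1 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
  else if codepoint < 0x10000 then
    (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    let b = Bytes.create 3 in
    Bytes.set b 0 (Char.chr (0xE0 lor (codepoint lsr 12)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b
  else
    (* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    let b = Bytes.create 4 in
    Bytes.set b 0 (Char.chr (0xF0 lor (codepoint lsr 18)));
    Bytes.set b 1 (Char.chr (0x80 lor ((codepoint lsr 12) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor ((codepoint lsr 6) land 0x3F)));
    Bytes.set b 3 (Char.chr (0x80 lor (codepoint land 0x3F)));
    b

let () =
  let e = encode_utf8 0xE9 in  (* é *)
  Printf.printf "é bytes: %02x %02x\n" (Char.code (Bytes.get e 0)) (Char.code (Bytes.get e 1));
  (* String.length counts bytes, not characters: "café" is 5 bytes *)
  let s = "caf\xc3\xa9" in
  Printf.printf "café byte_len=%d\n" (String.length s)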

Key Differences

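Zero-copy validation: Rust's str::from_utf8 validates in place and returns a &str pointing into the original bytes. OCaml strings are plain byte sequences with no UTF-8 guarantee, so validation is a separate check: String.is_valid_utf_8 in the standard library since OCaml 4.14, or a decoding fold with a library such as Uutf on older versions.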

encode_utf8 to stack buffer: Rust encodes a char into a caller-provided [u8; 4], typically on the stack; OCaml's Uutf.Buffer.add_utf_8 writes to a heap-allocated Buffer.

BOM handling: Rust's strip_prefix('\u{FEFF}') handles BOM as a normal char; OCaml needs manual byte prefix matching.

len_utf8: Rust provides char::len_utf8() as an O(1) query. OCaml gained Uchar.utf_8_byte_length in 4.14; on older versions you must encode and measure.
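The zero-copy and BOM points above can be combined into one small function. This is a sketch under the assumption that the input is meant to be UTF-8; decode_text is a hypothetical name introduced here, not a standard library function.

```rust
use std::str::Utf8Error;

/// Validates `bytes` as UTF-8, then drops a leading BOM (U+FEFF) if present.
/// The returned `&str` borrows from `bytes`; no data is copied.
/// (`decode_text` is an illustrative helper name.)
fn decode_text(bytes: &[u8]) -> Result<&str, Utf8Error> {
    // Validation happens in place; on failure the error reports how many
    // leading bytes were valid (see `Utf8Error::valid_up_to`).
    let text = std::str::from_utf8(bytes)?;
    // After validation the BOM is an ordinary char and can be stripped as one.
    Ok(text.strip_prefix('\u{FEFF}').unwrap_or(text))
}
```

Because the result is a sub-slice of the input, its pointer equals the address of the first byte after the BOM. On invalid input such as b"caf\xC3", the error's valid_up_to() is 3, which lets a caller keep the valid prefix.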

OCaml Approach

OCaml encodes/decodes UTF-8 via the Uutf library:

(* Encode a Unicode codepoint to UTF-8 bytes *)
let encode_utf8 uchar =
  let buf = Buffer.create 4 in
  Uutf.Buffer.add_utf_8 buf uchar;
  Buffer.to_bytes buf

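(* Validate UTF-8: fold over the decoded characters and fail on any
   malformed sequence. `Malformed carries the offending bytes as a string. *)
let is_valid_utf8 s =
  Uutf.String.fold_utf_8 (fun ok _ d ->
    match d with
    | `Malformed _ -> false
    | `Uchar _ -> ok) true s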

OCaml has no BOM-stripping in the standard library; a manual if String.length s >= 3 && String.sub s 0 3 = "\xef\xbb\xbf" then String.sub s 3 ... check is typical.

Exercises

UTF-8 byte length table: Write a function that prints each codepoint range (U+0000–U+007F, U+0080–U+07FF, U+0800–U+FFFF, U+10000–U+10FFFF) and its byte count.

Streaming validator: Implement Utf8Validator that accepts bytes one at a time and returns Valid, Invalid, or Incomplete (for a multibyte sequence split across buffers).

BOM-aware reader: Write read_text_file(path: &Path) -> Result<String> that reads raw bytes, detects UTF-8/UTF-16 BOM, and returns a normalised UTF-8 string (transcode UTF-16 using the encoding_rs crate).
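As a possible starting point for the streaming validator, the standard library already distinguishes truncated input from invalid input: Utf8Error::error_len() returns None when the bytes end in the middle of a multibyte sequence. The sketch below classifies a whole buffer rather than accepting one byte at a time; ChunkStatus and classify_chunk are hypothetical names introduced here.

```rust
/// Outcome of checking one buffer (illustrative type, not from std).
#[derive(Debug, PartialEq)]
enum ChunkStatus {
    Valid,
    Invalid,
    Incomplete,
}

/// Classifies `bytes` as valid UTF-8, invalid, or cut off mid-sequence.
fn classify_chunk(bytes: &[u8]) -> ChunkStatus {
    match std::str::from_utf8(bytes) {
        Ok(_) => ChunkStatus::Valid,
        // `error_len() == None` means the input ended unexpectedly,
        // so more bytes might still complete the sequence.
        Err(e) if e.error_len().is_none() => ChunkStatus::Incomplete,
        Err(_) => ChunkStatus::Invalid,
    }
}
```

For example, the first two bytes of '€' (E2 82) classify as Incomplete, while a lone FF byte is Invalid. Turning this into a byte-at-a-time Utf8Validator, including carrying the trailing bytes over to the next buffer, remains the exercise.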

Open Source Repos

functional-rust

View the source for this example on GitHub — OCaml and Rust side by side in the repo.
