๐Ÿฆ€ Functional Rust

481: bytes() and Byte-Level Operations

Difficulty: 1 Level: Intermediate Work with raw UTF-8 bytes โ€” for binary protocols, checksums, and ASCII-only processing.

The Problem This Solves

Most string operations in Rust work at the character (Unicode scalar) level. But sometimes you need to work at the byte level: computing checksums, implementing binary protocols, parsing ASCII-only formats, or interfacing with C libraries that think in bytes. Python has `str.encode('utf-8')` to get bytes, and `bytes` is a distinct type. In Rust, a `String` is a `Vec<u8>` with a UTF-8 guarantee. You can always get at those raw bytes โ€” `s.as_bytes()` gives you a `&[u8]` view, and `s.bytes()` gives you an iterator of `u8`. The critical issue: once you go to bytes, you must stay byte-correct. UTF-8 means that modifying raw bytes can break the string. Rust's `String::from_utf8()` validates UTF-8 when converting bytes back to a string โ€” and it returns a `Result`, forcing you to handle the case where the bytes aren't valid UTF-8. `from_utf8_lossy()` replaces bad sequences with the replacement character `<0xEF><0xBF><0xBD>` instead of returning an error.

The Intuition

`.bytes()` is Python's `s.encode('utf-8')` iteration โ€” each element is a `u8` raw byte. `.as_bytes()` is a zero-copy view of the same data as a `&[u8]` slice. Going the other direction: `String::from_utf8(vec_of_bytes)` is Python's `bytes.decode('utf-8')` โ€” it validates and returns `Result<String, FromUtf8Error>`. `String::from_utf8_lossy(&bytes)` is Python's `bytes.decode('utf-8', errors='replace')`. For ASCII-only processing, bytes are safe and efficient. ASCII characters always fit in a single byte in UTF-8, so byte-level operations like `b.to_ascii_lowercase()` are both correct and fast. For any multi-language text, use `.chars()` instead.

How It Works in Rust

let s = "Hello, World!";

// bytes() โ€” iterator of u8
for b in s.bytes() {
 print!("{:02x} ", b);  // 48 65 6c 6c 6f ...
}

// as_bytes() โ€” zero-copy &[u8] view (no allocation)
let bytes: &[u8] = s.as_bytes();
let sum: u32 = bytes.iter().map(|&b| b as u32).sum();
let spaces = bytes.iter().filter(|&&b| b == b' ').count();

// ASCII-only operations on bytes (safe for ASCII)
let lower: Vec<u8> = s.bytes()
 .map(|b| b.to_ascii_lowercase())
 .collect();
let lower_str = String::from_utf8(lower).unwrap();  // "hello, world!"

// from_utf8 โ€” validates, returns Result
String::from_utf8(vec![72, 105])        // Ok("Hi")
String::from_utf8(vec![0xFF])           // Err โ€” invalid UTF-8

// from_utf8_lossy โ€” replaces bad sequences (returns Cow<str>)
let bytes = b"hell\xFF world";
let s = String::from_utf8_lossy(bytes); // "hell<replacement char> world"

// Build a String from known-good bytes
let bytes = vec![72u8, 101, 108, 108, 111];
let s = String::from_utf8(bytes).unwrap(); // "Hello"

// Literal byte string: b"hello" is &[u8]
let bs: &[u8] = b"hello";

What This Unlocks

Key Differences

ConceptOCamlRust
Byte-level access`Bytes.of_string s``s.as_bytes()` โ†’ `&[u8]`
Iterate bytes`Bytes.iter``s.bytes()`
String is bytesMutable `Bytes` separate`String` = UTF-8 `Vec<u8>` internally
Bytes to string`Bytes.to_string` (no validation!)`String::from_utf8(vec)` โ†’ `Result`
Invalid bytesNo validation`from_utf8_lossy()` โ€” replaces bad bytes
Byte literal`"\xNN"``b'\xNN'` (char) / `b"..."` (&[u8])
// 481. bytes() and byte-level operations
fn main() {
    let s = "Hello, World!";
    print!("bytes: "); for b in s.bytes() { print!("{:02x} ",b); } println!();

    // as_bytes โ€” zero-copy &[u8]
    let bytes: &[u8] = s.as_bytes();
    println!("sum={}", bytes.iter().map(|&b|b as u32).sum::<u32>());
    println!("spaces={}", bytes.iter().filter(|&&b|b==b' ').count());

    // ASCII lowercase via bytes
    let lower: Vec<u8> = s.bytes().map(|b|b.to_ascii_lowercase()).collect();
    println!("lower: {}", String::from_utf8(lower).unwrap());

    // from_utf8 โ€” validates
    println!("valid: {:?}", String::from_utf8(vec![72,105]));
    println!("invalid is_err: {}", String::from_utf8(vec![0xFF]).is_err());

    // from_utf8_lossy โ€” replaces bad bytes
    let lossy = String::from_utf8_lossy(&[104,101,0xFF,108,111]);
    println!("lossy: {}", lossy);

    // Build from bytes
    let built = String::from_utf8(vec![72,101,108,108,111]).unwrap();
    println!("built: {}", built);
}

#[cfg(test)]
mod tests {
    #[test] fn test_bytes()    { assert_eq!("hi".bytes().collect::<Vec<_>>(),vec![104,105]); }
    #[test] fn test_from()     { assert_eq!(String::from_utf8(vec![104,105]).unwrap(),"hi"); }
    #[test] fn test_invalid()  { assert!(String::from_utf8(vec![0xFF]).is_err()); }
    #[test] fn test_lossy()    { let s=String::from_utf8_lossy(&[104,0xFF,105]); assert!(s.contains('h')); }
}
(* 481. Byte-level strings โ€“ OCaml *)
let () =
  let s = "Hello, World!" in
  let b = Bytes.of_string s in
  Printf.printf "bytes: ";
  Bytes.iter (fun c -> Printf.printf "%02x " (Char.code c)) b;
  print_newline ();
  let csum = Bytes.fold_left (fun a c -> a + Char.code c) 0 b in
  Printf.printf "checksum=%d\n" csum;
  let low = Bytes.map (fun c ->
    if c >= 'A' && c <= 'Z' then Char.chr (Char.code c + 32) else c) b in
  Printf.printf "lower: %s\n" (Bytes.to_string low);
  let raw = [|72;101;108;108;111|] in
  let s2 = Bytes.init (Array.length raw) (fun i -> Char.chr raw.(i)) in
  Printf.printf "from bytes: %s\n" (Bytes.to_string s2)