765: CSV Parsing Without External Crates

Difficulty: 3 | Level: Intermediate

A complete RFC 4180-compliant CSV parser using an explicit state machine that handles quoted fields, embedded commas, and escaped quotes.

The Problem This Solves
CSV looks trivial: split on commas, right? Then you encounter `"Smith, John"` and realize commas inside quotes are valid. Then `"She said ""hello"""` and realize quotes inside quoted fields are represented as doubled quotes. Then Windows line endings (`\r\n`). Then empty fields. The naive `split(',')` approach breaks on all of these.

In production, CSV appears everywhere: exports from databases, spreadsheets, billing systems, analytics platforms. Getting the parsing wrong means silent data corruption. You read `Smith` from a field that should be `Smith, John`, and the downstream system gets garbage. RFC 4180 defines the standard, and a correct parser follows it precisely.

Writing this by hand also teaches you state machines, a fundamental tool in systems programming. The CSV state machine has exactly three states (`Normal`, `Quoted`, `QuoteInQuoted`), and the transition logic fits in a single `match`. Understanding this pattern makes every other format parser easier to reason about.

The Intuition
Think of Python's `csv.reader`: it handles all these edge cases internally. In JavaScript, `Papa Parse` does the same. In Rust, the `csv` crate is excellent for production. But writing it by hand shows you exactly what those libraries are doing. The state machine approach is cleaner than hand-tracking indices. You process one character at a time and transition between states:

- `Normal`: outside quotes. Commas separate fields; `"` starts a quoted field.
- `Quoted`: inside quotes. Everything is literal; `"` might end the field or be an escape.
- `QuoteInQuoted`: just saw `"` inside quotes. If the next char is `"`, it's an escaped quote; if it's `,`, the field ended; if it's end-of-input, the field ended.
How It Works in Rust
```rust
#[derive(Debug, PartialEq)]
enum State { Normal, Quoted, QuoteInQuoted }

pub fn parse_fields(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut buf = String::new();
    let mut state = State::Normal;
    for ch in line.chars() {
        match (&state, ch) {
            // Normal: comma ends field, quote starts quoted field
            (State::Normal, ',') => { fields.push(buf.clone()); buf.clear(); }
            (State::Normal, '"') => { state = State::Quoted; }
            (State::Normal, c) => { buf.push(c); }
            // Quoted: quote might end field or be escaped
            (State::Quoted, '"') => { state = State::QuoteInQuoted; }
            (State::Quoted, c) => { buf.push(c); }
            // Just saw a closing quote: is it escaped or end of field?
            (State::QuoteInQuoted, '"') => {
                buf.push('"'); // "" = escaped quote
                state = State::Quoted;
            }
            (State::QuoteInQuoted, ',') => {
                fields.push(buf.clone()); // field ended
                buf.clear();
                state = State::Normal;
            }
            (State::QuoteInQuoted, c) => {
                buf.push(c); // trailing content after closing quote
                state = State::Normal;
            }
        }
    }
    fields.push(buf); // last field (no trailing comma)
    fields
}

// The target record type for typed rows
pub struct Person {
    pub name: String,
    pub age: u32,
    pub city: String,
}

// Parse typed records from rows
impl Person {
    pub fn from_row(row: &[String]) -> Option<Self> {
        if row.len() < 3 { return None; }
        let age = row[1].trim().parse().ok()?;
        Some(Person { name: row[0].clone(), age, city: row[2].clone() })
    }
}

// Parse a whole CSV document
pub fn parse_csv(text: &str) -> Vec<Vec<String>> {
    text.lines()
        .map(|l| l.trim_end_matches('\r')) // handle Windows \r\n
        .filter(|l| !l.is_empty())
        .map(parse_fields)
        .collect()
}
```
Result for `"Bob, Jr.",25,"New York"`:

- Field 1: `Bob, Jr.` (comma inside quotes is not a separator)
- Field 2: `25`
- Field 3: `New York`

And for `"a""b",c`:

- Field 1: `a"b` (doubled quote becomes a single quote)
- Field 2: `c`
Key details:

- Pattern matching on `(&state, ch)`: the state and character together determine the transition.
- `buf.clone()` + `buf.clear()` accumulates each field into a buffer and pushes it on each comma.
- `trim_end_matches('\r')` handles Windows line endings without pulling in platform-specific code.
- The final `fields.push(buf)` handles the last field, since CSV rows don't end with a comma.
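These behaviors can be verified with a few assertions. A self-contained sketch (it restates the same three-state `match` from above so the snippet compiles on its own):

```rust
// Compact restatement of the three-state parser, so this snippet is standalone.
#[derive(PartialEq)]
enum State { Normal, Quoted, QuoteInQuoted }

fn parse_fields(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut buf = String::new();
    let mut state = State::Normal;
    for ch in line.chars() {
        match (&state, ch) {
            (State::Normal, ',') => { fields.push(buf.clone()); buf.clear(); }
            (State::Normal, '"') => state = State::Quoted,
            (State::Normal, c) => buf.push(c),
            (State::Quoted, '"') => state = State::QuoteInQuoted,
            (State::Quoted, c) => buf.push(c),
            (State::QuoteInQuoted, '"') => { buf.push('"'); state = State::Quoted; }
            (State::QuoteInQuoted, ',') => { fields.push(buf.clone()); buf.clear(); state = State::Normal; }
            (State::QuoteInQuoted, c) => { buf.push(c); state = State::Normal; }
        }
    }
    fields.push(buf); // last field
    fields
}

fn main() {
    // Quoted comma is not a separator
    assert_eq!(parse_fields(r#""Bob, Jr.",25,"New York""#), vec!["Bob, Jr.", "25", "New York"]);
    // Doubled quote collapses to a single quote
    assert_eq!(parse_fields(r#""a""b",c"#), vec!["a\"b", "c"]);
    // Empty fields survive
    assert_eq!(parse_fields("x,,y"), vec!["x", "", "y"]);
    println!("all assertions passed");
}
```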
What This Unlocks
- Data import pipelines: correctly handle any CSV export from Excel, PostgreSQL `COPY`, or billing systems, including fields with commas and embedded quotes
- State machine fluency: the same three-state pattern applies to tokenizers, protocol parsers, and format readers; master it once, apply it everywhere
- Custom column mapping: `Person::from_row` shows how to map string columns to typed fields with proper error handling via `Option`
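The `Option`-based column mapping from the last bullet can be sketched on its own. `Person` and its fields here follow the illustrative type used above:

```rust
#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    age: u32,
    city: String,
}

impl Person {
    // `?` on Option short-circuits: any missing or unparseable column yields None.
    fn from_row(row: &[String]) -> Option<Self> {
        Some(Person {
            name: row.get(0)?.clone(),
            age: row.get(1)?.trim().parse().ok()?,
            city: row.get(2)?.clone(),
        })
    }
}

fn main() {
    let good = vec!["Alice".to_string(), "30".to_string(), "Amsterdam".to_string()];
    let bad = vec!["Bob".to_string(), "not-a-number".to_string(), "NY".to_string()];
    assert!(Person::from_row(&good).is_some());
    assert!(Person::from_row(&bad).is_none()); // parse failure becomes None, not a panic
    println!("typed mapping works");
}
```

Returning `Option` keeps the row-to-record boundary explicit: callers decide whether a bad row is skipped, logged, or fatal.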
Key Differences
| Concept | OCaml | Rust |
|---|---|---|
| State machine | Variant type + recursive function | `enum State` + `match (&state, ch)` |
| Field accumulation | `Buffer.t` | `String` with `push` / `clear` |
| State transition | Match on `(state, char)` | Same: match on `(&state, ch)` |
| Row parsing | `String.split_on_char ','` (naive) | State machine handling quotes and escapes |
| Production library | `csv` (opam) | `csv` crate |
| Windows line endings | Manual stripping | `.trim_end_matches('\r')` |
A fuller, reusable version of the same ideas, with error handling, row formatting, header support, and tests:

```rust
//! # CSV Parsing Pattern
//!
//! Simple CSV parser without external dependencies.

/// A parsed CSV row
pub type Row = Vec<String>;

/// CSV parse error
#[derive(Debug, PartialEq)]
pub enum CsvError {
    UnterminatedQuote(usize),
    InconsistentColumns { expected: usize, got: usize, line: usize },
}

/// Parse a CSV string into rows
pub fn parse_csv(input: &str) -> Result<Vec<Row>, CsvError> {
    let mut rows = Vec::new();
    let mut expected_cols = None;
    for (line_num, line) in input.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let row = parse_row(line, line_num)?;
        match expected_cols {
            None => expected_cols = Some(row.len()),
            Some(n) if row.len() != n => {
                return Err(CsvError::InconsistentColumns {
                    expected: n,
                    got: row.len(),
                    line: line_num,
                });
            }
            _ => {}
        }
        rows.push(row);
    }
    Ok(rows)
}

/// Parse a single CSV row.
/// Note: this version trims whitespace around fields, a small
/// deviation from strict RFC 4180, which preserves it.
fn parse_row(line: &str, line_num: usize) -> Result<Row, CsvError> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(ch) = chars.next() {
        if in_quotes {
            if ch == '"' {
                if chars.peek() == Some(&'"') {
                    chars.next();
                    current.push('"');
                } else {
                    in_quotes = false;
                }
            } else {
                current.push(ch);
            }
        } else {
            match ch {
                '"' => in_quotes = true,
                ',' => {
                    fields.push(current.trim().to_string());
                    current = String::new();
                }
                _ => current.push(ch),
            }
        }
    }
    if in_quotes {
        return Err(CsvError::UnterminatedQuote(line_num));
    }
    fields.push(current.trim().to_string());
    Ok(fields)
}

/// Format rows as CSV
pub fn format_csv(rows: &[Row]) -> String {
    rows.iter()
        .map(|row| {
            row.iter()
                .map(|field| {
                    if field.contains(',') || field.contains('"') || field.contains('\n') {
                        format!("\"{}\"", field.replace('"', "\"\""))
                    } else {
                        field.clone()
                    }
                })
                .collect::<Vec<_>>()
                .join(",")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Parse CSV with headers, returning maps
pub fn parse_csv_with_headers(
    input: &str,
) -> Result<Vec<std::collections::HashMap<String, String>>, CsvError> {
    let rows = parse_csv(input)?;
    if rows.is_empty() {
        return Ok(Vec::new());
    }
    let headers = &rows[0];
    let mut result = Vec::new();
    for row in rows.iter().skip(1) {
        let mut map = std::collections::HashMap::new();
        for (i, value) in row.iter().enumerate() {
            if let Some(header) = headers.get(i) {
                map.insert(header.clone(), value.clone());
            }
        }
        result.push(map);
    }
    Ok(result)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_simple_csv() {
        let input = "a,b,c\n1,2,3\n4,5,6";
        let rows = parse_csv(input).unwrap();
        assert_eq!(rows.len(), 3);
        assert_eq!(rows[0], vec!["a", "b", "c"]);
        assert_eq!(rows[1], vec!["1", "2", "3"]);
    }

    #[test]
    fn test_quoted_field() {
        let input = "name,value\n\"hello, world\",42";
        let rows = parse_csv(input).unwrap();
        assert_eq!(rows[1][0], "hello, world");
    }

    #[test]
    fn test_escaped_quote() {
        let input = "text\n\"say \"\"hello\"\"\"";
        let rows = parse_csv(input).unwrap();
        assert_eq!(rows[1][0], "say \"hello\"");
    }

    #[test]
    fn test_inconsistent_columns() {
        let input = "a,b,c\n1,2";
        let result = parse_csv(input);
        assert!(matches!(
            result,
            Err(CsvError::InconsistentColumns { .. })
        ));
    }

    #[test]
    fn test_format_csv() {
        let rows = vec![
            vec!["a".to_string(), "b".to_string()],
            vec!["1".to_string(), "2".to_string()],
        ];
        let output = format_csv(&rows);
        assert_eq!(output, "a,b\n1,2");
    }

    #[test]
    fn test_with_headers() {
        let input = "name,age\nAlice,30\nBob,25";
        let records = parse_csv_with_headers(input).unwrap();
        assert_eq!(records.len(), 2);
        assert_eq!(records[0].get("name").unwrap(), "Alice");
        assert_eq!(records[0].get("age").unwrap(), "30");
    }
}
```
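Writing CSV applies the same RFC 4180 rules in reverse: quote a field when it contains a comma, a quote, or a newline, and double any embedded quotes. A minimal standalone sketch of that quoting step (mirroring the per-field logic inside `format_csv` above):

```rust
// Quote a single field per RFC 4180, only when it needs quoting.
fn quote_field(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

fn main() {
    // No special characters: emitted as-is
    assert_eq!(quote_field("plain"), "plain");
    // Embedded comma forces quoting
    assert_eq!(quote_field("Smith, John"), "\"Smith, John\"");
    // Embedded quotes are doubled, then the field is wrapped
    assert_eq!(quote_field("say \"hi\""), "\"say \"\"hi\"\"\"");
    println!("quoting ok");
}
```

Because quoting and parsing are exact inverses, `parse_fields` applied to a formatted row recovers the original fields, which makes round-trip tests a natural way to gain confidence in both directions.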
How It Works in OCaml

```ocaml
(* CSV parsing without external crates in OCaml *)

(* RFC 4180-compliant CSV field parser *)
let parse_fields line =
  let len = String.length line in
  let fields = ref [] in
  let i = ref 0 in
  while !i <= len do
    if !i = len then begin
      fields := "" :: !fields;
      i := len + 1
    end else if line.[!i] = '"' then begin
      (* Quoted field *)
      incr i;
      let buf = Buffer.create 16 in
      let stop = ref false in
      while not !stop && !i < len do
        if line.[!i] = '"' then begin
          if !i + 1 < len && line.[!i + 1] = '"' then begin
            Buffer.add_char buf '"';
            i := !i + 2
          end else begin
            incr i;
            stop := true
          end
        end else begin
          Buffer.add_char buf line.[!i];
          incr i
        end
      done;
      fields := Buffer.contents buf :: !fields;
      if !i < len && line.[!i] = ',' then incr i
      else if !i >= len then i := len + 1
    end else begin
      (* Unquoted field *)
      let start = !i in
      while !i < len && line.[!i] <> ',' do incr i done;
      fields := String.sub line start (!i - start) :: !fields;
      if !i < len then incr i
      else i := len + 1
    end
  done;
  List.rev !fields

type person = { name: string; age: int; city: string }

let parse_person fields =
  match fields with
  | [name; age_s; city] ->
    (try Some { name; age = int_of_string (String.trim age_s); city }
     with Failure _ -> None)
  | _ -> None

let csv = {|Name,Age,City
Alice,30,Amsterdam
"Bob, Jr.",25,"New York"
Carol,35,Berlin|}

let () =
  let lines = String.split_on_char '\n' csv in
  match lines with
  | [] | [_] -> ()
  | _header :: rows ->
    List.iter (fun line ->
      let fields = parse_fields line in
      match parse_person fields with
      | Some p -> Printf.printf "Person: %s, %d, %s\n" p.name p.age p.city
      | None -> Printf.printf "Could not parse: %s\n" line
    ) rows
```