
Exploring parsing APIs: what to generate, and how

November 22, 2024 - Tagged as: en, parsing, rust.

Consider a simplified and enhanced version of JSON, with these changes:

When parsing a language like this, a common first step is to define an “abstract syntax tree” (AST) type, with only the details we want from the parser output.

For example, if we’re implementing a tool like jq, the AST may look like:

enum Json {
    Int(u64),
    Str(String),
    Bool(bool),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
    Null,
}

This type is called an “abstract” syntax tree because it abstracts the unnecessary details from the parse output. In our tool we don’t need locations of nodes and comments, so the AST doesn’t contain them.

It’s easy to implement a parser for this AST: we iterate over the input, skip whitespace and comments, then decide based on the next character what type of node (integer, string, etc.) to parse. For nested Json nodes in arrays and objects, we call the parser recursively.

This kind of parser is called a “recursive descent parser”. For our AST above, the parser looks like this:

// The entry point: parses all of the input to JSON.
pub fn parse(input: &str) -> Result<Json, ParseError> {
    let mut iter = input.char_indices().peekable();
    let (_, json) = parse_single(&mut iter, input)?;
    skip_trivia(&mut iter)?;
    // Check that all of the input is consumed.
    ...
}

// Parse a single Json. After parsing, the input may have more characters to be parsed.
fn parse_single(
    iter: &mut Peekable<CharIndices>,
    input: &str,
) -> Result<(usize, Json), ParseError> {
    // Skip whitespace and comments.
    skip_trivia(iter)?;

    // Get next character.
    let (byte_offset, char) = match iter.next() { ... };

    if char == '[' {
        // Parse an array. Call `parse_single` recursively for elements.
        ...
    }

    if char == '{' {
        // Parse an object. Call `parse_single` recursively for values.
        ...
    }

    if char == 't' {
        // Parse keyword "true".
        ...
    }

    // Same for other keywords, integers, strings.
    ...
}

While very common, this kind of parser is inflexible, and for many use cases slower than more flexible alternatives.

Consider these use cases:

A well-known solution to these is to introduce a lower-level parser that doesn’t generate a fully structured output like an AST, but a stream of “parse events”. These events should be general enough to support different use cases like the ones we listed above, and should be cheap to allocate and pass around, ideally as stack-allocated values, so that applications that don’t need some events can skip them efficiently.

This type of parsing is often called “event driven parsing”. In our JSON variant, the events look like this:

/// A parse event, with location of the event in the input.
pub struct ParseEvent {
    pub kind: ParseEventKind,
    pub byte_offset: usize,
}

/// Details of a parse event.
pub enum ParseEventKind {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Int(u64),
    Str {
        /// Size of the string, not including the double quotes.
        size_in_bytes: usize,
    },
    Bool(bool),
    Null,
    Comment {
        /// Size of the comment, including the "//" at the beginning and the newline at the end.
        size_in_bytes: usize,
    },
}

Note that there’s no heap allocation required for these events. Contents of strings and comments can be obtained by slicing the input using the event location and size_in_bytes field.

When generating these events, it’s important that we don’t scan the whole input and collect all of the events in a list, as that would mean some users, like our RPC server and log sorter examples above, would do more work than necessary.

This means that the parser has to be stateful: after returning an event, it needs to be able to continue from the last event’s location. This complicates the parser implementation quite a bit. Here’s what the parser looks like at a high level:

// The entry point. Use via the `Iterator` interface.
pub fn parse_events(input: &str) -> EventParser {
    EventParser::new(input)
}

// The parser state.
pub struct EventParser<'a> {
    input: &'a str,
    byte_offset: usize,
    container_stack: Vec<Container>,
    state: ParserState,
}

enum Container {
    Array,
    Object,
}

enum ParserState {
    /// Parse any kind of object, update state based on the current container.
    TopLevel,

    /// Finished parsing a top-level object, expect end-of-input.
    Done,

    /// Parsing an object, parse another key-value pair on ',', or finish the object on '}'.
    ObjectExpectComma,

    /// Parsing an object, parse the first key-value pair, or finish the object on '}'.
    ObjectExpectKeyValue,

    /// Parsing an object and we've just parsed a key, expect ':'.
    ObjectExpectColon,

    /// Parsing an array, parse another element on ',', or finish the array on ']'.
    ArrayExpectComma,
}

impl<'a> Iterator for EventParser<'a> {
    type Item = Result<ParseEvent, ParseError>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.state {
            ParserState::TopLevel => self.top_level(),
            ParserState::Done => self.done(),
            ParserState::ObjectExpectComma => self.object_expect_comma(),
            ParserState::ObjectExpectKeyValue => self.object_expect_key_value(),
            ParserState::ObjectExpectColon => self.object_expect_colon(),
            ParserState::ArrayExpectComma => self.array_expect_comma(),
        }
    }
}

...

The main complexity of this parser comes from the fact that it cannot return an event and keep running: the caller needs to call the relevant method (next from the Iterator trait above) to keep parsing. To be able to continue from where it left off, the parser needs to maintain some state outside of the parse functions.

This parser is general enough to allow implementing our original AST parser:

pub fn event_to_tree<I: Iterator<Item = Result<ParseEvent, ParseError>>>(
    parser: &mut I,
    input: &str,
) -> Result<Json, ParseError> {
    let mut container_stack: Vec<Container> = vec![];
    let mut current_container: Option<Container> = None;
    let mut parsed_object: Option<Json> = None;

    for event in parser.by_ref() {
        match event {
            ...
        }
    }

    Ok(parsed_object.unwrap())
}

But it also allows parsing to an AST with comments (for our formatter), to an AST with source locations (for our configuration parser), and implementing our RPC server and log sorter. Here’s what the timestamp parser looks like; it stops as soon as it finds the field:

/// Parse the "timestamp" field at the top-level map of the JSON.
pub fn parse_timestamp(log_line: &str) -> Result<Option<u64>, ParseError> {
    let mut container_depth: u32 = 0;
    let mut expect_timestamp = false;

    for event in parse_events(log_line) {
        let ParseEvent { kind, byte_offset } = match event {
            Ok(event) => event,
            Err(err) => return Err(err),
        };

        let expect_timestamp_ = expect_timestamp;
        expect_timestamp = false;

        match kind {
            ParseEventKind::StartObject => {
                container_depth += 1;
            }

            ParseEventKind::EndObject => {
                container_depth -= 1;
            }

            ParseEventKind::StartArray => {
                if container_depth == 0 {
                    // Array at the top level, the line does not contain the field.
                    return Ok(None);
                }
                container_depth += 1;
            }

            ParseEventKind::EndArray => {
                container_depth -= 1;
            }

            ParseEventKind::Str { size_in_bytes } => {
                if container_depth != 1 {
                    continue;
                }
                let str = &log_line[byte_offset..byte_offset + size_in_bytes];
                expect_timestamp = str == "timestamp";
            }

            ParseEventKind::Int(i) => {
                if expect_timestamp_ {
                    return Ok(Some(i));
                }
            }

            ParseEventKind::Bool(_)
            | ParseEventKind::Null
            | ParseEventKind::Comment { .. } => {}
        }
    }

    Ok(None)
}

A nice property of this parser is that it does not allocate at all. It doesn’t build an AST (so no heap-allocated vectors), and parse events are 24-byte stack-allocated values. The event parser itself is also stack-allocated by this function.
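The 24-byte figure can be checked directly. A sketch that reproduces the event types from above and asserts their size (assuming a typical 64-bit target: an 8-byte discriminant plus an 8-byte payload for the kind, plus the 8-byte offset):

```rust
use std::mem::size_of;

#[allow(dead_code)]
pub enum ParseEventKind {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Int(u64),
    Str { size_in_bytes: usize },
    Bool(bool),
    Null,
    Comment { size_in_bytes: usize },
}

#[allow(dead_code)]
pub struct ParseEvent {
    pub kind: ParseEventKind,
    pub byte_offset: usize,
}

fn main() {
    // Largest payload (u64/usize) forces 8-byte alignment, so the kind is 16
    // bytes and the whole event is 24 bytes on a 64-bit target.
    assert_eq!(size_of::<ParseEventKind>(), 16);
    assert_eq!(size_of::<ParseEvent>(), 24);
}
```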

An alternative design, sometimes called a “push parser”, is slightly less flexible and more difficult to use, but easier to implement and faster.

The idea is that, instead of returning one event at a time, the parser takes a “listener” argument, and calls the listener callbacks for each event generated. The listener type directly follows our event type above:

// Methods return a `bool` indicating whether to continue parsing after the event.
pub trait EventListener {
    fn handle_start_object(&mut self, _byte_offset: usize) -> bool {
        true
    }

    fn handle_end_object(&mut self, _byte_offset: usize) -> bool {
        true
    }

    fn handle_start_array(&mut self, _byte_offset: usize) -> bool {
        true
    }

    fn handle_end_array(&mut self, _byte_offset: usize) -> bool {
        true
    }

    fn handle_int(&mut self, _byte_offset: usize, _i: u64) -> bool {
        true
    }

    fn handle_str(&mut self, _byte_offset: usize, _size_in_bytes: usize) -> bool {
        true
    }

    fn handle_bool(&mut self, _byte_offset: usize, _b: bool) -> bool {
        true
    }

    fn handle_null(&mut self, _byte_offset: usize) -> bool {
        true
    }

    fn handle_comment(&mut self, _byte_offset: usize, _size_in_bytes: usize) -> bool {
        true
    }

    fn handle_error(&mut self, _error: ParseError);
}

The parser:

// The entry point. Parse all of the input, call `listener` with the events.
pub fn parse<L: EventListener>(input: &str, listener: &mut L) {
    let mut iter = input.char_indices().peekable();
    let input_size = input.len();

    // Parse a single JSON.
    if !parse_single(&mut iter, input_size, listener) {
        return;
    }

    // Check that all of the input is consumed.
    ...
}

// Returns whether to continue parsing: `false` means an error was reported
// or a listener callback requested a stop.
fn parse_single<L: EventListener>(
    iter: &mut Peekable<CharIndices>,
    input_size: usize,
    listener: &mut L,
) -> bool {
    // Skip whitespace and comments, generate events for comments.
    skip_trivia!(iter, listener);

    // Get next character.
    let (byte_offset, char) = match iter.next() {
        Some(next) => next,
        None => {
            listener.handle_error(ParseError {
                byte_offset: input_size,
                reason: "unexpected end of input",
            });
            return false;
        }
    };

    if char == '[' {
        // Parse an array. Call `parse_single` recursively for elements.
        ...
    }

    if char == '{' {
        // Parse an object. Call `parse_single` recursively for values.
        ...
    }

    if char == 't' {
        // Parse keyword "true".
        ...
    }

    // Same for other keywords, integers, strings.
    ...
}

Note that the parser functions are identical (in terms of names and what they do) to our simple recursive descent parser. This is because the parser no longer needs to maintain state to be able to return and continue from where it left off, as it does all of the work in one go. Instead of building an AST or a list of events, it takes an EventListener argument and calls the handle methods.

This is a bit less convenient to use, but it’s still flexible enough to build an AST. An EventListener implementation that builds up a Json AST looks like this:

pub struct AstBuilderListener<'a> {
    input: &'a str,
    container_stack: Vec<Container>,
    current_container: Option<Container>,
    parsed_object: Option<Json>,
    error: Option<ParseError>,
}

impl<'a> EventListener for AstBuilderListener<'a> {
    ...
}

However, this parser cannot stop parsing and resume later; if you need that, you need the iterator-based parser above.

The main advantage of this parser is that, with the right programming language and parser design, it can be faster than the alternatives, while still being flexible enough for most use cases. See below for benchmarks.


Aside: event parsing vs. lexing

Our ParseEvent type has no nested data and looks like what we would define as the “tokens” in a lexer for a programming language.

So it shouldn’t be surprising that we can use a lexer generator to implement a parse event generator:

// Same `parse_events` as above, but uses a generated lexer.
pub fn parse_events(input: &str) -> LexgenIteratorAdapter {
    LexgenIteratorAdapter {
        lexer: Lexer::new(input),
    }
}

// An adapter is necessary to convert lexgen values to `parse_events` items.
pub struct LexgenIteratorAdapter<'a> {
    lexer: Lexer<'a, std::str::Chars<'a>>,
}

impl<'a> Iterator for LexgenIteratorAdapter<'a> {
    type Item = Result<ParseEvent, ParseError>;

    fn next(&mut self) -> Option<Self::Item> {
        ...
    }
}

struct LexerState {
    container_stack: Vec<Container>,
}

lexgen::lexer! {
    Lexer(LexerState) -> ParseEvent;

    type Error = &'static str;

    let comment = "//" (_ # '\n')* '\n';

    rule Init {
        $$ascii_whitespace,

        $comment => comment,

        '[' => ...,

        ']' => ...,

        '{' => ...,

        "true" => ...,

        "false" => ...,

        "null" => ...,

        ['0'-'9']+ => ...,

        '"' (_ # '"')* '"' => ...
    }

    rule Done { ... }

    rule ArrayExpectComma { ... }

    rule ObjectExpectKeyValue { ... }

    rule ObjectExpectColon { ... }

    rule ObjectExpectComma { ... }
}

This uses lexgen. lexgen generates slightly different values than what we want, so the entry point wraps the lexer in an adapter that converts the lexgen values.

The main difference between an event parser and a lexer is that an event parser maintains some of the structure of the parsed format. For example, we check that brackets are balanced, that a colon follows each key in a map, and so on.

A lexer generator can be used to implement an event parser, as demonstrated above.


References and benchmarks

All of the code in this blog post, and more, is here: github.com/osa1/how-to-parse.

In the benchmark program (run with cargo bench), we generate a 10 MB JSON document, and parse it to either an AST or a vector of events.

AST building benchmarks:

Event generation benchmarks: (collect events in a Vec)

Notes:

More use cases

The use cases described at the beginning of the post are all extracted from real-world use cases of various other formats.

Here are more use cases that require flexible and fast parser design:

Event parsing examples from programming languages

I think event-driven parsing is common in some languages when parsing data formats like XML, but less common for parsing programming languages. Two examples I’m aware of that apply the ideas to programming languages: