November 22, 2024 - Tagged as: en, parsing, rust.
Consider a simplified and enhanced version of JSON, with these changes:
// ...
When parsing a language like this, a common first step is to define an “abstract syntax tree” (AST) with only the details we want from the parser output.
For example, if we’re implementing a tool like jq, the AST may look like:
enum Json {
Int(u64),
Str(String),
Bool(bool),
Array(Vec<Json>),
Object(Vec<(String, Json)>),
Null,
}
This type is called an “abstract” syntax tree because it abstracts the unnecessary details from the parse output. In our tool we don’t need locations of nodes and comments, so the AST doesn’t contain them.
It’s easy to implement a parser for this AST: we iterate over the input, skip whitespace and comments, then, based on the next character, decide what type of node (integer, string, etc.) to parse. For nested Json nodes in arrays and objects, we call the parser recursively.
This kind of parser is called a “recursive descent parser”. For our AST above, the parser looks like this:
// The entry point: parses all of the input to JSON.
pub fn parse(input: &str) -> Result<Json, ParseError> {
let mut iter = input.char_indices().peekable();
let (_, json) = parse_single(&mut iter, input)?;
skip_trivia(&mut iter)?;
// Check that all of the input is consumed.
...
}
// Parse a single Json. After parsing, the input may have more characters to be parsed.
fn parse_single(
iter: &mut Peekable<CharIndices>,
input: &str,
) -> Result<(usize, Json), ParseError> {
// Skip whitespace and comments.
skip_trivia(iter)?;
// Get next character.
let (byte_offset, char) = match iter.next() { ... };
if char == '[' {
// Parse an array. Call `parse_single` recursively for elements.
...
}
if char == '{' {
// Parse an object. Call `parse_single` recursively for values.
...
}
if char == 't' {
// Parse keyword "true".
...
}
// Same for other keywords, integers, strings.
...
}
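The skip_trivia helper is elided above. Here is a minimal sketch of what it could look like, assuming the ParseError shape (byte_offset and reason fields) used later in the post; the actual implementation in the repository may differ:
use std::iter::Peekable;
use std::str::CharIndices;

// Skip whitespace and "//" comments. The AST parser doesn't need comments,
// so they are simply discarded here.
fn skip_trivia(iter: &mut Peekable<CharIndices>) -> Result<(), ParseError> {
    while let Some((byte_offset, char)) = iter.peek().copied() {
        if char.is_ascii_whitespace() {
            iter.next();
        } else if char == '/' {
            iter.next();
            // A comment starts with "//" and runs until the end of the line.
            if !matches!(iter.next(), Some((_, '/'))) {
                return Err(ParseError {
                    byte_offset,
                    reason: "invalid comment",
                });
            }
            while let Some((_, c)) = iter.next() {
                if c == '\n' {
                    break;
                }
            }
        } else {
            break;
        }
    }
    Ok(())
}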
While very common, this kind of parser is inflexible, and slower than more flexible alternatives for many use cases.
Consider these use cases:
A JSON formatter: a formatter needs to know about comments to be able to keep them in the formatted code. To support this use case, the AST needs to include comments too, which will make it larger, and parsing will be less efficient for the applications that don’t need comments.
A configuration file parser for a text editor: to be able to show error locations in configuration errors (such as an invalid value used for a setting), the AST will have to include source locations. Similar to above, this will make the AST larger and slower to parse for other applications that don’t need source locations.
An RPC server that looks at the command name in incoming JSON messages and relays the messages based on it: the server doesn’t even need a full parser; a parser that can keep track of the nesting level, so that it can extract the command name field at the right level, will suffice. Using a full AST parser would parse the whole message and be inefficient.
A log sorting tool that reads a file with one JSON log per line and sorts the lines based on the top-level “timestamp” field values. Similar to the use case above, this tool only needs to read one field, so parsing whole lines is wasteful.
A well-known solution to these problems is to introduce a lower-level parser that doesn’t generate fully structured output like an AST, but a stream of “parse events”. These events should be general enough to support different use cases like the ones listed above, and they should be cheap to allocate and pass around, ideally as stack-allocated values, so that applications that don’t need some of them can skip them efficiently.
This type of parsing is often called “event driven parsing”. In our JSON variant, the events look like this:
/// A parse event, with location of the event in the input.
pub struct ParseEvent {
pub kind: ParseEventKind,
pub byte_offset: usize,
}
/// Details of a parse event.
pub enum ParseEventKind {
StartObject,
EndObject,
StartArray,
EndArray,
Int(u64),
Str {
/// Size of the string, not including the double quotes.
size_in_bytes: usize,
},
Bool(bool),
Null,
Comment {
/// Size of the comment, including the "//" at the beginning and the newline at the end.
size_in_bytes: usize,
},
}
Note that there’s no heap allocation required for these events. Contents of strings and comments can be obtained by slicing the input using the event location and size_in_bytes field.
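For example, a consumer could recover the text of a string event by slicing the input like this (a small sketch; it assumes, as in the parse_timestamp example below, that byte_offset points at the first byte after the opening quote):
fn str_contents<'a>(input: &'a str, event: &ParseEvent) -> Option<&'a str> {
    match event.kind {
        ParseEventKind::Str { size_in_bytes } => {
            Some(&input[event.byte_offset..event.byte_offset + size_in_bytes])
        }
        _ => None,
    }
}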
When generating these events, it’s important that we don’t scan the whole input and collect all of the events in a list, as that would mean some users, like our RPC server and log sorter examples above, would have to do more work than necessary.
This means that the parser will have to be stateful: after returning an event, it needs to be able to continue from the last event location. This complicates the parser implementation quite a bit. Here’s what the parser looks like at a high level:
// The entry point. Use via the `Iterator` interface.
pub fn parse_events(input: &str) -> EventParser {
EventParser::new(input)
}
// The parser state.
pub struct EventParser<'a> {
input: &'a str,
byte_offset: usize,
container_stack: Vec<Container>,
state: ParserState,
}
enum Container {
Array,
Object,
}
enum ParserState {
/// Parse any kind of object, update state based on the current container.
TopLevel,
/// Finished parsing a top-level object, expect end-of-input.
Done,
/// Parsing an object, parse another element on ',', or finish the object on '}'.
ObjectExpectComma,
/// Parsing an object, parse the first key-value pair, or finish the object on '}'.
ObjectExpectKeyValue,
/// Parsing an object and we've just parsed a key, expect ':'.
ObjectExpectColon,
/// Parsing an array, parse another element on ',', or finish the array on ']'.
ArrayExpectComma,
}
impl<'a> Iterator for EventParser<'a> {
type Item = Result<ParseEvent, ParseError>;
fn next(&mut self) -> Option<Self::Item> {
match self.state {
ParserState::TopLevel => self.top_level(),
ParserState::Done => self.done(),
ParserState::ObjectExpectComma => self.object_expect_comma(),
ParserState::ObjectExpectKeyValue => self.object_expect_key_value(),
ParserState::ObjectExpectColon => self.object_expect_colon(),
ParserState::ArrayExpectComma => self.array_expect_comma(),
}
}
}
...
The main complexity of this parser comes from the fact that it cannot return an event and keep running: the caller needs to call the relevant method (next from the Iterator trait above) to keep parsing. To be able to continue from where it left off, the parser needs to maintain some state outside of the parse functions.
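To make this concrete, here is a hedged sketch of one of the state methods; the skip_whitespace helper is hypothetical (the real method also has to report comments as events), and the ParseError shape is the one used later in the post:
impl<'a> EventParser<'a> {
    // After an array element, either ',' starts another element or ']' closes
    // the current array.
    fn array_expect_comma(&mut self) -> Option<Result<ParseEvent, ParseError>> {
        self.skip_whitespace(); // hypothetical helper: advances `self.byte_offset`
        let byte_offset = self.byte_offset;
        match self.input[byte_offset..].chars().next() {
            Some(',') => {
                self.byte_offset += 1;
                self.state = ParserState::TopLevel;
                self.next() // parse the next element
            }
            Some(']') => {
                self.byte_offset += 1;
                self.container_stack.pop();
                // The next state depends on the enclosing container, if any.
                self.state = match self.container_stack.last() {
                    None => ParserState::Done,
                    Some(Container::Array) => ParserState::ArrayExpectComma,
                    Some(Container::Object) => ParserState::ObjectExpectComma,
                };
                Some(Ok(ParseEvent {
                    kind: ParseEventKind::EndArray,
                    byte_offset,
                }))
            }
            _ => Some(Err(ParseError {
                byte_offset,
                reason: "expected ',' or ']'",
            })),
        }
    }
}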
This parser is general enough to allow implementing our original AST parser:
pub fn event_to_tree<I: Iterator<Item = Result<ParseEvent, ParseError>>>(
parser: &mut I,
input: &str,
) -> Result<Json, ParseError> {
let mut container_stack: Vec<Container> = vec![];
let mut current_container: Option<Container> = None;
let mut parsed_object: Option<Json> = None;
for event in parser.by_ref() {
match event {
...
}
}
Ok(parsed_object.unwrap())
}
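Note that the Container used here is not the parser’s payload-free Container: it has to carry the partially built value. A plausible shape (hypothetical; the version in the repository may differ):
enum Container {
    Array(Vec<Json>),
    Object {
        fields: Vec<(String, Json)>,
        // Set when a key has been parsed, consumed by the following value.
        pending_key: Option<String>,
    },
}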
But it also allows parsing to an AST with comments (for our formatter), parsing to an AST with source locations (for our configuration parser), and implementing our RPC server and log sorter. Here’s what the timestamp parser, which stops after finding the field, looks like:
/// Parse the "timestamp" field at the top-level map of the JSON.
pub fn parse_timestamp(log_line: &str) -> Result<Option<u64>, ParseError> {
let mut container_depth: u32 = 0;
let mut expect_timestamp = false;
for event in parse_events(log_line) {
let ParseEvent { kind, byte_offset } = match event {
Ok(event) => event,
Err(err) => return Err(err),
};
let expect_timestamp_ = expect_timestamp;
expect_timestamp = false;
match kind {
ParseEventKind::StartObject => {
container_depth += 1;
}
ParseEventKind::EndObject => {
container_depth -= 1;
}
ParseEventKind::StartArray => {
if container_depth == 0 {
// Array at the top level, the line does not contain the field.
return Ok(None);
}
container_depth += 1;
}
ParseEventKind::EndArray => {
container_depth -= 1;
}
ParseEventKind::Str { size_in_bytes } => {
if container_depth != 1 {
continue;
}
let str = &log_line[byte_offset..byte_offset + size_in_bytes];
expect_timestamp = str == "timestamp";
}
ParseEventKind::Int(i) => {
if expect_timestamp_ {
return Ok(Some(i));
}
}
ParseEventKind::Bool(_)
| ParseEventKind::Null
| ParseEventKind::Comment { .. } => {}
}
}
Ok(None)
}
A nice property of this parser is that it does not allocate at all. It doesn’t build an AST (so no heap-allocated vectors), and parse events are 24-byte, stack-allocated values. The event parser itself is also stack-allocated by this function.
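For illustration, here is how the function might be used on a couple of made-up log lines:
fn example() -> Result<(), ParseError> {
    let line = r#"{"timestamp": 1732000000, "msg": "hello", "ctx": {"timestamp": 1}}"#;
    // Only the top-level "timestamp" field is considered.
    assert_eq!(parse_timestamp(line)?, Some(1732000000));
    // A line with a non-object at the top level yields `None`.
    assert_eq!(parse_timestamp("[1, 2, 3]")?, None);
    Ok(())
}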
An alternative design, which is slightly less flexible and more difficult to use but easier to implement and faster, is what’s sometimes called a “push parser”.
The idea is that, instead of returning one event at a time, the parser takes a “listener” argument, and calls the listener callbacks for each event generated. The listener type directly follows our event type above:
// Methods return a `bool` indicating whether to continue parsing after the event.
pub trait EventListener {
fn handle_start_object(&mut self, _byte_offset: usize) -> bool {
true
}
fn handle_end_object(&mut self, _byte_offset: usize) -> bool {
true
}
fn handle_start_array(&mut self, _byte_offset: usize) -> bool {
true
}
fn handle_end_array(&mut self, _byte_offset: usize) -> bool {
true
}
fn handle_int(&mut self, _byte_offset: usize, _i: u64) -> bool {
true
}
fn handle_str(&mut self, _byte_offset: usize, _size_in_bytes: usize) -> bool {
true
}
fn handle_bool(&mut self, _byte_offset: usize, _b: bool) -> bool {
true
}
fn handle_null(&mut self, _byte_offset: usize) -> bool {
true
}
fn handle_comment(&mut self, _byte_offset: usize, _size_in_bytes: usize) -> bool {
true
}
fn handle_error(&mut self, _error: ParseError);
}
The parser:
// The entry point. Parse all of the input, call `listener` with the events.
pub fn parse<L: EventListener>(input: &str, listener: &mut L) {
let mut iter = input.char_indices().peekable();
let input_size = input.len();
// Parse a single JSON.
if !parse_single(&mut iter, input_size, listener) {
return;
}
// Check that all of the input is consumed.
...
}
// Returns whether to continue parsing: `false` when an error was reported or the listener requested to stop.
fn parse_single<L: EventListener>(
iter: &mut Peekable<CharIndices>,
input_size: usize,
listener: &mut L,
) -> bool {
// Skip whitespace and comments, generate events for comments.
skip_trivia!(iter, listener);
// Get next character.
let (byte_offset, char) = match iter.next() {
Some(next) => next,
None => {
listener.handle_error(ParseError {
byte_offset: input_size,
reason: "unexpected end of input",
});
return false;
}
};
if char == '[' {
// Parse an array. Call `parse_single` recursively for elements.
...
}
if char == '{' {
// Parse an object. Call `parse_single` recursively for values.
...
}
if char == 't' {
// Parse keyword "true".
...
}
// Same for other keywords, integers, strings.
...
}
Note that the parser functions are identical (in terms of names and what they do) to our simple recursive descent parser. This is because the parser no longer needs to maintain state to be able to return and continue from where it left off, as it does all of the work in one go. Instead of building an AST or a list of events, it takes an EventListener argument and calls the handle methods.
This is a bit less convenient to use, but it’s still flexible enough to build an AST. An EventListener implementation that builds up a Json AST looks like this:
pub struct AstBuilderListener<'a> {
input: &'a str,
container_stack: Vec<Container>,
current_container: Option<Container>,
parsed_object: Option<Json>,
error: Option<ParseError>,
}
impl<'a> EventListener for AstBuilderListener<'a> {
...
}
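Since every handle method except handle_error has a default implementation, simpler listeners stay short. For example, a hypothetical listener that only records comment locations (useful for the formatter use case) could look like the sketch below, and would be driven with parse(input, &mut collector):
pub struct CommentCollector {
    /// (byte_offset, size_in_bytes) of each comment.
    pub comments: Vec<(usize, usize)>,
    pub error: Option<ParseError>,
}

impl EventListener for CommentCollector {
    fn handle_comment(&mut self, byte_offset: usize, size_in_bytes: usize) -> bool {
        self.comments.push((byte_offset, size_in_bytes));
        true
    }

    fn handle_error(&mut self, error: ParseError) {
        self.error = Some(error);
    }
}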
However, if you need to be able to stop parsing and continue later, this parser can’t do that: it parses the whole input in one call.
The main advantage of this parser is that, with the right programming language and parser design, it can be faster than the alternatives, while still being flexible enough for most use cases. See below for benchmarks.
Our ParseEvent type has no nested data and looks like what we could define as the “tokens” in a parser for a programming language. So it shouldn’t be surprising that we can use a lexer generator to implement a parse event generator:
// Same `parse_events` as above, but uses a generated lexer.
pub fn parse_events(input: &str) -> LexgenIteratorAdapter {
LexgenIteratorAdapter {
lexer: Lexer::new(input),
}
}
// An adapter is necessary to convert lexgen values to `parse_events` items.
pub struct LexgenIteratorAdapter<'a> {
lexer: Lexer<'a, std::str::Chars<'a>>,
}
impl<'a> Iterator for LexgenIteratorAdapter<'a> {
type Item = Result<ParseEvent, ParseError>;
fn next(&mut self) -> Option<Self::Item> {
...
}
}
struct LexerState {
container_stack: Vec<Container>,
}
lexgen::lexer! {
Lexer(LexerState) -> ParseEvent;
type Error = &'static str;
let comment = "//" (_ # '\n')* '\n';
rule Init {
$$ascii_whitespace,
$comment => comment,
'[' => ...,
']' => ...,
'{' => ...,
"true" => ...,
"false" => ...,
"null" => ...,
['0'-'9']+ => ...,
'"' (_ # '"')* '"' => ...
}
rule Done { ... }
rule ArrayExpectComma { ... }
rule ObjectExpectKeyValue { ... }
rule ObjectExpectColon { ... }
rule ObjectExpectComma { ... }
}
This uses lexgen. lexgen generates slightly different values than what we want, so we use the adapter shown above in the entry point to convert the lexgen values.
The main difference between an event parser and a lexer is that an event parser maintains some of the structure of the parsed format. For example, we check that brackets are balanced, that a colon follows a key in a map, and so on.
A lexer generator can be used to implement an event parser, as demonstrated above.
All of the code in this blog post, and more, is here: github.com/osa1/how-to-parse.
In the benchmark program (run with cargo bench), we generate a 10 MB JSON document and parse it to either the AST or a vector of events.
AST building benchmarks:
Recursive descent: the recursive descent parser that generates the AST. Throughput: 128 Mb/s.
Event generator to AST: the iterator-style event generator, with events processed by event_to_tree to build the AST. Throughput: 138 Mb/s.
Lexgen event to AST: same as above, but the event parser is implemented with lexgen. Throughput: 106 Mb/s.
Push event parser to AST: the “push” event parser, with AstBuilderListener as the event listener. Throughput: 147 Mb/s.
Event generation benchmarks (events collected into a Vec):
Parse events: the iterator-style event generator. Throughput: 274 Mb/s.
Parse events lexgen: the lexgen-generated event generator. Throughput: 179 Mb/s.
Parse events via push: the push event parser, with events added to a Vec by the listener. Throughput: 304 Mb/s.
Notes:
The lexgen-generated event parser is the slowest, but I think it should be possible to make it perform at least as well as the hand-written one. So far I’ve spent very little time optimizing lexgen’s code generator.
The push-based implementation is faster than the iterator-style implementation, both for collecting events into a list and for building an AST.
The main advantage of the push-based implementation is that its control flow is as simple as in recursive descent parsing (state is kept within the parse functions, as opposed to externally in a struct), and it does all of the parsing in one go. It looks like managing the parser state externally in a struct is not free.
I think the tradeoffs between the push-based and iterator implementations will be different in most high-level languages without control over allocations and monomorphisation.
In the Rust implementation, events are stack allocated values, which will be heap-allocated objects in some of the other languages.
In the push-based implementation, the parser is monomorphised based on the listener type. Both the listener and parser are stack allocated. All event handler method calls are direct calls (as opposed to virtual, or via some other dynamic invocation method), which can be inlined. None of these will be the case in, e.g., Haskell and Dart.
It would be interesting to implement the same in some other languages to see how they perform relative to each other.
I’m not sure why the recursive descent parser is not at least as fast as the push-based implementation, and not faster than the iterator-style one. If you have any insights into this, please let me know.
The use cases described at the beginning of the post are all extracted from real-world use cases of various other formats.
Here are more use cases that require flexible and fast parser design:
“Outline” views in text editors or online code browsing tools may want to process only top-level definitions, and definitions nested in class, impl, and similar blocks. Parsing the whole file to an AST would be inefficient.
Syntax-aware code search tools like sg can implement searching in only identifiers, string literals, and comments with an event-based parser. This could also be implemented with a lexer.
As mentioned in a previous post, ideally a formatter, language server, compiler, and refactoring tools should reuse as much parsing code as possible. It’s difficult to do this with an AST parser, as the AST would have too much information for each of these tools. Event-based parsing makes this easier.
I think event-driven parsing is common in some languages when parsing data formats like XML, but less common for parsing programming languages. Two examples that I’m aware of that apply the idea to programming languages:
rust-analyzer’s parser is a hand-written one that generates events. The architecture documentation mentions that Kotlin’s parser uses a similar idea:
It is a hand-written recursive descent parser, which produces a sequence of events like “start node X”, “finish node Y”. It works similarly to kotlin’s parser, which is a good source of inspiration for dealing with syntax errors and incomplete input
Dart’s parser uses the push-based API. This parser is the only Dart language parser used by the SDK. It’s used by the analyzer, language server, compilers, and anything else that the SDK includes.