November 28, 2024 - Tagged as: en, parsing, rust.
In the previous post we looked at three different parsing APIs, and compared them for runtime and the use cases they support.
In this post we’ll add a lexer (or “tokenizer”), with two APIs, and for each lexer see how the parsers from the previous post perform when combined with the lexer.
What is a lexer? A lexer is very similar to the event parsers we saw in the previous post, but it doesn’t try to maintain any structure. It generates “tokens”, which are parts of the program that cannot be split into smaller parts. A lexer doesn’t care about parentheses or other delimiters being balanced, or that values in an array are separated by commas, or anything else. It simply splits the input into tokens.
Why is a lexer useful? If you already have an event parser, adding a lexer may not allow a lot of new use cases. The main use cases that I’m aware of are:
Syntax highlighting: when highlighting syntax we don’t care about the tree structure; we care about keywords, punctuation (list separators, dots in paths, etc.), delimiters (commas, braces, brackets), and literals. A lexer gives us exactly these and nothing else.
Supporting incremental parsing: one way to incrementally update an AST is to start re-lexing a few tokens (often just one) before the edited token and keep re-lexing past the edit location, until the lexer generates a token identical to an existing token again. AST nodes of modified tokens are then marked as “modified” and re-parsed.
The details are complicated, I recommend chapter 2 of this PhD thesis for an introduction to incremental parsing.
If you need to re-parse code as it gets edited, even if you don’t need or want incremental parsing, it makes sense to re-lex incrementally and then parse from scratch using the incrementally updated token list, because incremental lexing is so simple.
For separating complex parsing code into smaller parts: modern languages can have complicated literal syntax, with multiple string literal syntaxes with varying delimiters (like r#"..."# in Rust, or [=[...]=] in Lua), multiple variants of comments (single-line and multi-line, documentation and normal), multiple number syntaxes (with different suffixes like 123u32 in Rust, or underscores to separate digits for readability), and so on. A lexer separates the handling of these from the part of the parser that deals with the program structure.
Similar to the previous post, we will look at three different APIs for lexing:
tokenize_list
tokenize_iter
tokenize_push
For our simplified (and enhanced, with comments) JSON, our token type is:
pub enum Token {
Int(u64),
Str { size_in_bytes: usize },
True,
False,
Null,
LBracket,
RBracket,
LBrace,
RBrace,
Colon,
Comma,
Comment { size_in_bytes: usize },
}
Similar to our event type from the previous post, this type needs to be cheap to generate (ideally stack allocated).
The tokens are generated along with byte offsets in the input, also similar to events.
For the push API, the listener interface also directly follows the token type:
pub trait LexerEventListener {
fn handle_int(&mut self, byte_offset: usize, i: u64);
fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize);
// Similar for other token types.
...
fn handle_error(&mut self, byte_offset: usize);
}
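As an illustration (and because the push benchmarks below collect tokens into a vector), a listener might look roughly like the sketch below. The handlers for the remaining token types are elided just like in the trait above, and the actual benchmark code in the repo may differ:
struct CollectingListener {
    tokens: Vec<(usize, Token)>,
    error: Option<usize>,
}

impl LexerEventListener for CollectingListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) {
        self.tokens.push((byte_offset, Token::Int(i)));
    }
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) {
        self.tokens.push((byte_offset, Token::Str { size_in_bytes }));
    }
    // Handlers for the other token types push their tokens the same way.
    fn handle_error(&mut self, byte_offset: usize) {
        self.error = Some(byte_offset);
    }
}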
To keep things simple, the event handlers don’t return a bool
to stop parsing. It can be added in a few lines of code and it doesn’t affect performance.
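For reference, one way to add it (sketched here with a hypothetical trait name, not code from the repo) is to make each handler return whether lexing should continue, and check that after every call in tokenize_push:
pub trait StoppableLexerEventListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) -> bool;
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) -> bool;
    // The other handlers change the same way.
    fn handle_error(&mut self, byte_offset: usize);
}
// Each handler call inside the lexer then becomes something like:
//     if !listener.handle_int(byte_offset, int_value) {
//         return;
//     }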
Unlike the different types of event parsers from the previous post, implementations of these lexer APIs are almost identical. This is because the lexer has only one state, which is the current position in the input. A next
call in the iterator implementation simply continues from the current location in the input, and updates the current location as it reads characters from the input.
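To make that concrete, here is a sketch of the iterator-style lexer. It assumes the iterator yields (byte_offset, Token) pairs and only handles the single-character tokens; literal, keyword, comment, and error handling are elided, and the actual Lexer in the repo may differ:
pub struct Lexer<'a> {
    input: &'a str,
    pos: usize, // current byte offset: the only state the lexer carries
}

impl<'a> Iterator for Lexer<'a> {
    type Item = (usize, Token);

    fn next(&mut self) -> Option<(usize, Token)> {
        let bytes = self.input.as_bytes();
        // Skip whitespace.
        while self.pos < bytes.len() && bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        let start = self.pos;
        let token = match *bytes.get(self.pos)? {
            b'[' => Token::LBracket,
            b']' => Token::RBracket,
            b'{' => Token::LBrace,
            b'}' => Token::RBrace,
            b':' => Token::Colon,
            b',' => Token::Comma,
            // Scanning of integers, strings, keywords, and comments elided.
            _ => return None,
        };
        self.pos += 1;
        Some((start, token))
    }
}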
The entry points are:
pub fn tokenize_iter<'a>(input: &'a str) -> Lexer<'a> { ... }
// Lexer implements `Iterator`
pub fn tokenize_push<L: LexerEventListener>(input: &str, listener: &mut L) { ... }
pub fn tokenize_list(input: &str) -> Result<Vec<(usize, Token)>, usize> { ... }
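As a small usage sketch, this is roughly all a syntax highlighter needs from the lexer: token kinds and their byte offsets, with no tree structure. It again assumes the iterator yields (byte_offset, Token) pairs and ignores how lexing errors are reported:
fn print_token_kinds(input: &str) {
    for (byte_offset, token) in tokenize_iter(input) {
        let kind = match token {
            Token::Int(_) => "number",
            Token::Str { .. } => "string",
            Token::Comment { .. } => "comment",
            Token::True | Token::False | Token::Null => "keyword",
            Token::LBracket | Token::RBracket | Token::LBrace | Token::RBrace => "delimiter",
            Token::Colon | Token::Comma => "punctuation",
        };
        println!("{byte_offset}: {kind}");
    }
}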
We have 3 lexers and 2 event parsers, so 6 combinations in total:
However, (5) is not easily possible in Rust. The problem is that a push implementation cannot be converted into an iterator: it will scan the entire input without ever returning, calling the listener methods as it goes. To convert a push API into an iterator, we need a language feature that allows us to stop the current thread (or a “fiber”, green thread, etc.) and resume it later. In Rust, this is possible with async or with threads. Threads are expensive, and async requires a lot of refactoring and all the call sites to be made async as well.
So in this post we won’t consider this combination.
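For completeness, here is a sketch of the thread-based workaround: run the push lexer on its own thread and forward tokens through a bounded channel so the consumer can pull them like an iterator. The per-token channel send and the thread spawn are exactly the costs that make this unattractive. ChannelListener is hypothetical, and the remaining handlers are elided as before:
use std::sync::mpsc;
use std::thread;

struct ChannelListener {
    sender: mpsc::SyncSender<(usize, Token)>,
}

impl LexerEventListener for ChannelListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) {
        let _ = self.sender.send((byte_offset, Token::Int(i)));
    }
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) {
        let _ = self.sender.send((byte_offset, Token::Str { size_in_bytes }));
    }
    // Handlers for the other token types forward their tokens the same way.
    fn handle_error(&mut self, _byte_offset: usize) {}
}

fn tokens_via_thread(input: String) -> impl Iterator<Item = (usize, Token)> {
    let (sender, receiver) = mpsc::sync_channel(16);
    thread::spawn(move || {
        let mut listener = ChannelListener { sender };
        tokenize_push(&input, &mut listener);
    });
    receiver.into_iter()
}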
Implementing these combinations is mostly straightforward. Full code is linked below as usual. The takeaways are:
The code (including benchmarks) is here: github.com/osa1/how-to-parse-2.
Token generation benchmarks: Collect all of the tokens in a Vec.
In the event generation benchmarks in the last post, the push implementation is about 10% faster than the iterator. But in the lexer, the iterator is faster when collecting the tokens in a vector. It looks like when the state that the parser manages between the next
calls gets simpler, the compiler is able to optimize the code better, and the iterator implementation beats the push implementation.
The vector-generating implementation and the push implementation that adds the elements to a vector via the listener perform the same, which shows that, when monomorphised, the push implementation optimizes quite well for simple cases (and also for complex cases, as we will see below). In languages without monomorphisation, the push API should be slower.
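Concretely, the iterator-vs-push comparison above is roughly between these two ways of producing the same token vector (a sketch reusing the collecting listener from earlier; errors are ignored in the iterator variant):
fn tokens_via_iter(input: &str) -> Vec<(usize, Token)> {
    tokenize_iter(input).collect()
}

fn tokens_via_push(input: &str) -> Vec<(usize, Token)> {
    let mut listener = CollectingListener { tokens: Vec::new(), error: None };
    tokenize_push(input, &mut listener);
    listener.tokens
}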
Tokens to events: Convert tokens to events.
The first two benchmarks are the ones from the previous post that don’t use a lexer and generate events directly. The numbers are slightly different from the numbers in the previous post because I re-ran the benchmarks.
If you need some kind of incremental implementation, scanning the entire input and collecting the events or tokens in a vector performs badly. There’s no point in combining the list API with the push or iterator APIs.
What’s surprising is that the push lexer implementation combined with the push event generator implementation performs better than the event generator implementation that parses the input directly, without a lexer. I don’t have an explanation for this yet.
The lexer iterator implementation combined with any of the event generation implementations is slower than the event push implementation that parses the input directly, but about as fast as the event iterator implementation that parses the input directly.
Tokens to AST: Convert tokens to events, build an AST from the events.
The first three benchmarks below are from the last post, re-run and included here for comparison.
When we add an AST-building step, which is more complicated than the other steps, the performance difference between the most convenient implementation (tokenize_iter + events_iter to AST) and the most performant one (tokenize_push + events_push to AST) diminishes: in the event generation benchmark the fast one is 30% faster, but when building an AST it’s only 9% faster.
The push implementation is still faster than the recursive descent parser, even with the extra lexing step. I’m planning to investigate this further in a future post.