
Exploring parsing APIs: adding a lexer

November 28, 2024 - Tagged as: en, parsing, rust.

In the previous post we looked at three different parsing APIs and compared them on runtime performance and the use cases they support.

In this post we’ll add a lexer (or “tokenizer”) with three APIs, and see how the parsers from the previous post perform when combined with each lexer.

What is a lexer? A lexer is very similar to the event parsers we saw in the previous post, but it doesn’t try to maintain any structure. It generates “tokens”, which are parts of the program that cannot be split into smaller parts. A lexer doesn’t care about parentheses or other delimiters being balanced, or that values in an array are separated by commas, or anything else. It simply splits the input into tokens.

Why is a lexer useful? If you already have an event parser, adding a lexer may not enable many new use cases. The main use cases that I’m aware of are:

The APIs

Similar to the previous post, we will look at three different APIs for lexing: an iterator API, a push API (with a listener), and a list API that returns all of the tokens in a vector.

For our simplified JSON (enhanced with comments), our token type is:

pub enum Token {
    Int(u64),
    Str { size_in_bytes: usize },
    True,
    False,
    Null,
    LBracket,
    RBracket,
    LBrace,
    RBrace,
    Colon,
    Comma,
    Comment { size_in_bytes: usize },
}

Similar to our event type from the previous post, this type needs to be cheap to generate (ideally stack allocated).
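One way to sanity-check this is with a size assertion (my addition, not code from the repo; the exact number depends on the target and on how rustc lays out the enum):

#[test]
fn token_is_small() {
    // One machine word of payload (u64/usize) plus the discriminant,
    // padded to 8-byte alignment: 16 bytes on a typical 64-bit target.
    assert_eq!(std::mem::size_of::<Token>(), 16);
}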

The tokens are generated along with byte offsets in the input, also similar to events.
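As a concrete sketch (my own example; whether size_in_bytes includes the string’s quotes is an assumption), the deliberately unbalanced input [1, "hi"} lexes without complaint, since balancing is the parser’s job:

// Input:  [1, "hi"}
//
// Token stream, as (byte_offset, Token) pairs:
//
//   (0, LBracket)
//   (1, Int(1))
//   (2, Comma)
//   (4, Str { size_in_bytes: 4 })
//   (8, RBrace)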

For the push API, the listener interface also directly follows the token type:

pub trait LexerEventListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64);

    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize);

    // Similar for other token types.
    ...

    fn handle_error(&mut self, byte_offset: usize);
}

To keep things simple, the event handlers don’t return a bool to stop lexing early. That can be added in a few lines of code, and it doesn’t affect performance.
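As an example, here is a listener that collects tokens into a vector, which is essentially what the list API below does internally (a sketch of mine, not code from the repo):

#[derive(Default)]
pub struct CollectTokens {
    pub tokens: Vec<(usize, Token)>,
    pub error: Option<usize>,
}

impl LexerEventListener for CollectTokens {
    fn handle_int(&mut self, byte_offset: usize, i: u64) {
        self.tokens.push((byte_offset, Token::Int(i)));
    }

    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) {
        self.tokens.push((byte_offset, Token::Str { size_in_bytes }));
    }

    // Same pattern for the other token types.
    ...

    fn handle_error(&mut self, byte_offset: usize) {
        self.error = Some(byte_offset);
    }
}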

Unlike the different types of event parsers from the previous post, implementations of these lexer APIs are almost identical. This is because the lexer has only one piece of state: the current position in the input. A next call in the iterator implementation simply continues from the current position, advancing it as it reads characters from the input.
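To make the single piece of state concrete, here is roughly what the iterator implementation looks like (a sketch; I’m assuming the iterator yields Result<(usize, Token), usize>, with the error carrying the byte offset):

pub struct Lexer<'a> {
    input: &'a [u8],
    pos: usize, // the only state: byte offset of the next unread byte
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<(usize, Token), usize>;

    fn next(&mut self) -> Option<Self::Item> {
        // Continue from where the previous `next` call left off.
        while self.pos < self.input.len() && self.input[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        let start = self.pos;
        match *self.input.get(self.pos)? {
            b'[' => { self.pos += 1; Some(Ok((start, Token::LBracket))) }
            b']' => { self.pos += 1; Some(Ok((start, Token::RBracket))) }
            b':' => { self.pos += 1; Some(Ok((start, Token::Colon))) }
            // ... digits, strings, `true`/`false`/`null`, comments ...
            _ => Some(Err(start)),
        }
    }
}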

The entry points are:

// `Lexer` implements `Iterator`.
pub fn tokenize_iter<'a>(input: &'a str) -> Lexer<'a> { ... }

pub fn tokenize_push<L: LexerEventListener>(input: &str, listener: &mut L) { ... }

// On error, returns the byte offset of the error.
pub fn tokenize_list(input: &str) -> Result<Vec<(usize, Token)>, usize> { ... }
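Usage of the three entry points, with the CollectTokens listener from above (a sketch; assumes Token derives Debug):

fn main() {
    let input = r#"{"xs": [1, 2]} // a comment"#;

    // Iterator API: pull tokens one at a time.
    for result in tokenize_iter(input) {
        println!("{:?}", result);
    }

    // Push API: the lexer drives our listener.
    let mut listener = CollectTokens::default();
    tokenize_push(input, &mut listener);

    // List API: all tokens up front, or the byte offset of the first error.
    match tokenize_list(input) {
        Ok(tokens) => println!("{} tokens", tokens.len()),
        Err(offset) => println!("lexing error at byte offset {}", offset),
    }
}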

Combining with the event parsers

We have 3 lexers and 2 event parsers, so 6 combinations in total:

  1. tokenize_list + parse_events_iter
  2. tokenize_list + parse_events_push
  3. tokenize_iter + parse_events_iter
  4. tokenize_iter + parse_events_push
  5. tokenize_push + parse_events_iter
  6. tokenize_push + parse_events_push

However, (5) is not easily possible in Rust. The problem is that a push implementation cannot be converted into an iterator: it scans the entire input without ever returning, calling the listener methods as it goes. To convert a push API into an iterator, we need a language feature that allows us to suspend the current thread (or a “fiber”, green thread, etc.) and resume it later. In Rust this is possible with async or with threads. Threads are expensive, and async requires a lot of refactoring and forces all call sites to become async as well.

So in this post we won’t consider this combination.
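The opposite direction is easy: an iterator can always drive a push listener with a plain loop, which is why the other combinations are unproblematic. A sketch, assuming the item type from the iterator sketch above:

fn drive_listener<L: LexerEventListener>(input: &str, listener: &mut L) {
    for result in tokenize_iter(input) {
        match result {
            Ok((offset, Token::Int(i))) => listener.handle_int(offset, i),
            Ok((offset, Token::Str { size_in_bytes })) => {
                listener.handle_str(offset, size_in_bytes)
            }
            // ... same pattern for the other token types ...
            Ok(_) => {}
            Err(offset) => {
                listener.handle_error(offset);
                return; // stop at the first lexing error
            }
        }
    }
}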

Notes on implementations

Implementing these combinations is mostly straightforward. Full code is linked below as usual. The takeaways are:

References and benchmarks

The code (including benchmarks) is here: github.com/osa1/how-to-parse-2.

Token generation benchmarks: Collect all of the tokens in a Vec.

In the event generation benchmarks in the last post, the push implementation is about 10% faster than the iterator. But in the lexer, the iterator is faster when collecting the tokens into a vector. It looks like when the state the parser maintains between next calls gets simpler, the compiler is able to optimize the code better, and the iterator implementation beats the push implementation.

The list implementation and the push implementation (with a listener that appends the tokens to a vector) perform the same, which shows that, when monomorphised, the push implementation optimizes quite well in simple cases (and in complex cases too, as we will see below). In languages without monomorphisation, the push API should be slower.

Tokens to events: Convert tokens to events.

The first two benchmarks are the ones from the previous post that don’t use a lexer and generate events directly. The numbers are slightly different from those in the previous post because I reran the benchmarks.

If you need some kind of incremental implementation, scanning the entire input and collecting the events or tokens in a vector performs poorly. There’s no point in combining the list API with the push or iterator APIs.

What’s surprising is that the push lexer combined with the push event generator performs better than the event generator that parses the input directly, without a lexer. I don’t have an explanation for this yet.

The lexer iterator implementation combined with either of the event generation implementations is slower than the event push implementation that parses the input directly, but about as fast as the event iterator implementation that parses the input directly.

Tokens to AST: Convert tokens to events, build an AST from the events.

The first three benchmarks below are from the last post, rerun and included here for comparison.

When we add an AST-building step, which is more complicated than the rest of the steps, the performance difference between the most convenient implementation (tokenize_iter + events_iter to AST) and the most performant one (tokenize_push + events_push to AST) diminishes. In the event generation benchmark the fastest one is 30% faster, but when building an AST it’s only 9% faster.

The push implementation is still faster than the recursive descent parser, even with the extra lexing step. I’m planning to investigate this further in a future post.