November 28, 2024 - Tagged as: en, parsing, rust.
In the previous post we looked at three different parsing APIs, and compared them for runtime and the use cases they support.
In this post we’ll add a lexer (or “tokenizer”), with two APIs, and for each lexer see how the parsers from the previous post perform when combined with the lexer.
What is a lexer? A lexer is very similar to the event parsers we saw in the previous post, but it doesn’t try to maintain any structure. It generates “tokens”, which are parts of the program that cannot be split into smaller parts. A lexer doesn’t care about parentheses or other delimiters being balanced, or that values in an array are separated by commas, or anything else. It simply splits the input into tokens.
Why is a lexer useful? If you already have an event parser, adding a lexer may not allow a lot of new use cases. The main use cases that I’m aware of are:
Syntax highlighting: when highlighting syntax we don’t care about the tree structure; we care about keywords, punctuation (list separators, dots in paths, etc.), delimiters (commas, braces, brackets), and literals. A lexer gives us exactly these and nothing else.
Supporting incremental parsing: one way to incrementally update an AST is to start re-lexing a few tokens (often just one) before the edited token and keep re-lexing past the edit location, until the lexer generates a token identical to an existing token again. AST nodes of modified tokens are then marked as “modified” and re-parsed.
The details are complicated, I recommend chapter 2 of this PhD thesis for an introduction to incremental parsing.
If you need to re-parse code as it gets edited, even if you don’t need or want incremental parsing, it makes sense to re-lex incrementally and then parse from scratch using the incrementally updated token list, because incremental lexing is so simple.
For separating complex parsing code into smaller parts: modern languages can have complicated literal syntax, with multiple string literal syntaxes with varying delimiters (like r#"..."# in Rust, or [=[...]=] in Lua), multiple variants of comments (single-line and multi-line, documentation and normal), multiple number syntaxes (with different suffixes like 123u32 in Rust, or underscores to separate digits for readability), and so on. A lexer separates the handling of these from the part of the parser that deals with the program structure.
Similar to the previous post, we will look at three different APIs for lexing:
tokenize_list
tokenize_iter
tokenize_push
For our simplified (and enhanced, with comments) JSON, our token type is:
pub enum Token {
Int(u64),
Str { size_in_bytes: usize },
True,
False,
Null,
LBracket,
RBracket,
LBrace,
RBrace,
Colon,
Comma,
Comment { size_in_bytes: usize },
}
Similar to our event type from the previous post, this type needs to be cheap to generate (ideally stack allocated).
The tokens are generated along with byte offsets in the input, also similar to events.
For the push API, the listener interface also directly follows the token type:
pub trait LexerEventListener {
fn handle_int(&mut self, byte_offset: usize, i: u64);
fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize);
// Similar for other token types.
...
fn handle_error(&mut self, byte_offset: usize);
}
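As an illustration (and because the push benchmarks below collect tokens into a vector), a listener might look roughly like the sketch below. The handlers for the remaining token types are elided just like in the trait above, and the actual benchmark code in the repo may differ:
struct CollectingListener {
    tokens: Vec<(usize, Token)>,
    error: Option<usize>,
}

impl LexerEventListener for CollectingListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) {
        self.tokens.push((byte_offset, Token::Int(i)));
    }
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) {
        self.tokens.push((byte_offset, Token::Str { size_in_bytes }));
    }
    // Handlers for the other token types push their tokens the same way.
    fn handle_error(&mut self, byte_offset: usize) {
        self.error = Some(byte_offset);
    }
}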
To keep things simple, the event handlers don’t return a bool
to stop parsing. It can be added in a few lines of code and it doesn’t affect performance.
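For reference, one way to add it (sketched here with a hypothetical trait name, not code from the repo) is to make each handler return whether lexing should continue, and check that after every call in tokenize_push:
pub trait StoppableLexerEventListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) -> bool;
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) -> bool;
    // The other handlers change the same way.
    fn handle_error(&mut self, byte_offset: usize);
}
// Each handler call inside the lexer then becomes something like:
//     if !listener.handle_int(byte_offset, int_value) {
//         return;
//     }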
Unlike the different types of event parsers from the previous post, implementations of these lexer APIs are almost identical. This is because the lexer has only one state, which is the current position in the input. A next
call in the iterator implementation simply continues from the current location in the input, and updates the current location as it reads characters from the input.
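To make that concrete, here is a sketch of the iterator-style lexer. It assumes the iterator yields (byte_offset, Token) pairs and only handles the single-character tokens; literal, keyword, comment, and error handling are elided, and the actual Lexer in the repo may differ:
pub struct Lexer<'a> {
    input: &'a str,
    pos: usize, // current byte offset: the only state the lexer carries
}

impl<'a> Iterator for Lexer<'a> {
    type Item = (usize, Token);

    fn next(&mut self) -> Option<(usize, Token)> {
        let bytes = self.input.as_bytes();
        // Skip whitespace.
        while self.pos < bytes.len() && bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        let start = self.pos;
        let token = match *bytes.get(self.pos)? {
            b'[' => Token::LBracket,
            b']' => Token::RBracket,
            b'{' => Token::LBrace,
            b'}' => Token::RBrace,
            b':' => Token::Colon,
            b',' => Token::Comma,
            // Scanning of integers, strings, keywords, and comments elided.
            _ => return None,
        };
        self.pos += 1;
        Some((start, token))
    }
}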
The entry points are:
pub fn tokenize_iter<'a>(input: &'a str) -> Lexer<'a> { ... }
// Lexer implements `Iterator`
pub fn tokenize_push<L: LexerEventListener>(input: &str, listener: &mut L) { ... }
pub fn tokenize_list(input: &str) -> Result<Vec<(usize, Token)>, usize> { ... }
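As a small usage sketch, this is roughly all a syntax highlighter needs from the lexer: token kinds and their byte offsets, with no tree structure. It again assumes the iterator yields (byte_offset, Token) pairs and ignores how lexing errors are reported:
fn print_token_kinds(input: &str) {
    for (byte_offset, token) in tokenize_iter(input) {
        let kind = match token {
            Token::Int(_) => "number",
            Token::Str { .. } => "string",
            Token::Comment { .. } => "comment",
            Token::True | Token::False | Token::Null => "keyword",
            Token::LBracket | Token::RBracket | Token::LBrace | Token::RBrace => "delimiter",
            Token::Colon | Token::Comma => "punctuation",
        };
        println!("{byte_offset}: {kind}");
    }
}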
We have 3 lexers and 2 event parsers, so 6 combinations in total:
However, (5) is not easily possible in Rust. The problem is that a push implementation cannot be converted into an iterator: it will scan the entire input without ever returning, calling the listener methods as it goes. To convert a push API into an iterator, we need a language feature that allows us to stop the current thread (or a “fiber”, green thread, etc.) and resume it later. In Rust, this is possible with async or with threads. Threads are expensive, and async requires a lot of refactoring and all the call sites to be made async as well.
So in this post we won’t consider this combination.
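For completeness, here is a sketch of the thread-based workaround: run the push lexer on its own thread and forward tokens through a bounded channel so the consumer can pull them like an iterator. The per-token channel send and the thread spawn are exactly the costs that make this unattractive. ChannelListener is hypothetical, and the remaining handlers are elided as before:
use std::sync::mpsc;
use std::thread;

struct ChannelListener {
    sender: mpsc::SyncSender<(usize, Token)>,
}

impl LexerEventListener for ChannelListener {
    fn handle_int(&mut self, byte_offset: usize, i: u64) {
        let _ = self.sender.send((byte_offset, Token::Int(i)));
    }
    fn handle_str(&mut self, byte_offset: usize, size_in_bytes: usize) {
        let _ = self.sender.send((byte_offset, Token::Str { size_in_bytes }));
    }
    // Handlers for the other token types forward their tokens the same way.
    fn handle_error(&mut self, _byte_offset: usize) {}
}

fn tokens_via_thread(input: String) -> impl Iterator<Item = (usize, Token)> {
    let (sender, receiver) = mpsc::sync_channel(16);
    thread::spawn(move || {
        let mut listener = ChannelListener { sender };
        tokenize_push(&input, &mut listener);
    });
    receiver.into_iter()
}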
Implementing these combinations is mostly straightforward. Full code is linked below as usual. The takeaways are:
The code (including benchmarks) is here: github.com/osa1/how-to-parse-2.
Token generation benchmarks: Collect all of the tokens in a Vec.
In the event generation benchmarks in the last post, the push implementation is about 10% faster than the iterator. But in the lexer, the iterator is faster when collecting the tokens in a vector. It looks like when the state that the parser manages between the next
calls gets simpler, the compiler is able to optimize the code better, and the iterator implementation beats the push implementation.
The vector-generating implementation and the push implementation that adds the elements to a vector via the listener perform the same, which shows that, when monomorphised, the push implementation optimizes quite well for simple cases (and also for complex cases, as we will see below). In languages without monomorphisation, the push API should be slower.
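Concretely, the iterator-vs-push comparison above is roughly between these two ways of producing the same token vector (a sketch reusing the collecting listener from earlier; errors are ignored in the iterator variant):
fn tokens_via_iter(input: &str) -> Vec<(usize, Token)> {
    tokenize_iter(input).collect()
}

fn tokens_via_push(input: &str) -> Vec<(usize, Token)> {
    let mut listener = CollectingListener { tokens: Vec::new(), error: None };
    tokenize_push(input, &mut listener);
    listener.tokens
}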
Tokens to events: Convert tokens to events.
The first two benchmarks are the ones from the previous post that don’t use a lexer and generate events directly. The numbers are slightly different from the numbers in the previous post because I re-ran the benchmarks.
If you need some kind of incremental implementation, scanning the entire input and collecting the events or tokens in a vector performs badly. There’s no point in combining the list API with the push or iterator APIs.
What’s surprising is that the push lexer implementation combined with the push event generator implementation performs better than the event generator implementation that parses the input directly, without a lexer. I don’t have an explanation for this yet.
The lexer iterator implementation combined with any of the event generation implementations is slower than the event push implementation that parses the input directly, but about as fast as the event iterator implementation that parses the input directly.
Tokens to AST: Convert tokens to events, build an AST from the events.
The first three benchmarks below are from the last post, re-run and included here for comparison.
When we add an AST-building step, which is more complicated than the other steps, the performance difference between the most convenient implementation (tokenize_iter + events_iter to AST) and the most performant one (tokenize_push + events_push to AST) diminishes: in the event generation benchmark the fast one is 30% faster, but when building an AST it’s only 9% faster.
The push implementation is still faster than the recursive descent parser, even with the extra lexing step. I’m planning to investigate this further in a future post.