September 27, 2025 - Tagged as: en, fir.
Fir formats comments by assigning comment tokens to non-comment tokens (only conceptually, not in the implementation, see below), and generating comments when formatting the tokens that “own” them.
This keeps AST nodes small. The parser doesn’t know about comments at all, and code that doesn’t care about comments don’t allocate more or run more code for comments.
Formatting source code with comments is tricky, and common suggestions like adding comments to AST nodes or generating lossless (or concrete) syntax trees (CSTs) are not feasible in real programming languages. Consider this simple Fir function:
add(x: U32, y: U32) U32:
...
This simple function, without the body, has 14 places where a comment can appear:
#|1|#
#|2|# add #|3|# (
#|4|# x #|5|# : #|6|# U32 #|7|# ,
#|8|# y #|9|# : #|10|# U32 #|11|#
) #|12|# U32 #|13|# : #|14|#
...
If I were to add comment tokens to AST nodes, about 6 of these would belong to the “function declaration” AST node:
#|1|#
#|2|# add #|3|# ( #|4|# ... ) #|12|# ... : #|14|#
...
Because each of these is in different positions in the declaration, they would need different fields in the AST node.
If you consider that a real programming language will have hundreds of different types of expression, statement, declaration, … nodes, it becomes clear that this approach is simply not feasible.
The CST approach is not too different, it just moves the inconvenience from the tree type definitions and tree allocations to the use sites of the trees.
What Fir does is much simpler: it requires no support from the parse trees. The parser doesn’t even know about comments, and the AST users that don’t care about comments also don’t need to deal with them and don’t pay any price for them (runtime or memory).
Conceptually, we assign every comment token to a non-comment token. In the example above, comments 1, 2, and 3 belong to the identifier add
. Comment 4 belongs to the token (
, and so on.
When formatting, we don’t generate text directly. Instead we format the source code token by token. In the example above, we’re formatting a function definition, so we know that there will be a left paren after the function name. But we don’t generate a “(” directly after the function name. Instead we find the token for the left paren, and format it. This formatting operation also generates comments that belong to the left paren.
Assigning comment tokens to non-comment tokens: Conceptually, every token owns:
Comment tokens before them that are not on the same line with another non-comment token.
Comment tokens after them that are on the same line with the token.
In the example above, 1 and 2 belong to the identifier add
because of the first rule, and 3 also belongs to the identifier because of the second rule.
This only leaves the trailing comments at the end of a file “unowned”, which we handle separately as their own thing.
Finding tokens of AST nodes: The formatter still operates on AST nodes and AST nodes typically don’t need any extra fields for their tokens.
Instead of adding tokens to AST nodes, we represent identifiers as their tokens. Because many AST nodes have identifiers, we can start with those tokens and scan backwards and forwards to find the other tokens of the AST node, with the comments that they own.
When an AST node doesn’t have any identifiers, or finding the tokens of the node from the identifiers is difficult, we add a field for its first (or last) token, and scan forwards (or backwards) from those tokens to find the other tokens.
For example, in Fir, as of today, type declarations are represented as this: (source)
## A type declaration: `type Vec[t]: ...`.
type TypeDecl(
## When the type is a primitive, the `prim` token.
prim_: Option[TokenIdx],
## The type name. `Vec` in the example.
name: Id,
## Type parameters of the type. `[t]` in the example.
typeParams: Vec[Id],
## Kinds of `type_params`. Filled in by kind inference.
typeParamKinds: Vec[Kind],
## Constructors of the type.
rhs: Option[TypeDeclRhs],
)
Note that this node doesn’t have a token for the type
keyword. Instead we start from name
and scan backwards. The first non-trivia token that we see will be the type
token. (source)
(The prim_
field could also be removed and we could scan backwards from the type
token. If you’re interested in contributing, we have an issue about cleaning up redundant token fields in AST nodes, which would be a good issue for getting started.)
Generating comments with tokens: I used the word “conceptually” a few times above, because in the implementation we don’t really assign comment tokens to non-comment tokens.
Instead, the function that formats a token scans backwards and forwards to find comment tokens as described by the rules above, and generates them with the token.
Scanning backwards and forwards to find other tokens and collecting comment tokens that belong to a token being formatted are quite simple. Here are the relevant code:
formatToken
takes a non-comment token to be formatted and formats the token with the comments that belong to the token.
formatToken
calls findCommentBefore
to find the first comment before it that needs to be formatted with it.
Finding the comments after it is easier, so it’s done in formatToken
directly.
nextNonTrivia
and prevNonTrivia
scan forwards and backwards from a given token to find the tokens of an AST node, as mentioned in the type declaration example above.
The trailing comments at the end of the file are not owned by any token, so they’re not formatted by default. Instead they’re handled specially by the module formatter.
Not adding tokens to the AST nodes keeps the AST nodes small (cheaper to allocate), and parser and user code simple. Use sites that don’t care about comment nodes pay no price for larger AST nodes or extra parsing code handling comments.
(There are a few open issues about Fir’s formatter, but none that are caused by the ideas explained in this post.)