The first thing we need to do is teach the lexer to recognise comments. We’ll begin with a test:
// lexer.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn lex_comment() {
check("# foo", SyntaxKind::Comment);
}
}
Here’s the implementation:
pub(crate) enum SyntaxKind {
// snip
#[regex("#.*")]
Comment,
#[error]
Error,
Root,
BinaryExpr,
PrefixExpr,
}
Take note of how we aren’t using #[logos::skip]
here; instead, we are explicitly including comments in the output of our lexer. We do this to ensure that the parser fully contains the input text, which makes the parser lossless. This makes implementing tools that interact with the source text (a good example is automatic refactorings in an IDE) easier to implement.
Just like with whitespace, it would be nice if we don’t have to manually handle comments in the parser. We could add extra checks to our existing eat_whitespace
methods on Sink
and Source
for comments, but that’s annoying. What if we have other token kinds that we want to automatically skip in future?
There’s a name for this kind of irrelevant token: trivia. As far as I can tell, the term comes from Roslyn. Let’s add an is_trivia
method to SyntaxKind
to abstract away this behaviour:
impl SyntaxKind {
pub(crate) fn is_trivia(self) -> bool {
matches!(self, Self::Whitespace | Self::Comment)
}
}
Note how the method takes self
; this is because it’s more efficient to pass SyntaxKind
by value instead of by reference, since the size of SyntaxKind
is one byte, which is less than the size of a reference (eight bytes on 64-bit systems). Also note that is_trivia
won’t consume the instance of SyntaxKind
, since SyntaxKind
is Copy
.
Now that we have a way to ask a SyntaxKind
if it is trivia, we can use this method in Sink
and Source
:
// source.rs
impl<'l, 'input> Source<'l, 'input> {
// snip
pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
self.eat_trivia();
let lexeme = self.lexemes.get(self.cursor)?;
self.cursor += 1;
Some(lexeme)
}
pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
self.eat_trivia();
self.peek_kind_raw()
}
fn eat_trivia(&mut self) {
while self.at_trivia() {
self.cursor += 1;
}
}
fn at_trivia(&self) -> bool {
self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia)
}
// snip
}
// sink.rs
impl<'l, 'input> Sink<'l, 'input> {
// snip
pub(super) fn finish(mut self) -> GreenNode {
// snip
for event in reordered_events {
match event {
// snip
}
self.eat_trivia();
}
// snip
}
fn eat_trivia(&mut self) {
while let Some(lexeme) = self.lexemes.get(self.cursor) {
if !lexeme.kind.is_trivia() {
break;
}
self.token(lexeme.kind, lexeme.text.into());
}
}
// snip
}
Let’s write a test to find out if what we’ve made works:
// parser.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_comment() {
check(
"# hello!",
expect![[r##"
Root@0..8
Comment@0..8 "# hello!""##]],
);
}
}
The usage of an extra #
in the raw string literal is to stop Rust from thinking that the "#
in Comment@0..8 "#
is meant to end the string literal.
$ cargo t -q
running 34 tests
..................................
test result: ok. 34 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s try parsing a binary expression interspersed with comments:
// expr.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_binary_expression_interspersed_with_comments() {
check(
"
1
+ 1 # Add one
+ 10 # Add ten",
expect![[r##"
Root@0..35
Whitespace@0..1 "\n"
BinaryExpr@1..35
BinaryExpr@1..21
Number@1..2 "1"
Whitespace@2..5 "\n "
Plus@5..6 "+"
Whitespace@6..7 " "
Number@7..8 "1"
Whitespace@8..9 " "
Comment@9..18 "# Add one"
Whitespace@18..21 "\n "
Plus@21..22 "+"
Whitespace@22..23 " "
Number@23..25 "10"
Whitespace@25..26 " "
Comment@26..35 "# Add ten""##]],
);
}
// snip
}
The test fails, since we aren’t lexing newlines. Let’s write a test for this:
// lexer.rs
#[cfg(test)]
mod tests {
use super::*;
fn check(input: &str, kind: SyntaxKind) {
// snip
}
#[test]
fn lex_spaces_and_newlines() {
check(" \n ", SyntaxKind::Whitespace);
}
// snip
}
pub(crate) enum SyntaxKind {
#[regex("[ \n]+")]
Whitespace,
// snip
}
All our tests pass now:
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
In the next part we’ll introduce another new concept to our parser: markers.