Part Fourteen: Comments · Make A Language

The first thing we need to do is teach the lexer to recognise comments. We’ll begin with a test:

// lexer.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_comment() {
        check("# foo", SyntaxKind::Comment);
    }
}

Here’s the implementation:

pub(crate) enum SyntaxKind {
    // snip

    #[regex("#.*")]
    Comment,

    #[error]
    Error,

    Root,
    BinaryExpr,
    PrefixExpr,
}

Take note of how we aren’t using the logos::skip callback here; instead, we explicitly include comments in the output of our lexer. We do this so that the syntax tree the parser produces contains the input text in full, which makes the parser lossless. This makes it easier to implement tools that interact with the source text (automatic refactorings in an IDE are a good example).
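To make the “lossless” property concrete, here is a minimal sketch that doesn’t depend on Logos at all. The Kind enum, lex and reconstruct are hypothetical stand-ins written by hand for this illustration, not the lexer from this series. The point is only that gluing the lexemes’ text back together gives you the original input, comments and whitespace included.

```rust
// Hypothetical, hand-rolled stand-in for the Logos-generated lexer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Comment,
    Number,
    Plus,
    Error,
}

/// Counts how many leading bytes of `bytes` satisfy `pred`.
fn run_len(bytes: &[u8], pred: impl Fn(u8) -> bool) -> usize {
    bytes.iter().take_while(|&&b| pred(b)).count()
}

/// Splits `input` into (kind, text) pairs without dropping any characters.
fn lex(input: &str) -> Vec<(Kind, &str)> {
    let mut lexemes = Vec::new();
    let mut start = 0;

    while start < input.len() {
        let rest = &input.as_bytes()[start..];

        let (kind, len) = match rest[0] {
            b' ' | b'\n' => (Kind::Whitespace, run_len(rest, |b| b == b' ' || b == b'\n')),
            // Like the regex `#.*`, a comment runs up to (not including) the newline.
            b'#' => (Kind::Comment, run_len(rest, |b| b != b'\n')),
            b'0'..=b'9' => (Kind::Number, run_len(rest, |b| b.is_ascii_digit())),
            b'+' => (Kind::Plus, 1),
            // Anything else becomes a one-character error lexeme; it is kept, not skipped.
            _ => (
                Kind::Error,
                input[start..].chars().next().map_or(1, char::len_utf8),
            ),
        };

        lexemes.push((kind, &input[start..start + len]));
        start += len;
    }

    lexemes
}

/// Concatenates the text of every lexeme, trivia included.
fn reconstruct(lexemes: &[(Kind, &str)]) -> String {
    lexemes.iter().map(|(_, text)| *text).collect()
}
```

For any input, reconstruct(&lex(input)) should be equal to input. If the lexer skipped comments instead, the "# add" in "1 + 2 # add" would be gone for good, and a tool that rewrites source code would have no way to put it back.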

Just like with whitespace, it would be nice if we didn’t have to handle comments manually in the parser. We could add extra checks for comments to our existing eat_whitespace methods on Sink and Source, but that’s annoying. What if we have other token kinds that we want to skip automatically in the future?

There’s a name for this kind of irrelevant token: trivia. As far as I can tell, the term comes from Roslyn. Let’s add an is_trivia method to SyntaxKind to abstract away this behaviour:

impl SyntaxKind {
    pub(crate) fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

Note how the method takes self; this is because it’s more efficient to pass SyntaxKind by value instead of by reference, since the size of SyntaxKind is one byte, which is less than the size of a reference (eight bytes on 64-bit systems). Also note that is_trivia won’t consume the instance of SyntaxKind, since SyntaxKind is Copy.
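If you’d like to check these sizes yourself, std::mem::size_of will tell you. The enum below is a hypothetical, trimmed-down stand-in for our SyntaxKind. The one-byte size is what rustc picks in practice for a small fieldless enum rather than something this sketch can promise in general, and the size of a reference depends on the target’s pointer width.

```rust
use std::mem::size_of;

// Hypothetical, trimmed-down stand-in for the SyntaxKind from this series.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
}

impl SyntaxKind {
    // Takes `self` by value; because SyntaxKind is Copy, the caller keeps its value.
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

/// Returns (size of SyntaxKind, size of a reference to SyntaxKind) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<SyntaxKind>(), size_of::<&SyntaxKind>())
}

/// Calls is_trivia and then uses the same value again, which compiles
/// only because the call copied `kind` instead of moving it.
fn use_after_call() -> (bool, SyntaxKind) {
    let kind = SyntaxKind::Comment;
    let trivia = kind.is_trivia();
    (trivia, kind)
}
```

Removing Copy from the derive list makes use_after_call fail to compile with a “use of moved value” error, which is a quick way to convince yourself that by-value self relies on Copy here.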

Now that we have a way to ask a SyntaxKind if it is trivia, we can use this method in Sink and Source:

// source.rs

impl<'l, 'input> Source<'l, 'input> {
    // snip

    pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
        self.eat_trivia();

        let lexeme = self.lexemes.get(self.cursor)?;
        self.cursor += 1;

        Some(lexeme)
    }

    pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.peek_kind_raw()
    }

    fn eat_trivia(&mut self) {
        while self.at_trivia() {
            self.cursor += 1;
        }
    }

    fn at_trivia(&self) -> bool {
        self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia)
    }

    // snip
}

// sink.rs

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for event in reordered_events {
            match event {
                // snip
            }

            self.eat_trivia();
        }

        // snip
    }

    fn eat_trivia(&mut self) {
        while let Some(lexeme) = self.lexemes.get(self.cursor) {
            if !lexeme.kind.is_trivia() {
                break;
            }

            self.token(lexeme.kind, lexeme.text.into());
        }
    }

    // snip
}
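To see the skipping behaviour in isolation, here is a small standalone sketch. Cursor, next_kind and significant_kinds are hypothetical names made up for this example; the real Source works on lexemes rather than bare kinds, and the enum is again a cut-down stand-in. The idea is the same, though: every peek first steps over any trivia, so the caller only ever sees tokens that matter to the grammar.

```rust
// Hypothetical, trimmed-down stand-in for the SyntaxKind from this series.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
    Plus,
}

impl SyntaxKind {
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

// Simplified version of Source that walks over kinds instead of lexemes.
struct Cursor<'l> {
    kinds: &'l [SyntaxKind],
    pos: usize,
}

impl<'l> Cursor<'l> {
    fn new(kinds: &'l [SyntaxKind]) -> Self {
        Self { kinds, pos: 0 }
    }

    /// Skips trivia, then returns the kind at the cursor without consuming it.
    fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.kinds.get(self.pos).copied()
    }

    /// Skips trivia, then consumes and returns the next significant kind.
    fn next_kind(&mut self) -> Option<SyntaxKind> {
        let kind = self.peek_kind()?;
        self.pos += 1;
        Some(kind)
    }

    fn eat_trivia(&mut self) {
        while self.kinds.get(self.pos).map_or(false, |kind| kind.is_trivia()) {
            self.pos += 1;
        }
    }
}

/// Collects everything the parser would see, i.e. all non-trivia kinds in order.
fn significant_kinds(kinds: &[SyntaxKind]) -> Vec<SyntaxKind> {
    let mut cursor = Cursor::new(kinds);
    let mut seen = Vec::new();

    while let Some(kind) = cursor.next_kind() {
        seen.push(kind);
    }

    seen
}
```

Feeding in the kinds for something like " 1 # one\n + 2 # two" gives back just Number, Plus, Number. Adding a new kind of trivia later only means touching is_trivia.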

Let’s write a test to find out if what we’ve made works:

// parser.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_comment() {
        check(
            "# hello!",
            expect![[r##"
Root@0..8
  Comment@0..8 "# hello!""##]],
        );
    }
}

We use an extra # in the raw string literal’s delimiters so that Rust doesn’t treat the "# in Comment@0..8 "# hello!" as the end of the string literal.
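Here is a quick sketch of how the number of #s changes where a raw string ends. The function names are made up for this illustration. A raw string opened with r#" is closed by the first "# the compiler finds, while one opened with r##" keeps going until it reaches "##.

```rust
/// One `#` is enough as long as the text never contains `"#`.
fn one_hash() -> &'static str {
    r#"Number@0..1 "1""#
}

/// The text contains `"#`, so the delimiters need two `#`s.
fn two_hashes() -> &'static str {
    r##"Comment@0..8 "# hello!""##
}

/// The same text as `two_hashes`, written as an ordinary literal with escapes.
fn escaped() -> &'static str {
    "Comment@0..8 \"# hello!\""
}
```

two_hashes and escaped return identical strings; the raw version is just easier to read and to paste test output into.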

$ cargo t -q
running 34 tests
..................................
test result: ok. 34 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s try parsing a binary expression interspersed with comments:

// expr.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_binary_expression_interspersed_with_comments() {
        check(
            "
1
  + 1 # Add one
  + 10 # Add ten",
            expect![[r##"
Root@0..35
  Whitespace@0..1 "\n"
  BinaryExpr@1..35
    BinaryExpr@1..21
      Number@1..2 "1"
      Whitespace@2..5 "\n  "
      Plus@5..6 "+"
      Whitespace@6..7 " "
      Number@7..8 "1"
      Whitespace@8..9 " "
      Comment@9..18 "# Add one"
      Whitespace@18..21 "\n  "
    Plus@21..22 "+"
    Whitespace@22..23 " "
    Number@23..25 "10"
    Whitespace@25..26 " "
    Comment@26..35 "# Add ten""##]],
        );
    }

    // snip
}

The test fails, since we aren’t lexing newlines. Let’s write a test for this:

// lexer.rs

#[cfg(test)]
mod tests {
    use super::*;

    fn check(input: &str, kind: SyntaxKind) {
        // snip
    }

    #[test]
    fn lex_spaces_and_newlines() {
        check("  \n ", SyntaxKind::Whitespace);
    }

    // snip
}

pub(crate) enum SyntaxKind {
    #[regex("[ \n]+")]
    Whitespace,

    // snip
}

All our tests pass now:

$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

In the next part we’ll introduce another new concept to our parser: markers.