return to main site

Part Eighteen: Errors

So far, our parser has only concerned itself with input that is correct. In a world of IDEs and language servers, however, the input to our parser will be wrong more often than not. Therefore, it is important that our parser can:

Unfortunately for us, the only thing our parser currently supports is arithmetic expressions, which don’t make error recovery techniques easy to demonstrate. Let’s first implement a parser for variable definitions.

Variable definitions

Since variable definitions aren’t expressions, we shouldn’t put them in the expr module. Let’s create a new stmt module for statements:

// grammar.rs

mod expr;
mod stmt;
// crates/parser/src/grammar/stmt.rs

use super::*;

pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
    todo!()
}

We’ll use the same syntax for variable definitions as we used before the rewrite:

let name = expr

A test:

#[cfg(test)]
mod tests {
    use crate::check;
    use expect_test::expect;

    #[test]
    fn parse_variable_definition() {
        check(
            "let foo = bar",
            expect![[r#"
Root@0..13
  VariableDef@0..13
    LetKw@0..3 "let"
    Whitespace@3..4 " "
    Ident@4..7 "foo"
    Whitespace@7..8 " "
    Equals@8..9 "="
    Whitespace@9..10 " "
    VariableRef@10..13
      Ident@10..13 "bar""#]],
        );
    }
}

We can now implement parsing of variable definitions:

pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
    match p.peek() {
        Some(SyntaxKind::LetKw) => variable_def(p),
        _ => None,
    }
}

fn variable_def(p: &mut Parser) -> Option<CompletedMarker> {
    assert!(p.at(SyntaxKind::LetKw));
    let m = p.start();
    p.bump();

    assert!(p.at(SyntaxKind::Ident));
    p.bump();

    assert!(p.at(SyntaxKind::Equals));
    p.bump();

    expr::expr(p)?;

    Some(m.complete(p, SyntaxKind::VariableDef))
}

Instead of handling errors, we repeatedly assert the token we’re at; since we know at that point the token we’re at, we can just bump past it.

Let’s quickly define SyntaxKind::VariableDef:

// crates/syntax/src/lib.rs

#[derive(Debug, Copy, Clone, PartialEq, FromPrimitive, ToPrimitive)]
pub enum SyntaxKind {
    // snip
    Root,
    InfixExpr,
    Literal,
    ParenExpr,
    PrefixExpr,
    VariableDef,
    VariableRef,
}

Our parser compiles, but we get a failing test because root doesn’t use stmt. The solution is to have stmt fall back to calling expr if it isn’t sure what it’s looking at:

// crates/parser/src/grammar/stmt.rs

pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
    match p.peek() {
        Some(SyntaxKind::LetKw) => variable_def(p),
        _ => expr::expr(p),
    }
}

We can now call stmt from root:

// grammar.rs

pub(crate) fn root(p: &mut Parser) -> CompletedMarker {
    let m = p.start();
    stmt::stmt(p);

    m.complete(p, SyntaxKind::Root)
}
$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 18 tests
...............F..
failures:

---- grammar::stmt::tests::parse_variable_definition stdout ----
thread 'grammar::stmt::tests::parse_variable_definition' panicked at 'Markers need to be completed', /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/drop_bomb-0.1.5/src/lib.rs:113:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    grammar::stmt::tests::parse_variable_definition

test result: FAILED. 17 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out

What seems to have happened is that expr has returned None, causing variable_def to return early, which in turn means the marker isn’t completed. Since we don’t care about error handling yet, we can add an unwrap:

// stmt.rs

pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
    match p.peek() {
        Some(SyntaxKind::LetKw) => Some(variable_def(p)),
        _ => expr::expr(p),
    }
}

fn variable_def(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::LetKw));
    let m = p.start();
    p.bump();

    assert!(p.at(SyntaxKind::Ident));
    p.bump();

    assert!(p.at(SyntaxKind::Equals));
    p.bump();

    expr::expr(p).unwrap();

    m.complete(p, SyntaxKind::VariableDef)
}
$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 18 tests
..............F...
failures:

---- grammar::stmt::tests::parse_variable_definition stdout ----
thread 'grammar::stmt::tests::parse_variable_definition' panicked at 'called `Option::unwrap()` on a `None` value', crates/parser/src/grammar/stmt.rs:21:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    grammar::stmt::tests::parse_variable_definition

test result: FAILED. 17 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out

Huh, that’s weird; why is expr returning None? The only places where it can do so are let lhs = lhs()?; and if it doesn’t see an operator. Oh, wait.

We should be returning lhs at that point instead:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?; // we’ll handle errors later.

    loop {
        let op = match p.peek() {
            // snip
            _ => break, // we’ll handle errors later.
        };

        // snip
    }

    Some(lhs)
}
$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

A short intermission

Let’s say the parser is being used as part of a language server, and the user has typed out a variable definition:

let b = a

They realise they haven’t defined a, so they define it above:

let a = 10
let b = a

While the user is in the middle of typing that, our parser sees this:

let a =
let b = a

What we want is for our parser to recognise two variable definitions, with the first missing an expression. This way, language intelligence like Go To Definition can continue working in spite of an incomplete file; if we tell the editor to ‘go to the definition of a’, it will know where to go because it’s recognised two separate variable definitions. The parser would also report an error on the second let keyword along the lines of

expected number, identifier, ‘-’ or ‘(’, but found ‘let’

Our goal for the rest of this part is to make Eldiro have this behaviour.

A naive strategy for error recovery

This is the simplest of all the error recovery techniques; if you expect to see a Foo, but see something else, you skip tokens until you find a Foo. This is simple to implement but doesn’t deliver the best output, since it throws away a lot of possibly-useful tokens. For example, all the underlined tokens here are thrown away:

let a =
let b = a
-------

Here the IDE’s Go To Definition on a wouldn’t work, and the user would get an ‘undefined variable a’ error. Evidently, this error recovery strategy doesn’t let us achieve the ideal case we laid out before.

A better way

The idea is to start with the above approach, but don’t skip certain tokens. For example, let’s say we have the following code:

{
  let a = {
    let bar =
  }
  10
}

The naive approach would start looking for an expression after the let bar =, and would stop throwing away tokens at 10. It now consumes the final } to close the let a = { block. This doesn’t leave a } to close the surrounding block, though, which means the user gets an ‘unclosed {’ error. Here’s roughly what the parser sees if you don’t include errors and fix the indentation:

{
  let a = {
    let bar = 10
  }

With the fancier approach, we can tell the parser not to skip } tokens, since those are important for recognising program structure. In languages with explicit statement terminators, parsers are often constructed so that ; is never skipped. These tokens that aren’t to be skipped by the parser during error recovery are called ‘recovery sets’.

We can improve on this approach by only ever skipping one token, which means we throw away even less of the input.

Remember that ideal example from earlier?

let a =
let b = a

We can achieve the goal of recognising two variable declarations by including let in the parser’s recovery set. This way, whenever the parser encounters an unexpected let, it will stop parsing what it’s up to and keep returning from subparsers until it gets to a state in which let can be recognised.

Implementing error recovery

Let’s write a test for the case we just discussed:

// stmt.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn recover_on_let_token() {
        check(
            "let a =\nlet b = a",
            expect![[r#"
Root@0..17
  VariableDef@0..8
    LetKw@0..3 "let"
    Whitespace@3..4 " "
    Ident@4..5 "a"
    Whitespace@5..6 " "
    Equals@6..7 "="
    Whitespace@7..8 "\n"
  VariableDef@8..17
    LetKw@8..11 "let"
    Whitespace@11..12 " "
    Ident@12..13 "b"
    Whitespace@13..14 " "
    Equals@14..15 "="
    Whitespace@15..16 " "
    VariableRef@16..17
      Ident@16..17 "a""#]],
        );
    }
}

Let’s write an expect method on Parser that takes a given SyntaxKind and bumps past it if the parser is up to it; if it isn’t, then we might skip the token depending on whether it’s part of the recovery set:

// parser.rs

const RECOVERY_SET: [SyntaxKind; 1] = [SyntaxKind::LetKw];

// snip

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn expect(&mut self, kind: SyntaxKind) {
        if self.at(kind) || !self.peek().map_or(false, |k| RECOVERY_SET.contains(&k)) {
            self.bump();
        }
    }

    // snip
}

Let’s use this method in variable_def:

// stmt.rs

fn variable_def(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::LetKw));
    let m = p.start();
    p.bump();

    p.expect(SyntaxKind::Ident);
    p.expect(SyntaxKind::Equals);

    expr::expr(p).unwrap();

    m.complete(p, SyntaxKind::VariableDef)
}

Although this does make the code cleaner, the test still fails. After all, the part that’s breaking it is in expr, not here!

Let’s write an error method that we can call if expr has exhausted all possibilities:

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn expect(&mut self, kind: SyntaxKind) {
        if self.at(kind) {
            self.bump();
        } else {
            self.error();
        }
    }

    pub(crate) fn error(&mut self) {
        if !self.at_set(&RECOVERY_SET) {
            self.bump();
        }
    }

    // snip

    fn at_set(&mut self, set: &[SyntaxKind]) -> bool {
        self.peek().map_or(false, |k| set.contains(&k))
    }

    // snip
}

We’ve reimplemented Parser::expect in terms of Parser::error, and we’ve also defined a small helper method that checks whether the parser is currently at a given set of SyntaxKinds. Let’s make use of this code:

// expr.rs

fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
    let cm = match p.peek() {
        Some(SyntaxKind::Number) => literal(p),
        Some(SyntaxKind::Ident) => variable_ref(p),
        Some(SyntaxKind::Minus) => prefix_expr(p),
        Some(SyntaxKind::LParen) => paren_expr(p),
        _ => {
            p.error();
            return None;
        }
    };

    Some(cm)
}

We now have multiple failing tests! This is because we are trying to call p.bump() when we’ve arrived at the end of the input. We could add a check for if we are at the end of the input and not call p.error() then, but that would be incorrect. Think about it: if we’re trying to parse the left-hand side of an expression and are at the end of the input, that’s rightly an error. Instead, we should avoid calling Parser::bump in Parser::error if we’re at the end of the input.

// parser.rs

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            self.bump();
        }
    }

    // snip

    pub(crate) fn at_end(&mut self) -> bool {
        self.peek().is_none()
    }

    // snip
}

Our test is still failing because we’re calling unwrap in variable_def. We don’t need that anymore because we’re implementing proper error recovery:

// stmt.rs

fn variable_def(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::LetKw));
    let m = p.start();
    p.bump();

    p.expect(SyntaxKind::Ident);
    p.expect(SyntaxKind::Equals);

    expr::expr(p);

    m.complete(p, SyntaxKind::VariableDef)
}

Running our tests shows that the parser is parsing the first variable definition correctly, but the second one is missing. This is because grammar::root only consumes a single statement. Let’s make it consume multiple:

// grammar.rs

pub(crate) fn root(p: &mut Parser) -> CompletedMarker {
    let m = p.start();

    while !p.at_end() {
        stmt::stmt(p);
    }

    m.complete(p, SyntaxKind::Root)
}

Let’s write a test for that:

#[cfg(test)]
mod tests {
    use crate::check;
    use expect_test::expect;

    #[test]
    fn parse_multiple_statements() {
        check(
            "let a = 1\na",
            expect![[r#"
Root@0..11
  VariableDef@0..10
    LetKw@0..3 "let"
    Whitespace@3..4 " "
    Ident@4..5 "a"
    Whitespace@5..6 " "
    Equals@6..7 "="
    Whitespace@7..8 " "
    Literal@8..10
      Number@8..9 "1"
      Whitespace@9..10 "\n"
  VariableRef@10..11
    Ident@10..11 "a""#]],
        );
    }
}

Now both that test and recover_on_let_token pass!

$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 20 tests
....................
test result: ok. 20 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s wrap the tokens we bump in Parser::error in an Error node so we can distinguish them from tokens that are actually meant to be there:

// parser.rs

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            let m = self.start();
            self.bump();
            m.complete(self, SyntaxKind::Error);
        }
    }

    // snip
}

We don’t currently have any tests that contain erroneous tokens that end up being skipped, so all our tests are still passing.

Currently, expr_binding_power doesn’t do any error recovery directly; heck, we still have those comments in there reminding us to add it! Let’s remove them:

// expr.rs

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = match p.peek() {
            // snip
            _ => break,
        };

        // snip
    }

    Some(lhs)
}

What should we do in the case of the break when we’re working out the binding power of the current operator? One approach is to fall back to naive error recovery and loop until we find an operator. This brings with it all the problems of this strategy, though. Another option that’s used by rust-analyzer is to define a dummy NotAnOp variant of BinaryOp with a binding power of (0, 1). Let’s try that out:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => BinaryOp::Add,
            Some(SyntaxKind::Minus) => BinaryOp::Sub,
            Some(SyntaxKind::Star) => BinaryOp::Mul,
            Some(SyntaxKind::Slash) => BinaryOp::Div,
            _ => {
                p.error();
                BinaryOp::NotAnOp
            }
        };

        // snip
    }

    Some(lhs)
}

// snip

enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
    NotAnOp,
}

impl BinaryOp {
    fn binding_power(&self) -> (u8, u8) {
        match self {
            Self::NotAnOp => (0, 1),
            Self::Add | Self::Sub => (1, 2),
            Self::Mul | Self::Div => (3, 4),
        }
    }
}

If we run our tests we’ll see that a lot have panicked on Option::unwrap. This is due to the p.bump() call in expr_binding_power that panics when the parser is at the end of its input. We can solve this bug by adding an arm for None, and breaking out of the loop. It’s correct not to call Parser::error here, since parsing out a left-hand side and then being at the end of the input isn’t bad – it just means that this expression wasn’t binary.

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => BinaryOp::Add,
            Some(SyntaxKind::Minus) => BinaryOp::Sub,
            Some(SyntaxKind::Star) => BinaryOp::Mul,
            Some(SyntaxKind::Slash) => BinaryOp::Div,
            None => break,
            _ => {
                p.error();
                BinaryOp::NotAnOp
            }
        };

        // snip
    }

    Some(lhs)
}

Some of our tests are now failing on a failed assertion that the parser is up to ). This is because we also need to return from expr_binding_power in that case; if we see a ), we know we’re done parsing this expression and can let the caller continue:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => BinaryOp::Add,
            Some(SyntaxKind::Minus) => BinaryOp::Sub,
            Some(SyntaxKind::Star) => BinaryOp::Mul,
            Some(SyntaxKind::Slash) => BinaryOp::Div,
            Some(SyntaxKind::RParen) | None => break,
            _ => {
                p.error();
                BinaryOp::NotAnOp
            }
        };

        // snip
    }

    Some(lhs)
}

We now have one failing test – parse_multiple_statements. Here’s the test’s input string:

let a = 1
a

The parser goes along its way, bumping let, a and =, at which point it needs an expression. It parses out 1 as lhs, and thinks ‘alright, I now need an operator’. It sees the a on the next line and thinks ‘oh, I didn’t expect this! I guess the user meant to type an operator, but typed a instead. Since this isn’t in RECOVERY_SET I’ll wrap this in an Error node and continue parsing’. The parser now tries to bump past the operator, even though it’s already done so in Parser::error and therefore is at the end of the input, causing a panic.

We have two options:

To me, option two seems like the lesser evil, so that’s what we’ll implement. Let’s undo all those changes:

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = match p.peek() {
            Some(SyntaxKind::Plus) => BinaryOp::Add,
            Some(SyntaxKind::Minus) => BinaryOp::Sub,
            Some(SyntaxKind::Star) => BinaryOp::Mul,
            Some(SyntaxKind::Slash) => BinaryOp::Div,

            // We’re not at an operator; we don’t know what to do next, so we return and let the
            // caller decide.
            _ => break,
        };

        // snip
    }

    Some(lhs)
}

// snip

enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

impl BinaryOp {
    fn binding_power(&self) -> (u8, u8) {
        match self {
            Self::Add | Self::Sub => (1, 2),
            Self::Mul | Self::Div => (3, 4),
        }
    }
}

The only sketchy non-recovering bit of the parser left is paren_expr, which asserts the existence of a closing parenthesis. Let’s write a test to make sure that we don’t get a panic on an unclosed parenthesis:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_unclosed_parentheses() {
        check(
            "(foo",
            expect![[r#"
Root@0..4
  ParenExpr@0..4
    LParen@0..1 "("
    VariableRef@1..4
      Ident@1..4 "foo""#]],
        );
    }
}
$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 21 tests
..............F......
failures:

---- grammar::expr::tests::parse_unclosed_parentheses stdout ----
thread 'grammar::expr::tests::parse_unclosed_parentheses' panicked at 'assertion failed: p.at(SyntaxKind::RParen)', crates/parser/src/grammar/expr.rs:122:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    grammar::expr::tests::parse_unclosed_parentheses

test result: FAILED. 20 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out

As expected, the parser panicked on that assert. Let’s replace the assert-and-bump dance with a simple call to expect:

fn paren_expr(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::LParen));

    let m = p.start();
    p.bump();
    expr_binding_power(p, 0);
    p.expect(SyntaxKind::RParen);

    m.complete(p, SyntaxKind::ParenExpr)
}
$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 21 tests
.....................
test result: ok. 21 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Although we’re done with error recovery for now, this topic is deep and we’ll have to consider it in all changes we make to the parser in future.

Implementing error reporting

Now that our parser doesn’t explode when the user makes a typo, we’re ready to move on to the next topic: error reporting. In cases such as the missing parenthesis, the error of not including the parenthesis is silently ignored. A way to ensure parse errors are being created properly is to include them below the syntax tree in tests; with this change all inputs that have parse errors have to document them.

One aspect of good error messages is that they show what tokens the parser was expecting to see in the situation that caused the error. This gives the user a hint as to what they might try to fix the error. To implement this behaviour, the parser has to track which tokens it’s expecting throughout the parsing process; when it encounters an error this set can be displayed. We can use a technique mentioned on Reddit by u/matklad:

Let’s write this now:

// parser.rs

pub(crate) struct Parser<'t, 'input> {
    source: Source<'t, 'input>,
    events: Vec<Event>,
    expected_kinds: Vec<SyntaxKind>,
}

impl<'t, 'input> Parser<'t, 'input> {
    pub(crate) fn new(source: Source<'t, 'input>) -> Self {
        Self {
            source,
            events: Vec::new(),
            expected_kinds: Vec::new(),
        }
    }

    // snip

    pub(crate) fn bump(&mut self) {
        self.expected_kinds.clear();
        self.source.next_token().unwrap();
        self.events.push(Event::AddToken);
    }

    pub(crate) fn at(&mut self, kind: SyntaxKind) -> bool {
        self.expected_kinds.push(kind);
        self.peek() == Some(kind)
    }

    // snip
}

Throughout grammar we often match on p.peek() to determine what to do next. Although this results in easy-to-read code, it is incompatible with this new approach. How is the parser meant to manage expected_kinds if it can’t know which arm of the match is being run and which ones have been tried before? We’ll have to convert all these matches to if else chains that use Parser::at.

To ensure we don’t accidentally use the old match approach in future, let’s make Parser::peek private:

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    fn peek(&mut self) -> Option<SyntaxKind> {
        self.source.peek_kind()
    }
}

Let’s update all the usages of Parser::peek now:

// expr.rs

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        let op = if p.at(SyntaxKind::Plus) {
            BinaryOp::Add
        } else if p.at(SyntaxKind::Minus) {
            BinaryOp::Sub
        } else if p.at(SyntaxKind::Star) {
            BinaryOp::Mul
        } else if p.at(SyntaxKind::Slash) {
            BinaryOp::Div
        } else {
            // We’re not at an operator; we don’t know what to do next, so we return and let the
            // caller decide.
            break;
        };

        // snip
    }

    Some(lhs)
}

fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
    let cm = if p.at(SyntaxKind::Number) {
        literal(p)
    } else if p.at(SyntaxKind::Ident) {
        variable_ref(p)
    } else if p.at(SyntaxKind::Minus) {
        prefix_expr(p)
    } else if p.at(SyntaxKind::LParen) {
        paren_expr(p)
    } else {
        p.error();
        return None;
    };

    Some(cm)
}
// stmt.rs

pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
    if p.at(SyntaxKind::LetKw) {
        Some(variable_def(p))
    } else {
        expr::expr(p)
    }
}

That’s … not that bad. Most long if else chains look horribly ugly, but this one is ok.

The next change we need to make is emit errors from the parser. For that we’ll need a type that can represent a parse error:

// parser.rs

pub(crate) mod marker;

mod parse_error;
pub(crate) use parse_error::ParseError;

use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use marker::Marker;
use syntax::SyntaxKind;
// crates/parser/src/parser/parse_error.rs

use syntax::SyntaxKind;
use text_size::TextRange;

#[derive(Debug, PartialEq)]
pub(crate) struct ParseError {
    pub(super) expected: Vec<SyntaxKind>,
    pub(super) found: Option<SyntaxKind>,
    pub(super) range: TextRange,
}
# crates/parser/Cargo.toml

[dependencies]
drop_bomb = "0.1.5"
lexer = {path = "../lexer"}
rowan = "0.10.0"
syntax = {path = "../syntax"}
text-size = "1.0.0"

We’re using TextRange from the text-size crate instead of std::ops::Range<usize> because TextRange uses u32 internally, making its size in memory smaller.

The way we’ll communicate the parse errors the parser finds to other code is through the existing Event infrastructure. Let’s create Event::Error:

// event.rs

use crate::parser::ParseError;
use syntax::SyntaxKind;

#[derive(Debug, PartialEq)]
pub(crate) enum Event {
    StartNode {
        kind: SyntaxKind,
        forward_parent: Option<usize>,
    },
    AddToken,
    FinishNode,
    Error(ParseError),
    Placeholder,
}

We can push the Event::Error event when the parser encounters an error (which we’ve conveniently centralised into the Parser::error method):

// parser.rs

use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;

// snip

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        let found = self.peek();
        self.events.push(Event::Error(ParseError {
            expected: mem::take(&mut self.expected_kinds),
            found,
            range: todo!(),
        }));

        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            let m = self.start();
            self.bump();
            m.complete(self, SyntaxKind::Error);
        }
    }

    // snip
}

That mem::take function returns what is passed in, and replaces the thing that was passed in with its default value. In other words, it lets us concisely take ownership of self.expected_kinds while replacing it with a new empty Vec. We don’t have a way to get the range of the current token, though.

Let’s add a range field to Token:

// crates/lexer/src/lib.rs

use logos::Logos;
use text_size::TextRange;

// snip

#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub kind: TokenKind,
    pub text: &'a str,
    pub range: TextRange,
}
# crates/lexer/Cargo.toml

[dependencies]
logos = "0.11.4"
text-size = "1.0.0"

We need to update Lexer to emit ranges:

// lib.rs

use logos::Logos;
use std::convert::TryFrom;
use std::ops::Range as StdRange;
use text_size::{TextRange, TextSize};

// snip

impl<'a> Iterator for Lexer<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let kind = self.inner.next()?;
        let text = self.inner.slice();

        let range = {
            let StdRange { start, end } = self.inner.span();
            let start = TextSize::try_from(start).unwrap();
            let end = TextSize::try_from(end).unwrap();

            TextRange::new(start, end)
        };

        Some(Self::Item { kind, text, range })
    }
}
// token_kind.rs

#[cfg(test)]
mod tests {
    use super::*;
    use crate::Lexer;

    fn check(input: &str, kind: TokenKind) {
        let mut lexer = Lexer::new(input);

        let token = lexer.next().unwrap();
        assert_eq!(token.kind, kind);
        assert_eq!(token.text, input);
    }

    // snip
}

To access these ranges from the parser, we need to change Source’s API:

// source.rs

impl<'t, 'input> Source<'t, 'input> {
    // snip

    pub(crate) fn peek_token(&mut self) -> Option<&Token> {
        self.eat_trivia();
        self.peek_token_raw()
    }

    // snip

    fn peek_kind_raw(&self) -> Option<SyntaxKind> {
        self.peek_token_raw()
            .map(|Token { kind, .. }| (*kind).into())
    }

    fn peek_token_raw(&self) -> Option<&Token> {
        self.tokens.get(self.cursor)
    }
}

And use it from Parser::error:

use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use lexer::Token;
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;

// snip

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        let current_token = self.source.peek_token();

        let (found, range) = current_token.map_or_else(
            || (None, todo!()),
            |Token { kind, range, .. }| (Some((*kind).into()), *range),
        );

        self.events.push(Event::Error(ParseError {
            expected: mem::take(&mut self.expected_kinds),
            found,
            range,
        }));

        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            let m = self.start();
            self.bump();
            m.complete(self, SyntaxKind::Error);
        }
    }

    // snip
}

Take a look at that todo!() though; what range do we use for the parse error if we’re at the end of the input? To me the most sensible answer is to use the range of the last token. Let’s add another method to Source:

use lexer::Token;
use syntax::SyntaxKind;
use text_size::TextRange;

// snip

impl<'t, 'input> Source<'t, 'input> {
    // snip

    pub(crate) fn last_token_range(&self) -> Option<TextRange> {
        self.tokens.last().map(|Token { range, .. }| *range)
    }

    // snip
}

We return an Option because it isn’t up to Source to handle the case in which the input is empty; we’ll leave that up to the parser. Speaking of which, Parser:error won’t ever be called when the input is empty because we know that grammar::root is fine with empty inputs. Thus, it is safe to unwrap:

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        let current_token = self.source.peek_token();

        let (found, range) = current_token.map_or_else(
            // If we’re at the end of the input we use the range of the very last token in the
            // input.
            || (None, self.source.last_token_range().unwrap()),
            |Token { kind, range, .. }| (Some((*kind).into()), *range),
        );

        self.events.push(Event::Error(ParseError {
            expected: mem::take(&mut self.expected_kinds),
            found,
            range,
        }));

        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            let m = self.start();
            self.bump();
            m.complete(self, SyntaxKind::Error);
        }
    }

    // snip
}

We have one last compilation error in Sink, which we need to change to maintain a list of parse errors:

// sink.rs

use super::event::Event;
use crate::parser::ParseError;
use lexer::Token;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::mem;
use syntax::{EldiroLanguage, SyntaxKind};

pub(crate) struct Sink<'t, 'input> {
    builder: GreenNodeBuilder<'static>,
    tokens: &'t [Token<'input>],
    cursor: usize,
    events: Vec<Event>,
    errors: Vec<ParseError>,
}

impl<'t, 'input> Sink<'t, 'input> {
    pub(crate) fn new(tokens: &'t [Token<'input>], events: Vec<Event>) -> Self {
        Self {
            builder: GreenNodeBuilder::new(),
            tokens,
            cursor: 0,
            events,
            errors: Vec::new(),
        }
    }

    pub(crate) fn finish(mut self) -> GreenNode {
        for idx in 0..self.events.len() {
            match mem::replace(&mut self.events[idx], Event::Placeholder) {
                // snip
                Event::Error(error) => self.errors.push(error),
                // snip
            }

            self.eat_trivia();
        }

        self.builder.finish()
    }

    // snip

    fn token(&mut self) {
        let Token { kind, text, .. } = self.tokens[self.cursor];

        self.builder
            .token(EldiroLanguage::kind_to_raw(kind.into()), text.into());

        self.cursor += 1;
    }
}

Fixing these basic compilation errors has revealed a borrow checker error; fortunately for us, this is trivial to fix by converting the call to Option::map_or_else in Parser::error to an if let:

// parser.rs

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn error(&mut self) {
        let current_token = self.source.peek_token();

        let (found, range) = if let Some(Token { kind, range, .. }) = current_token {
            (Some((*kind).into()), *range)
        } else {
            // If we’re at the end of the input we use the range of the very last token in the
            // input.
            (None, self.source.last_token_range().unwrap())
        };

        self.events.push(Event::Error(ParseError {
            expected: mem::take(&mut self.expected_kinds),
            found,
            range,
        }));

        if !self.at_set(&RECOVERY_SET) && !self.at_end() {
            let m = self.start();
            self.bump();
            m.complete(self, SyntaxKind::Error);
        }
    }

    // snip
}

Although we now have a Vec<ParseError>, we aren’t using it! Let’s add it as a field to Parse, which we’ll construct directly from Sink::finish:

// lib.rs

use lexer::Lexer;
use parser::{ParseError, Parser};
use rowan::GreenNode;
use sink::Sink;
use source::Source;
use syntax::SyntaxNode;

pub fn parse(input: &str) -> Parse {
    let tokens: Vec<_> = Lexer::new(input).collect();
    let source = Source::new(&tokens);
    let parser = Parser::new(source);
    let events = parser.parse();
    let sink = Sink::new(&tokens, events);

    sink.finish()
}

pub struct Parse {
    green_node: GreenNode,
    errors: Vec<ParseError>,
}
// sink.rs

use crate::event::Event;
use crate::parser::ParseError;
use crate::Parse;
use lexer::Token;
use rowan::{GreenNodeBuilder, Language};
use std::mem;
use syntax::{EldiroLanguage, SyntaxKind};

// snip

impl<'t, 'input> Sink<'t, 'input> {
    // snip

    pub(crate) fn finish(mut self) -> Parse {
        // snip

        Parse {
            green_node: self.builder.finish(),
            errors: self.errors,
        }
    }

    // snip
}

We need to be able to display parse errors, both for display to the user and for usage in tests. Let’s write a test to start:

// parse_error.rs

#[cfg(test)]
mod tests {
    use super::*;
    use std::ops::Range as StdRange;

    fn check(
        expected: Vec<SyntaxKind>,
        found: Option<SyntaxKind>,
        range: StdRange<u32>,
        output: &str,
    ) {
        let error = ParseError {
            expected,
            found,
            range: {
                let start = range.start.into();
                let end = range.end.into();
                TextRange::new(start, end)
            },
        };

        assert_eq!(format!("{}", error), output);
    }

    #[test]
    fn one_expected_did_find() {
        check(
            vec![SyntaxKind::Equals],
            Some(SyntaxKind::Ident),
            10..20,
            "error at 10..20: expected ‘=’, but found identifier",
        );
    }
}

Let’s write a dead-simple initial implementation:

use std::fmt;
use syntax::SyntaxKind;
use text_size::TextRange;

// snip

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "error at {}..{}: expected {}, but found {}",
            u32::from(self.range.start()),
            u32::from(self.range.end()),
            self.expected[0],
            self.found.unwrap(),
        )
    }
}

This depends on a Display implementation on SyntaxKind; we can write that now:

// crates/syntax/src/lib.rs

use lexer::TokenKind;
use num_derive::{FromPrimitive, ToPrimitive};
use num_traits::{FromPrimitive, ToPrimitive};
use std::fmt;

// snip

impl fmt::Display for SyntaxKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            SyntaxKind::Whitespace => "whitespace",
            SyntaxKind::FnKw => "‘fn’",
            SyntaxKind::LetKw => "‘let’",
            SyntaxKind::Ident => "identifier",
            SyntaxKind::Number => "number",
            SyntaxKind::Plus => "‘+’",
            SyntaxKind::Minus => "‘-’",
            SyntaxKind::Star => "‘*’",
            SyntaxKind::Slash => "‘/’",
            SyntaxKind::Equals => "‘=’",
            SyntaxKind::LParen => "‘(’",
            SyntaxKind::RParen => "‘)’",
            SyntaxKind::LBrace => "‘{’",
            SyntaxKind::RBrace => "‘}’",
            SyntaxKind::Comment => "comment",
            _ => unreachable!(),
        })
    }
}

Let’s verify that parse errors are emitted properly by including them in tests:

// crates/parser/src/lib.rs

impl Parse {
    pub fn debug_tree(&self) -> String {
        let mut s = String::new();

        let syntax_node = SyntaxNode::new_root(self.green_node.clone());
        let tree = format!("{:#?}", syntax_node);

        // We cut off the last byte because formatting the SyntaxNode adds on a newline at the end.
        s.push_str(&tree[0..tree.len() - 1]);

        for error in &self.errors {
            s.push_str(&format!("\n{}", error));
        }

        s
    }
}

We get two test failures, one of which is from the unwrap in ParseError’s Display implementation, while the other is from the error message not being in the test’s expected text. Let’s write a test for the case where the error is at the end of the input and found is therefore None:

// parse_error.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn one_expected_did_not_find() {
        check(
            vec![SyntaxKind::RParen],
            None,
            5..6,
            "error at 5..6: expected ‘)’",
        );
    }
}

To solve this, we can change the Display implementation to check if found is None:

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "error at {}..{}: expected {}",
            u32::from(self.range.start()),
            u32::from(self.range.end()),
            self.expected[0]
        )?;

        if let Some(found) = self.found {
            write!(f, ", but found {}", found)?;
        }

        Ok(())
    }
}

We should write a test to cover the case where expected has more than one kind:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn multiple_expected_did_find() {
        check(
            vec![
                SyntaxKind::Number,
                SyntaxKind::Ident,
                SyntaxKind::Minus,
                SyntaxKind::LParen,
            ],
            Some(SyntaxKind::LetKw),
            100..105,
            "error at 100..105: expected number, identifier, ‘-’ or ‘(’, but found ‘let’",
        );
    }
}

This can be implemented with a loop:

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "error at {}..{}: expected ",
            u32::from(self.range.start()),
            u32::from(self.range.end()),
        )?;

        let num_expected = self.expected.len();
        let is_first = |idx| idx == 0;
        let is_last = |idx| idx == num_expected - 1;

        for (idx, expected_kind) in self.expected.iter().enumerate() {
            if is_first(idx) {
                write!(f, "{}", expected_kind)?;
            } else if is_last(idx) {
                write!(f, " or {}", expected_kind)?;
            } else {
                write!(f, ", {}", expected_kind)?;
            }
        }

        if let Some(found) = self.found {
            write!(f, ", but found {}", found)?;
        }

        Ok(())
    }
}

Let’s add a test for when two SyntaxKind are expected to make sure that or is added between them instead of a comma:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn two_expected_did_find() {
        check(
            vec![SyntaxKind::Plus, SyntaxKind::Minus],
            Some(SyntaxKind::Equals),
            0..1,
            "error at 0..1: expected ‘+’ or ‘-’, but found ‘=’",
        );
    }

    // snip
}

Looking through the failed tests shows that the error messages are being generated properly. We can update these automatically using expect-test:

$ UPDATE_EXPECT=1 cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 25 tests
.........................
test result: ok. 25 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s try out our error recovery and reporting in the REPL:

$ cargo r -q
→ (1+
Root@0..4
  ParenExpr@0..4
    LParen@0..1 "("
    InfixExpr@1..4
      Literal@1..2
        Number@1..2 "1"
      Plus@2..3 "+"
      Whitespace@3..4 "\n"
error at 3..4: expected number, identifier, ‘-’ or ‘(’
error at 3..4: expected ‘+’, ‘-’, ‘*’, ‘/’ or ‘)’

The first of those two parse errors makes sense, because an expression is expected after an operator. The second one, not so much; why would you add another operator after an existing operator?

It’s non-obvious why the parser is emitting the second error message (apart from the )); to understand why it’s doing this we should keep the basic structure of expr_binding_power in mind:

fn expr_binding_power() -> Option<CompletedMarker> {
    let lhs = lhs()?;

    loop {
        let op = get_operator();
        expr_binding_power();
    }

    Some(lhs)
}

Let’s imagine we’re parsing the 1+ in (1+. We successfully get a left-hand side, and now enter the loop and extract an operator. We now recurse, and fail to get a left-hand side. This means we return None from this inner call. Here’s where the problem lies: rather than exiting because we couldn’t get a right-hand side, we just loop around again and try to get another operator. Let’s add a check to expr_binding_power to fix this issue:

// expr.rs

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
    let mut lhs = lhs(p)?;

    loop {
        // snip

        let m = lhs.precede(p);
        let parsed_rhs = expr_binding_power(p, right_binding_power).is_some();
        lhs = m.complete(p, SyntaxKind::InfixExpr);

        if !parsed_rhs {
            break;
        }
    }

    Some(lhs)
}

Instead of manually typing out that scenario from before in the REPL to find out if we’ve squashed the bug, we can write a test to ensure we never regress on this specific scenario:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn do_not_parse_operator_if_gettting_rhs_failed() {
        check(
            "(1+",
            expect![[r#"
Root@0..3
  ParenExpr@0..3
    LParen@0..1 "("
    InfixExpr@1..3
      Literal@1..2
        Number@1..2 "1"
      Plus@2..3 "+"
error at 2..3: expected number, identifier, ‘-’ or ‘(’
error at 2..3: expected ‘)’"#]],
        );
    }

    // snip
}

It makes sense that we would get two errors, since there really are two errors here: that of the binary expression missing a right-hand side, and that of an unclosed parenthesis.

$ cargo t -q --lib

running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out


running 26 tests
..........................
test result: ok. 26 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Refactoring

We’re almost done! Currently, Eldiro uses TokenKinds exclusively in the lexer, and converts them to SyntaxKinds for use everywhere else. We’d be better off with using TokenKind for things that can’t be any of the node variants exclusive to SyntaxKind, such as the SyntaxKind passed into Parser::at. This is a straightforward change to make, so I’ll just show the changed sections without any explanation:

// grammar.rs

use crate::parser::marker::CompletedMarker;
use crate::parser::Parser;
use lexer::TokenKind;
use syntax::SyntaxKind;
// parser.rs

use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use lexer::{Token, TokenKind};
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;

const RECOVERY_SET: [TokenKind; 1] = [TokenKind::LetKw];

pub(crate) struct Parser<'t, 'input> {
    source: Source<'t, 'input>,
    events: Vec<Event>,
    expected_kinds: Vec<TokenKind>,
}

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    pub(crate) fn expect(&mut self, kind: TokenKind) {
        // snip
    }

    pub(crate) fn error(&mut self) {
        let current_token = self.source.peek_token();

        let (found, range) = if let Some(Token { kind, range, .. }) = current_token {
            (Some(*kind), *range)
        } else {
            // If we’re at the end of the input we use the range of the very last token in the
            // input.
            (None, self.source.last_token_range().unwrap())
        };

        // snip
    }

    // snip

    pub(crate) fn at(&mut self, kind: TokenKind) -> bool {
        // snip
    }

    fn at_set(&mut self, set: &[TokenKind]) -> bool {
        // snip
    }

    // snip

    fn peek(&mut self) -> Option<TokenKind> {
        // snip
    }
}
// crates/lexer/src/token_kind.rs

use logos::Logos;
use std::fmt;

// snip

impl TokenKind {
    pub fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

impl fmt::Display for TokenKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::Whitespace => "whitespace",
            Self::FnKw => "‘fn’",
            Self::LetKw => "‘let’",
            Self::Ident => "identifier",
            Self::Number => "number",
            Self::Plus => "‘+’",
            Self::Minus => "‘-’",
            Self::Star => "‘*’",
            Self::Slash => "‘/’",
            Self::Equals => "‘=’",
            Self::LParen => "‘(’",
            Self::RParen => "‘)’",
            Self::LBrace => "‘{’",
            Self::RBrace => "‘}’",
            Self::Comment => "comment",
            Self::Error => "an unrecognized token",
        })
    }
}

Delete SyntaxKind::is_trivia and SyntaxKind’s Display implementation. Replace all uses of SyntaxKind with TokenKind in parse_error.rs and sink.rs. Now follow the Rust compiler’s errors to fix all the type errors that have now cropped up across the project. Sorry for this having a bit of a ‘this is left as an exercise for the reader’ feel, but I think that making this part any longer than it already is for the sake of simple substitutions is not a good idea.

Take a look at the diff of the commit that makes this change if you get confused or stuck.

Conclusion

Hopefully this wasn’t too difficult to follow! We’ll create structures to work with parsed code in the next part.


  1. I mention this because, in my experience, this case is very rare; my most common typos with infix operators are mistyping one for the other, not as something completely different. ↩︎