So far, our parser has only concerned itself with input that is correct. In a world of IDEs and language servers, however, the input to our parser will be wrong more often than not. Therefore, it is important that our parser can:
- recover gracefully from errors
- infer as detailed a syntax tree from the input as possible
- report errors in a user-friendly manner
Unfortunately for us, the only thing our parser currently supports is arithmetic expressions, which don’t make error recovery techniques easy to demonstrate. Let’s first implement a parser for variable definitions.
Variable definitions
Since variable definitions aren’t expressions, we shouldn’t put them in the expr
module. Let’s create a new stmt
module for statements:
// grammar.rs
mod expr;
mod stmt;
// crates/parser/src/grammar/stmt.rs
use super::*;
pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
todo!()
}
We’ll use the same syntax for variable definitions as we used before the rewrite:
let name = expr
A test:
#[cfg(test)]
mod tests {
use crate::check;
use expect_test::expect;
#[test]
fn parse_variable_definition() {
check(
"let foo = bar",
expect![[r#"
Root@0..13
VariableDef@0..13
LetKw@0..3 "let"
Whitespace@3..4 " "
Ident@4..7 "foo"
Whitespace@7..8 " "
Equals@8..9 "="
Whitespace@9..10 " "
VariableRef@10..13
Ident@10..13 "bar""#]],
);
}
}
We can now implement parsing of variable definitions:
pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
match p.peek() {
Some(SyntaxKind::LetKw) => variable_def(p),
_ => None,
}
}
fn variable_def(p: &mut Parser) -> Option<CompletedMarker> {
assert!(p.at(SyntaxKind::LetKw));
let m = p.start();
p.bump();
assert!(p.at(SyntaxKind::Ident));
p.bump();
assert!(p.at(SyntaxKind::Equals));
p.bump();
expr::expr(p)?;
Some(m.complete(p, SyntaxKind::VariableDef))
}
Instead of handling errors, we repeatedly assert
the token we’re at; since we know at that point the token we’re at, we can just bump
past it.
Let’s quickly define SyntaxKind::VariableDef
:
// crates/syntax/src/lib.rs
#[derive(Debug, Copy, Clone, PartialEq, FromPrimitive, ToPrimitive)]
pub enum SyntaxKind {
// snip
Root,
InfixExpr,
Literal,
ParenExpr,
PrefixExpr,
VariableDef,
VariableRef,
}
Our parser compiles, but we get a failing test because root
doesn’t use stmt
. The solution is to have stmt
fall back to calling expr
if it isn’t sure what it’s looking at:
// crates/parser/src/grammar/stmt.rs
pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
match p.peek() {
Some(SyntaxKind::LetKw) => variable_def(p),
_ => expr::expr(p),
}
}
We can now call stmt
from root
:
// grammar.rs
pub(crate) fn root(p: &mut Parser) -> CompletedMarker {
let m = p.start();
stmt::stmt(p);
m.complete(p, SyntaxKind::Root)
}
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 18 tests
...............F..
failures:
---- grammar::stmt::tests::parse_variable_definition stdout ----
thread 'grammar::stmt::tests::parse_variable_definition' panicked at 'Markers need to be completed', /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/drop_bomb-0.1.5/src/lib.rs:113:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
grammar::stmt::tests::parse_variable_definition
test result: FAILED. 17 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
What seems to have happened is that expr
has returned None
, causing variable_def
to return early, which in turn means the marker isn’t completed. Since we don’t care about error handling yet, we can add an unwrap
:
// stmt.rs
pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
match p.peek() {
Some(SyntaxKind::LetKw) => Some(variable_def(p)),
_ => expr::expr(p),
}
}
fn variable_def(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::LetKw));
let m = p.start();
p.bump();
assert!(p.at(SyntaxKind::Ident));
p.bump();
assert!(p.at(SyntaxKind::Equals));
p.bump();
expr::expr(p).unwrap();
m.complete(p, SyntaxKind::VariableDef)
}
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 18 tests
..............F...
failures:
---- grammar::stmt::tests::parse_variable_definition stdout ----
thread 'grammar::stmt::tests::parse_variable_definition' panicked at 'called `Option::unwrap()` on a `None` value', crates/parser/src/grammar/stmt.rs:21:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
grammar::stmt::tests::parse_variable_definition
test result: FAILED. 17 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
Huh, that’s weird; why is expr
returning None
? The only places where it can do so are let lhs = lhs()?;
and if it doesn’t see an operator. Oh, wait.
We should be returning lhs
at that point instead:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?; // we’ll handle errors later.
loop {
let op = match p.peek() {
// snip
_ => break, // we’ll handle errors later.
};
// snip
}
Some(lhs)
}
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
A short intermission
Let’s say the parser is being used as part of a language server, and the user has typed out a variable definition:
let b = a
They realise they haven’t defined a
, so they define it above:
let a = 10
let b = a
While the user is in the middle of typing that, our parser sees this:
let a =
let b = a
What we want is for our parser to recognise two variable definitions, with the first missing an expression. This way, language intelligence like Go To Definition can continue working in spite of an incomplete file; if we tell the editor to ‘go to the definition of a
’, it will know where to go because it’s recognised two separate variable definitions. The parser would also report an error on the second let
keyword along the lines of
expected number, identifier, ‘-’ or ‘(’, but found ‘let’
Our goal for the rest of this part is to make Eldiro have this behaviour.
A naive strategy for error recovery
This is the simplest of all the error recovery techniques; if you expect to see a Foo
, but see something else, you skip tokens until you find a Foo
. This is simple to implement but doesn’t deliver the best output, since it throws away a lot of possibly-useful tokens. For example, all the underlined tokens here are thrown away:
let a =
let b = a
-------
Here the IDE’s Go To Definition on a
wouldn’t work, and the user would get an ‘undefined variable a
’ error. Evidently, this error recovery strategy doesn’t let us achieve the ideal case we laid out before.
A better way
The idea is to start with the above approach, but don’t skip certain tokens. For example, let’s say we have the following code:
{
let a = {
let bar =
}
10
}
The naive approach would start looking for an expression after the let bar =
, and would stop throwing away tokens at 10
. It now consumes the final }
to close the let a = {
block. This doesn’t leave a }
to close the surrounding block, though, which means the user gets an ‘unclosed {
’ error. Here’s roughly what the parser sees if you don’t include errors and fix the indentation:
{
let a = {
let bar = 10
}
With the fancier approach, we can tell the parser not to skip }
tokens, since those are important for recognising program structure. In languages with explicit statement terminators, parsers are often constructed so that ;
is never skipped. These tokens that aren’t to be skipped by the parser during error recovery are called ‘recovery sets’.
We can improve on this approach by only ever skipping one token, which means we throw away even less of the input.
Remember that ideal example from earlier?
let a =
let b = a
We can achieve the goal of recognising two variable declarations by including let
in the parser’s recovery set. This way, whenever the parser encounters an unexpected let
, it will stop parsing what it’s up to and keep returning from subparsers until it gets to a state in which let
can be recognised.
Implementing error recovery
Let’s write a test for the case we just discussed:
// stmt.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn recover_on_let_token() {
check(
"let a =\nlet b = a",
expect![[r#"
Root@0..17
VariableDef@0..8
LetKw@0..3 "let"
Whitespace@3..4 " "
Ident@4..5 "a"
Whitespace@5..6 " "
Equals@6..7 "="
Whitespace@7..8 "\n"
VariableDef@8..17
LetKw@8..11 "let"
Whitespace@11..12 " "
Ident@12..13 "b"
Whitespace@13..14 " "
Equals@14..15 "="
Whitespace@15..16 " "
VariableRef@16..17
Ident@16..17 "a""#]],
);
}
}
Let’s write an expect
method on Parser
that takes a given SyntaxKind
and bump
s past it if the parser is up to it; if it isn’t, then we might skip the token depending on whether it’s part of the recovery set:
// parser.rs
const RECOVERY_SET: [SyntaxKind; 1] = [SyntaxKind::LetKw];
// snip
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn expect(&mut self, kind: SyntaxKind) {
if self.at(kind) || !self.peek().map_or(false, |k| RECOVERY_SET.contains(&k)) {
self.bump();
}
}
// snip
}
Let’s use this method in variable_def
:
// stmt.rs
fn variable_def(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::LetKw));
let m = p.start();
p.bump();
p.expect(SyntaxKind::Ident);
p.expect(SyntaxKind::Equals);
expr::expr(p).unwrap();
m.complete(p, SyntaxKind::VariableDef)
}
Although this does make the code cleaner, the test still fails. After all, the part that’s breaking it is in expr
, not here!
Let’s write an error
method that we can call if expr
has exhausted all possibilities:
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn expect(&mut self, kind: SyntaxKind) {
if self.at(kind) {
self.bump();
} else {
self.error();
}
}
pub(crate) fn error(&mut self) {
if !self.at_set(&RECOVERY_SET) {
self.bump();
}
}
// snip
fn at_set(&mut self, set: &[SyntaxKind]) -> bool {
self.peek().map_or(false, |k| set.contains(&k))
}
// snip
}
We’ve reimplemented Parser::expect
in terms of Parser::error
, and we’ve also defined a small helper method that checks whether the parser is currently at a given set of SyntaxKind
s. Let’s make use of this code:
// expr.rs
fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
let cm = match p.peek() {
Some(SyntaxKind::Number) => literal(p),
Some(SyntaxKind::Ident) => variable_ref(p),
Some(SyntaxKind::Minus) => prefix_expr(p),
Some(SyntaxKind::LParen) => paren_expr(p),
_ => {
p.error();
return None;
}
};
Some(cm)
}
We now have multiple failing tests! This is because we are trying to call p.bump()
when we’ve arrived at the end of the input. We could add a check for if we are at the end of the input and not call p.error()
then, but that would be incorrect. Think about it: if we’re trying to parse the left-hand side of an expression and are at the end of the input, that’s rightly an error. Instead, we should avoid calling Parser::bump
in Parser::error
if we’re at the end of the input.
// parser.rs
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
self.bump();
}
}
// snip
pub(crate) fn at_end(&mut self) -> bool {
self.peek().is_none()
}
// snip
}
Our test is still failing because we’re calling unwrap
in variable_def
. We don’t need that anymore because we’re implementing proper error recovery:
// stmt.rs
fn variable_def(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::LetKw));
let m = p.start();
p.bump();
p.expect(SyntaxKind::Ident);
p.expect(SyntaxKind::Equals);
expr::expr(p);
m.complete(p, SyntaxKind::VariableDef)
}
Running our tests shows that the parser is parsing the first variable definition correctly, but the second one is missing. This is because grammar::root
only consumes a single statement. Let’s make it consume multiple:
// grammar.rs
pub(crate) fn root(p: &mut Parser) -> CompletedMarker {
let m = p.start();
while !p.at_end() {
stmt::stmt(p);
}
m.complete(p, SyntaxKind::Root)
}
Let’s write a test for that:
#[cfg(test)]
mod tests {
use crate::check;
use expect_test::expect;
#[test]
fn parse_multiple_statements() {
check(
"let a = 1\na",
expect![[r#"
Root@0..11
VariableDef@0..10
LetKw@0..3 "let"
Whitespace@3..4 " "
Ident@4..5 "a"
Whitespace@5..6 " "
Equals@6..7 "="
Whitespace@7..8 " "
Literal@8..10
Number@8..9 "1"
Whitespace@9..10 "\n"
VariableRef@10..11
Ident@10..11 "a""#]],
);
}
}
Now both that test and recover_on_let_token
pass!
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 20 tests
....................
test result: ok. 20 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s wrap the tokens we bump
in Parser::error
in an Error
node so we can distinguish them from tokens that are actually meant to be there:
// parser.rs
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
let m = self.start();
self.bump();
m.complete(self, SyntaxKind::Error);
}
}
// snip
}
We don’t currently have any tests that contain erroneous tokens that end up being skipped, so all our tests are still passing.
Currently, expr_binding_power
doesn’t do any error recovery directly; heck, we still have those comments in there reminding us to add it! Let’s remove them:
// expr.rs
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = match p.peek() {
// snip
_ => break,
};
// snip
}
Some(lhs)
}
What should we do in the case of the break
when we’re working out the binding power of the current operator? One approach is to fall back to naive error recovery and loop until we find an operator. This brings with it all the problems of this strategy, though. Another option that’s used by rust-analyzer is to define a dummy NotAnOp
variant of BinaryOp
with a binding power of (0, 1)
. Let’s try that out:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = match p.peek() {
Some(SyntaxKind::Plus) => BinaryOp::Add,
Some(SyntaxKind::Minus) => BinaryOp::Sub,
Some(SyntaxKind::Star) => BinaryOp::Mul,
Some(SyntaxKind::Slash) => BinaryOp::Div,
_ => {
p.error();
BinaryOp::NotAnOp
}
};
// snip
}
Some(lhs)
}
// snip
enum BinaryOp {
Add,
Sub,
Mul,
Div,
NotAnOp,
}
impl BinaryOp {
fn binding_power(&self) -> (u8, u8) {
match self {
Self::NotAnOp => (0, 1),
Self::Add | Self::Sub => (1, 2),
Self::Mul | Self::Div => (3, 4),
}
}
}
If we run our tests we’ll see that a lot have panicked on Option::unwrap
. This is due to the p.bump()
call in expr_binding_power
that panics when the parser is at the end of its input. We can solve this bug by adding an arm for None
, and breaking out of the loop. It’s correct not to call Parser::error
here, since parsing out a left-hand side and then being at the end of the input isn’t bad – it just means that this expression wasn’t binary.
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = match p.peek() {
Some(SyntaxKind::Plus) => BinaryOp::Add,
Some(SyntaxKind::Minus) => BinaryOp::Sub,
Some(SyntaxKind::Star) => BinaryOp::Mul,
Some(SyntaxKind::Slash) => BinaryOp::Div,
None => break,
_ => {
p.error();
BinaryOp::NotAnOp
}
};
// snip
}
Some(lhs)
}
Some of our tests are now failing on a failed assertion that the parser is up to )
. This is because we also need to return from expr_binding_power
in that case; if we see a )
, we know we’re done parsing this expression and can let the caller continue:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = match p.peek() {
Some(SyntaxKind::Plus) => BinaryOp::Add,
Some(SyntaxKind::Minus) => BinaryOp::Sub,
Some(SyntaxKind::Star) => BinaryOp::Mul,
Some(SyntaxKind::Slash) => BinaryOp::Div,
Some(SyntaxKind::RParen) | None => break,
_ => {
p.error();
BinaryOp::NotAnOp
}
};
// snip
}
Some(lhs)
}
We now have one failing test – parse_multiple_statements
. Here’s the test’s input string:
let a = 1
a
The parser goes along its way, bump
ing let
, a
and =
, at which point it needs an expression. It parses out 1
as lhs
, and thinks ‘alright, I now need an operator’. It sees the a
on the next line and thinks ‘oh, I didn’t expect this! I guess the user meant to type an operator, but typed a
instead. Since this isn’t in RECOVERY_SET
I’ll wrap this in an Error
node and continue parsing’. The parser now tries to bump
past the operator, even though it’s already done so in Parser::error
and therefore is at the end of the input, causing a panic.
We have two options:
- detect the newline between
1
anda
and decide that they must be separate expressions. This implies disallowing newlines between infix operators and their operands, which is a relatively common use-case (think of long expressions broken across multiple lines). We would want to handle newlines automatically, just like whitespace, which means we would have to create machinery that allows us to both check for newlines while also ignoring their existence. Furthermore, making Eldiro whitespace-sensitive is not a decision I take lightly; it’s nice to not have to worry or think about whitespace when you’re writing code, and instead let the autoformatter add it for you. - give up if we don’t see an operator and just return from the function. This means we’ll get worse error messages if the user mistypes an infix operator as another token.1
To me, option two seems like the lesser evil, so that’s what we’ll implement. Let’s undo all those changes:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = match p.peek() {
Some(SyntaxKind::Plus) => BinaryOp::Add,
Some(SyntaxKind::Minus) => BinaryOp::Sub,
Some(SyntaxKind::Star) => BinaryOp::Mul,
Some(SyntaxKind::Slash) => BinaryOp::Div,
// We’re not at an operator; we don’t know what to do next, so we return and let the
// caller decide.
_ => break,
};
// snip
}
Some(lhs)
}
// snip
enum BinaryOp {
Add,
Sub,
Mul,
Div,
}
impl BinaryOp {
fn binding_power(&self) -> (u8, u8) {
match self {
Self::Add | Self::Sub => (1, 2),
Self::Mul | Self::Div => (3, 4),
}
}
}
The only sketchy non-recovering bit of the parser left is paren_expr
, which assert
s the existence of a closing parenthesis. Let’s write a test to make sure that we don’t get a panic on an unclosed parenthesis:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_unclosed_parentheses() {
check(
"(foo",
expect![[r#"
Root@0..4
ParenExpr@0..4
LParen@0..1 "("
VariableRef@1..4
Ident@1..4 "foo""#]],
);
}
}
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 21 tests
..............F......
failures:
---- grammar::expr::tests::parse_unclosed_parentheses stdout ----
thread 'grammar::expr::tests::parse_unclosed_parentheses' panicked at 'assertion failed: p.at(SyntaxKind::RParen)', crates/parser/src/grammar/expr.rs:122:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
grammar::expr::tests::parse_unclosed_parentheses
test result: FAILED. 20 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
As expected, the parser panicked on that assert
. Let’s replace the assert
-and-bump
dance with a simple call to expect
:
fn paren_expr(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::LParen));
let m = p.start();
p.bump();
expr_binding_power(p, 0);
p.expect(SyntaxKind::RParen);
m.complete(p, SyntaxKind::ParenExpr)
}
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 21 tests
.....................
test result: ok. 21 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Although we’re done with error recovery for now, this topic is deep and we’ll have to consider it in all changes we make to the parser in future.
Implementing error reporting
Now that our parser doesn’t explode when the user makes a typo, we’re ready to move on to the next topic: error reporting. In cases such as the missing parenthesis, the error of not including the parenthesis is silently ignored. A way to ensure parse errors are being created properly is to include them below the syntax tree in tests; with this change all inputs that have parse errors have to document them.
One aspect of good error messages is that they show what tokens the parser was expecting to see in the situation that caused the error. This gives the user a hint as to what they might try to fix the error. To implement this behaviour, the parser has to track which tokens it’s expecting throughout the parsing process; when it encounters an error this set can be displayed. We can use a technique mentioned on Reddit by u/matklad:
- Every time
Parser::at
is called we’ll add theSyntaxKind
being enquired about to anexpected_kinds
field onParser
- Whenever a token is added to the syntax tree the set is cleared
Let’s write this now:
// parser.rs
pub(crate) struct Parser<'t, 'input> {
source: Source<'t, 'input>,
events: Vec<Event>,
expected_kinds: Vec<SyntaxKind>,
}
impl<'t, 'input> Parser<'t, 'input> {
pub(crate) fn new(source: Source<'t, 'input>) -> Self {
Self {
source,
events: Vec::new(),
expected_kinds: Vec::new(),
}
}
// snip
pub(crate) fn bump(&mut self) {
self.expected_kinds.clear();
self.source.next_token().unwrap();
self.events.push(Event::AddToken);
}
pub(crate) fn at(&mut self, kind: SyntaxKind) -> bool {
self.expected_kinds.push(kind);
self.peek() == Some(kind)
}
// snip
}
Throughout grammar
we often match
on p.peek()
to determine what to do next. Although this results in easy-to-read code, it is incompatible with this new approach. How is the parser meant to manage expected_kinds
if it can’t know which arm of the match
is being run and which ones have been tried before? We’ll have to convert all these match
es to if else
chains that use Parser::at
.
To ensure we don’t accidentally use the old match
approach in future, let’s make Parser::peek
private:
impl<'t, 'input> Parser<'t, 'input> {
// snip
fn peek(&mut self) -> Option<SyntaxKind> {
self.source.peek_kind()
}
}
Let’s update all the usages of Parser::peek
now:
// expr.rs
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
let op = if p.at(SyntaxKind::Plus) {
BinaryOp::Add
} else if p.at(SyntaxKind::Minus) {
BinaryOp::Sub
} else if p.at(SyntaxKind::Star) {
BinaryOp::Mul
} else if p.at(SyntaxKind::Slash) {
BinaryOp::Div
} else {
// We’re not at an operator; we don’t know what to do next, so we return and let the
// caller decide.
break;
};
// snip
}
Some(lhs)
}
fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
let cm = if p.at(SyntaxKind::Number) {
literal(p)
} else if p.at(SyntaxKind::Ident) {
variable_ref(p)
} else if p.at(SyntaxKind::Minus) {
prefix_expr(p)
} else if p.at(SyntaxKind::LParen) {
paren_expr(p)
} else {
p.error();
return None;
};
Some(cm)
}
// stmt.rs
pub(super) fn stmt(p: &mut Parser) -> Option<CompletedMarker> {
if p.at(SyntaxKind::LetKw) {
Some(variable_def(p))
} else {
expr::expr(p)
}
}
That’s … not that bad. Most long if else
chains look horribly ugly, but this one is ok.
The next change we need to make is emit errors from the parser. For that we’ll need a type that can represent a parse error:
// parser.rs
pub(crate) mod marker;
mod parse_error;
pub(crate) use parse_error::ParseError;
use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use marker::Marker;
use syntax::SyntaxKind;
// crates/parser/src/parser/parse_error.rs
use syntax::SyntaxKind;
use text_size::TextRange;
#[derive(Debug, PartialEq)]
pub(crate) struct ParseError {
pub(super) expected: Vec<SyntaxKind>,
pub(super) found: Option<SyntaxKind>,
pub(super) range: TextRange,
}
# crates/parser/Cargo.toml
[dependencies]
drop_bomb = "0.1.5"
lexer = {path = "../lexer"}
rowan = "0.10.0"
syntax = {path = "../syntax"}
text-size = "1.0.0"
We’re using TextRange
from the text-size crate instead of std::ops::Range<usize>
because TextRange
uses u32
internally, making its size in memory smaller.
The way we’ll communicate the parse errors the parser finds to other code is through the existing Event
infrastructure. Let’s create Event::Error
:
// event.rs
use crate::parser::ParseError;
use syntax::SyntaxKind;
#[derive(Debug, PartialEq)]
pub(crate) enum Event {
StartNode {
kind: SyntaxKind,
forward_parent: Option<usize>,
},
AddToken,
FinishNode,
Error(ParseError),
Placeholder,
}
We can push the Event::Error
event when the parser encounters an error (which we’ve conveniently centralised into the Parser::error
method):
// parser.rs
use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;
// snip
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
let found = self.peek();
self.events.push(Event::Error(ParseError {
expected: mem::take(&mut self.expected_kinds),
found,
range: todo!(),
}));
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
let m = self.start();
self.bump();
m.complete(self, SyntaxKind::Error);
}
}
// snip
}
That mem::take
function returns what is passed in, and replaces the thing that was passed in with its default value. In other words, it lets us concisely take ownership of self.expected_kinds
while replacing it with a new empty Vec
. We don’t have a way to get the range of the current token, though.
Let’s add a range
field to Token
:
// crates/lexer/src/lib.rs
use logos::Logos;
use text_size::TextRange;
// snip
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
pub kind: TokenKind,
pub text: &'a str,
pub range: TextRange,
}
# crates/lexer/Cargo.toml
[dependencies]
logos = "0.11.4"
text-size = "1.0.0"
We need to update Lexer
to emit ranges:
// lib.rs
use logos::Logos;
use std::convert::TryFrom;
use std::ops::Range as StdRange;
use text_size::{TextRange, TextSize};
// snip
impl<'a> Iterator for Lexer<'a> {
type Item = Token<'a>;
fn next(&mut self) -> Option<Self::Item> {
let kind = self.inner.next()?;
let text = self.inner.slice();
let range = {
let StdRange { start, end } = self.inner.span();
let start = TextSize::try_from(start).unwrap();
let end = TextSize::try_from(end).unwrap();
TextRange::new(start, end)
};
Some(Self::Item { kind, text, range })
}
}
// token_kind.rs
#[cfg(test)]
mod tests {
use super::*;
use crate::Lexer;
fn check(input: &str, kind: TokenKind) {
let mut lexer = Lexer::new(input);
let token = lexer.next().unwrap();
assert_eq!(token.kind, kind);
assert_eq!(token.text, input);
}
// snip
}
To access these ranges from the parser, we need to change Source
’s API:
// source.rs
impl<'t, 'input> Source<'t, 'input> {
// snip
pub(crate) fn peek_token(&mut self) -> Option<&Token> {
self.eat_trivia();
self.peek_token_raw()
}
// snip
fn peek_kind_raw(&self) -> Option<SyntaxKind> {
self.peek_token_raw()
.map(|Token { kind, .. }| (*kind).into())
}
fn peek_token_raw(&self) -> Option<&Token> {
self.tokens.get(self.cursor)
}
}
And use it from Parser::error
:
use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use lexer::Token;
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;
// snip
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
let current_token = self.source.peek_token();
let (found, range) = current_token.map_or_else(
|| (None, todo!()),
|Token { kind, range, .. }| (Some((*kind).into()), *range),
);
self.events.push(Event::Error(ParseError {
expected: mem::take(&mut self.expected_kinds),
found,
range,
}));
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
let m = self.start();
self.bump();
m.complete(self, SyntaxKind::Error);
}
}
// snip
}
Take a look at that todo!()
though; what range do we use for the parse error if we’re at the end of the input? To me the most sensible answer is to use the range of the last token. Let’s add another method to Source
:
use lexer::Token;
use syntax::SyntaxKind;
use text_size::TextRange;
// snip
impl<'t, 'input> Source<'t, 'input> {
// snip
pub(crate) fn last_token_range(&self) -> Option<TextRange> {
self.tokens.last().map(|Token { range, .. }| *range)
}
// snip
}
We return an Option
because it isn’t up to Source
to handle the case in which the input is empty; we’ll leave that up to the parser. Speaking of which, Parser:error
won’t ever be called when the input is empty because we know that grammar::root
is fine with empty inputs. Thus, it is safe to unwrap:
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
let current_token = self.source.peek_token();
let (found, range) = current_token.map_or_else(
// If we’re at the end of the input we use the range of the very last token in the
// input.
|| (None, self.source.last_token_range().unwrap()),
|Token { kind, range, .. }| (Some((*kind).into()), *range),
);
self.events.push(Event::Error(ParseError {
expected: mem::take(&mut self.expected_kinds),
found,
range,
}));
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
let m = self.start();
self.bump();
m.complete(self, SyntaxKind::Error);
}
}
// snip
}
We have one last compilation error in Sink
, which we need to change to maintain a list of parse errors:
// sink.rs
use super::event::Event;
use crate::parser::ParseError;
use lexer::Token;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::mem;
use syntax::{EldiroLanguage, SyntaxKind};
pub(crate) struct Sink<'t, 'input> {
builder: GreenNodeBuilder<'static>,
tokens: &'t [Token<'input>],
cursor: usize,
events: Vec<Event>,
errors: Vec<ParseError>,
}
impl<'t, 'input> Sink<'t, 'input> {
pub(crate) fn new(tokens: &'t [Token<'input>], events: Vec<Event>) -> Self {
Self {
builder: GreenNodeBuilder::new(),
tokens,
cursor: 0,
events,
errors: Vec::new(),
}
}
pub(crate) fn finish(mut self) -> GreenNode {
for idx in 0..self.events.len() {
match mem::replace(&mut self.events[idx], Event::Placeholder) {
// snip
Event::Error(error) => self.errors.push(error),
// snip
}
self.eat_trivia();
}
self.builder.finish()
}
// snip
fn token(&mut self) {
let Token { kind, text, .. } = self.tokens[self.cursor];
self.builder
.token(EldiroLanguage::kind_to_raw(kind.into()), text.into());
self.cursor += 1;
}
}
Fixing these basic compilation errors has revealed a borrow checker error; fortunately for us, this is trivial to fix by converting the call to Option::map_or_else
in Parser::error
to an if let
:
// parser.rs
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn error(&mut self) {
let current_token = self.source.peek_token();
let (found, range) = if let Some(Token { kind, range, .. }) = current_token {
(Some((*kind).into()), *range)
} else {
// If we’re at the end of the input we use the range of the very last token in the
// input.
(None, self.source.last_token_range().unwrap())
};
self.events.push(Event::Error(ParseError {
expected: mem::take(&mut self.expected_kinds),
found,
range,
}));
if !self.at_set(&RECOVERY_SET) && !self.at_end() {
let m = self.start();
self.bump();
m.complete(self, SyntaxKind::Error);
}
}
// snip
}
Although we now have a Vec<ParseError>
, we aren’t using it! Let’s add it as a field to Parse
, which we’ll construct directly from Sink::finish
:
// lib.rs
use lexer::Lexer;
use parser::{ParseError, Parser};
use rowan::GreenNode;
use sink::Sink;
use source::Source;
use syntax::SyntaxNode;
pub fn parse(input: &str) -> Parse {
let tokens: Vec<_> = Lexer::new(input).collect();
let source = Source::new(&tokens);
let parser = Parser::new(source);
let events = parser.parse();
let sink = Sink::new(&tokens, events);
sink.finish()
}
pub struct Parse {
green_node: GreenNode,
errors: Vec<ParseError>,
}
// sink.rs
use crate::event::Event;
use crate::parser::ParseError;
use crate::Parse;
use lexer::Token;
use rowan::{GreenNodeBuilder, Language};
use std::mem;
use syntax::{EldiroLanguage, SyntaxKind};
// snip
impl<'t, 'input> Sink<'t, 'input> {
// snip
pub(crate) fn finish(mut self) -> Parse {
// snip
Parse {
green_node: self.builder.finish(),
errors: self.errors,
}
}
// snip
}
We need to be able to display parse errors, both for display to the user and for usage in tests. Let’s write a test to start:
// parse_error.rs
#[cfg(test)]
mod tests {
use super::*;
use std::ops::Range as StdRange;
fn check(
expected: Vec<SyntaxKind>,
found: Option<SyntaxKind>,
range: StdRange<u32>,
output: &str,
) {
let error = ParseError {
expected,
found,
range: {
let start = range.start.into();
let end = range.end.into();
TextRange::new(start, end)
},
};
assert_eq!(format!("{}", error), output);
}
#[test]
fn one_expected_did_find() {
check(
vec![SyntaxKind::Equals],
Some(SyntaxKind::Ident),
10..20,
"error at 10..20: expected ‘=’, but found identifier",
);
}
}
Let’s write a dead-simple initial implementation:
use std::fmt;
use syntax::SyntaxKind;
use text_size::TextRange;
// snip
impl fmt::Display for ParseError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"error at {}..{}: expected {}, but found {}",
u32::from(self.range.start()),
u32::from(self.range.end()),
self.expected[0],
self.found.unwrap(),
)
}
}
This depends on a Display
implementation on SyntaxKind
; we can write that now:
// crates/syntax/src/lib.rs
use lexer::TokenKind;
use num_derive::{FromPrimitive, ToPrimitive};
use num_traits::{FromPrimitive, ToPrimitive};
use std::fmt;
// snip
impl fmt::Display for SyntaxKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(match self {
SyntaxKind::Whitespace => "whitespace",
SyntaxKind::FnKw => "‘fn’",
SyntaxKind::LetKw => "‘let’",
SyntaxKind::Ident => "identifier",
SyntaxKind::Number => "number",
SyntaxKind::Plus => "‘+’",
SyntaxKind::Minus => "‘-’",
SyntaxKind::Star => "‘*’",
SyntaxKind::Slash => "‘/’",
SyntaxKind::Equals => "‘=’",
SyntaxKind::LParen => "‘(’",
SyntaxKind::RParen => "‘)’",
SyntaxKind::LBrace => "‘{’",
SyntaxKind::RBrace => "‘}’",
SyntaxKind::Comment => "comment",
_ => unreachable!(),
})
}
}
Let’s verify that parse errors are emitted properly by including them in tests:
// crates/parser/src/lib.rs
impl Parse {
pub fn debug_tree(&self) -> String {
let mut s = String::new();
let syntax_node = SyntaxNode::new_root(self.green_node.clone());
let tree = format!("{:#?}", syntax_node);
// We cut off the last byte because formatting the SyntaxNode adds on a newline at the end.
s.push_str(&tree[0..tree.len() - 1]);
for error in &self.errors {
s.push_str(&format!("\n{}", error));
}
s
}
}
We get two test failures, one of which is from the unwrap
in ParseError
’s Display
implementation, while the other is from the error message not being in the test’s expected text. Let’s write a test for the case where the error is at the end of the input and found
is therefore None
:
// parse_error.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn one_expected_did_not_find() {
check(
vec![SyntaxKind::RParen],
None,
5..6,
"error at 5..6: expected ‘)’",
);
}
}
To solve this, we can change the Display
implementation to check if found
is None
:
impl fmt::Display for ParseError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"error at {}..{}: expected {}",
u32::from(self.range.start()),
u32::from(self.range.end()),
self.expected[0]
)?;
if let Some(found) = self.found {
write!(f, ", but found {}", found)?;
}
Ok(())
}
}
We should write a test to cover the case where expected
has more than one kind:
#[cfg(test)]
mod tests {
// snip
#[test]
fn multiple_expected_did_find() {
check(
vec![
SyntaxKind::Number,
SyntaxKind::Ident,
SyntaxKind::Minus,
SyntaxKind::LParen,
],
Some(SyntaxKind::LetKw),
100..105,
"error at 100..105: expected number, identifier, ‘-’ or ‘(’, but found ‘let’",
);
}
}
This can be implemented with a loop:
impl fmt::Display for ParseError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"error at {}..{}: expected ",
u32::from(self.range.start()),
u32::from(self.range.end()),
)?;
let num_expected = self.expected.len();
let is_first = |idx| idx == 0;
let is_last = |idx| idx == num_expected - 1;
for (idx, expected_kind) in self.expected.iter().enumerate() {
if is_first(idx) {
write!(f, "{}", expected_kind)?;
} else if is_last(idx) {
write!(f, " or {}", expected_kind)?;
} else {
write!(f, ", {}", expected_kind)?;
}
}
if let Some(found) = self.found {
write!(f, ", but found {}", found)?;
}
Ok(())
}
}
Let’s add a test for when two SyntaxKind
are expected to make sure that or
is added between them instead of a comma:
#[cfg(test)]
mod tests {
// snip
#[test]
fn two_expected_did_find() {
check(
vec![SyntaxKind::Plus, SyntaxKind::Minus],
Some(SyntaxKind::Equals),
0..1,
"error at 0..1: expected ‘+’ or ‘-’, but found ‘=’",
);
}
// snip
}
Looking through the failed tests shows that the error messages are being generated properly. We can update these automatically using expect-test:
$ UPDATE_EXPECT=1 cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 25 tests
.........................
test result: ok. 25 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s try out our error recovery and reporting in the REPL:
$ cargo r -q
→ (1+
Root@0..4
ParenExpr@0..4
LParen@0..1 "("
InfixExpr@1..4
Literal@1..2
Number@1..2 "1"
Plus@2..3 "+"
Whitespace@3..4 "\n"
error at 3..4: expected number, identifier, ‘-’ or ‘(’
error at 3..4: expected ‘+’, ‘-’, ‘*’, ‘/’ or ‘)’
The first of those two parse errors makes sense, because an expression is expected after an operator. The second one, not so much; why would you add another operator after an existing operator?
It’s non-obvious why the parser is emitting the second error message (apart from the )
); to understand why it’s doing this we should keep the basic structure of expr_binding_power
in mind:
fn expr_binding_power() -> Option<CompletedMarker> {
let lhs = lhs()?;
loop {
let op = get_operator();
expr_binding_power();
}
Some(lhs)
}
Let’s imagine we’re parsing the 1+
in (1+
. We successfully get a left-hand side, and now enter the loop and extract an operator. We now recurse, and fail to get a left-hand side. This means we return None
from this inner call. Here’s where the problem lies: rather than exiting because we couldn’t get a right-hand side, we just loop around again and try to get another operator. Let’s add a check to expr_binding_power
to fix this issue:
// expr.rs
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) -> Option<CompletedMarker> {
let mut lhs = lhs(p)?;
loop {
// snip
let m = lhs.precede(p);
let parsed_rhs = expr_binding_power(p, right_binding_power).is_some();
lhs = m.complete(p, SyntaxKind::InfixExpr);
if !parsed_rhs {
break;
}
}
Some(lhs)
}
Instead of manually typing out that scenario from before in the REPL to find out if we’ve squashed the bug, we can write a test to ensure we never regress on this specific scenario:
#[cfg(test)]
mod tests {
// snip
#[test]
fn do_not_parse_operator_if_gettting_rhs_failed() {
check(
"(1+",
expect![[r#"
Root@0..3
ParenExpr@0..3
LParen@0..1 "("
InfixExpr@1..3
Literal@1..2
Number@1..2 "1"
Plus@2..3 "+"
error at 2..3: expected number, identifier, ‘-’ or ‘(’
error at 2..3: expected ‘)’"#]],
);
}
// snip
}
It makes sense that we would get two errors, since there really are two errors here: that of the binary expression missing a right-hand side, and that of an unclosed parenthesis.
$ cargo t -q --lib
running 18 tests
..................
test result: ok. 18 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
running 26 tests
..........................
test result: ok. 26 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Refactoring
We’re almost done! Currently, Eldiro uses TokenKind
s exclusively in the lexer, and converts them to SyntaxKind
s for use everywhere else. We’d be better off with using TokenKind
for things that can’t be any of the node variants exclusive to SyntaxKind
, such as the SyntaxKind
passed into Parser::at
. This is a straightforward change to make, so I’ll just show the changed sections without any explanation:
// grammar.rs
use crate::parser::marker::CompletedMarker;
use crate::parser::Parser;
use lexer::TokenKind;
use syntax::SyntaxKind;
// parser.rs
use crate::event::Event;
use crate::grammar;
use crate::source::Source;
use lexer::{Token, TokenKind};
use marker::Marker;
use std::mem;
use syntax::SyntaxKind;
const RECOVERY_SET: [TokenKind; 1] = [TokenKind::LetKw];
pub(crate) struct Parser<'t, 'input> {
source: Source<'t, 'input>,
events: Vec<Event>,
expected_kinds: Vec<TokenKind>,
}
impl<'t, 'input> Parser<'t, 'input> {
// snip
pub(crate) fn expect(&mut self, kind: TokenKind) {
// snip
}
pub(crate) fn error(&mut self) {
let current_token = self.source.peek_token();
let (found, range) = if let Some(Token { kind, range, .. }) = current_token {
(Some(*kind), *range)
} else {
// If we’re at the end of the input we use the range of the very last token in the
// input.
(None, self.source.last_token_range().unwrap())
};
// snip
}
// snip
pub(crate) fn at(&mut self, kind: TokenKind) -> bool {
// snip
}
fn at_set(&mut self, set: &[TokenKind]) -> bool {
// snip
}
// snip
fn peek(&mut self) -> Option<TokenKind> {
// snip
}
}
// crates/lexer/src/token_kind.rs
use logos::Logos;
use std::fmt;
// snip
impl TokenKind {
pub fn is_trivia(self) -> bool {
matches!(self, Self::Whitespace | Self::Comment)
}
}
impl fmt::Display for TokenKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(match self {
Self::Whitespace => "whitespace",
Self::FnKw => "‘fn’",
Self::LetKw => "‘let’",
Self::Ident => "identifier",
Self::Number => "number",
Self::Plus => "‘+’",
Self::Minus => "‘-’",
Self::Star => "‘*’",
Self::Slash => "‘/’",
Self::Equals => "‘=’",
Self::LParen => "‘(’",
Self::RParen => "‘)’",
Self::LBrace => "‘{’",
Self::RBrace => "‘}’",
Self::Comment => "comment",
Self::Error => "an unrecognized token",
})
}
}
Delete SyntaxKind::is_trivia
and SyntaxKind
’s Display
implementation. Replace all uses of SyntaxKind
with TokenKind
in parse_error.rs
and sink.rs
. Now follow the Rust compiler’s errors to fix all the type errors that have now cropped up across the project. Sorry for this having a bit of a ‘this is left as an exercise for the reader’ feel, but I think that making this part any longer than it already is for the sake of simple substitutions is not a good idea.
Take a look at the diff of the commit that makes this change if you get confused or stuck.
Conclusion
Hopefully this wasn’t too difficult to follow! We’ll create structures to work with parsed code in the next part.
I mention this because, in my experience, this case is very rare; my most common typos with infix operators are mistyping one for the other, not as something completely different. ↩︎