We’ve got a number of topics to cover this time, so let’s get started.
A bug fix
u/mozjag pointed out on Reddit that the regex we’re using to lex identifiers mandates that they have a length of two, when we could also allow identifiers with a single character. Let’s write a test for lexing a single-letter identifier to verify the bug is actually present:
// lexer.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn lex_single_char_identifier() {
check("x", SyntaxKind::Ident);
}
// snip
}
$ cargo t -q
running 21 tests
..............F......
failures:
---- lexer::tests::lex_single_char_identifier stdout ----
thread 'lexer::tests::lex_single_char_identifier' panicked at 'assertion failed: `(left == right)`
left: `Some((Error, "x"))`,
right: `Some((Ident, "x"))`', crates/eldiro/src/lexer.rs:78:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
lexer::tests::lex_single_char_identifier
test result: FAILED. 20 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
We need to change the +
in the Ident
regex to be a *
instead, so that zero or more repetitions are allowed instead of one or more:
#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
// snip
#[regex("[A-Za-z][A-Za-z0-9]*")]
Ident,
// snip
}
$ cargo t -q
.....................
test result: ok. 21 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Negation
Up to now, all the operators we support have been binary, meaning that they have two operands. We’d like to have unary operators too, which are operators that have one operand. The operator we’ll add here is the prefix unary minus, as used for negation – the -
in -10
, for example. To start, we’ll need another node type:1
#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
// snip
Root,
BinOp,
PrefixExpr,
}
Let’s rename BinOp
to BinaryExpr
for consistency (after all, we aren’t abbreviating Prefix
):
#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
// snip
Root,
BinaryExpr,
PrefixExpr,
}
// expr.rs
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
// snip
loop {
// snip
p.start_node_at(checkpoint, SyntaxKind::BinaryExpr);
// snip
}
}
// snip
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_simple_binary_expression() {
check(
"1+2",
expect![[r#"
Root@0..3
BinaryExpr@0..3
Number@0..1 "1"
Plus@1..2 "+"
Number@2..3 "2""#]],
);
}
#[test]
fn parse_left_associative_binary_expression() {
check(
"1+2+3+4",
expect![[r#"
Root@0..7
BinaryExpr@0..7
BinaryExpr@0..5
BinaryExpr@0..3
Number@0..1 "1"
Plus@1..2 "+"
Number@2..3 "2"
Plus@3..4 "+"
Number@4..5 "3"
Plus@5..6 "+"
Number@6..7 "4""#]],
);
}
#[test]
fn parse_binary_expression_with_mixed_binding_power() {
check(
"1+2*3-4",
expect![[r#"
Root@0..7
BinaryExpr@0..7
BinaryExpr@0..5
Number@0..1 "1"
Plus@1..2 "+"
BinaryExpr@2..5
Number@2..3 "2"
Star@3..4 "*"
Number@4..5 "3"
Minus@5..6 "-"
Number@6..7 "4""#]],
);
}
}
We can write a test for parsing an expression with the negation operator:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_negation() {
check(
"-10",
expect![[r#"
Root@0..3
PrefixExpr@0..3
Minus@0..1 "-"
Number@1..3 "10""#]],
);
}
}
Now that an expression can start with something other than a number or an identifier, we need to modify that match
at the start of expr_binding_power
:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => todo!(),
_ => {}
}
// snip
}
First, we should determine the binding power of the prefix operator we’ve found. To do this, we need to define a type that represents all the possible prefix operators we might encounter – PrefixOp
sounds like a good name. Also, we should rename Op
to InfixOp
so that it’s differentiated from PrefixOp
:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => {
let op = PrefixOp::Neg;
let ((), right_binding_power) = op.binding_power();
}
_ => {}
}
loop {
let op = match p.peek() {
Some(SyntaxKind::Plus) => InfixOp::Add,
Some(SyntaxKind::Minus) => InfixOp::Sub,
Some(SyntaxKind::Star) => InfixOp::Mul,
Some(SyntaxKind::Slash) => InfixOp::Div,
_ => return, // we’ll handle errors later.
};
// snip
}
}
enum InfixOp {
Add,
Sub,
Mul,
Div,
}
impl InfixOp {
fn binding_power(&self) -> (u8, u8) {
// snip
}
}
enum PrefixOp {
Neg,
}
impl PrefixOp {
fn binding_power(&self) -> ((), u8) {
match self {
Self::Neg => ((), 5),
}
}
}
Note how the binding power of the negation operator is higher than that of any infix operator. This means that -10 + 5
is parsed as (-10) + 5
instead of -(10 + 5)
.
We now do the same as in the case of an infix operator:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => {
let op = PrefixOp::Neg;
let ((), right_binding_power) = op.binding_power();
// Eat the operator’s token.
p.bump();
p.start_node_at(checkpoint, SyntaxKind::PrefixExpr);
expr_binding_power(p, right_binding_power);
p.finish_node();
}
_ => {}
}
// snip
}
$ cargo t -q
running 22 tests
......................
test result: ok. 22 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s add a test to make sure precedence is working as we would hope:
#[cfg(test)]
mod tests {
// snip
#[test]
fn negation_has_higher_binding_power_than_infix_operators() {
check(
"-20+20",
expect![[r#"
Root@0..6
BinaryExpr@0..6
PrefixExpr@0..3
Minus@0..1 "-"
Number@1..3 "20"
Plus@3..4 "+"
Number@4..6 "20""#]],
);
}
}
$ cargo t -q
running 23 tests
.......................
test result: ok. 23 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Parentheses
Our lexer doesn’t lex parentheses, so we should start off by adding that:
// lexer.rs
#[derive(Debug, Copy, Clone, PartialEq, Logos, FromPrimitive, ToPrimitive)]
pub(crate) enum SyntaxKind {
// snip
#[token("(")]
LParen,
#[token(")")]
RParen,
// snip
}
// snip
#[cfg(test)]
mod tests {
// snip
#[test]
fn lex_left_parenthesis() {
check("(", SyntaxKind::LParen);
}
#[test]
fn lex_right_parenthesis() {
check(")", SyntaxKind::RParen);
}
// snip
}
$ cargo t -q
running 25 tests
.........................
test result: ok. 25 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s write a test for parsing a bunch of nested parentheses:
// expr.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_nested_parentheses() {
check(
"((((((10))))))",
expect![[r#"
Root@0..14
LParen@0..1 "("
LParen@1..2 "("
LParen@2..3 "("
LParen@3..4 "("
LParen@4..5 "("
LParen@5..6 "("
Number@6..8 "10"
RParen@8..9 ")"
RParen@9..10 ")"
RParen@10..11 ")"
RParen@11..12 ")"
RParen@12..13 ")"
RParen@13..14 ")""#]],
);
}
}
Let’s also add a test to see if adding parentheses allows us to change precedence as we would expect:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parentheses_affect_precedence() {
check(
"5*(2+1)",
expect![[r#"
Root@0..7
BinaryExpr@0..7
Number@0..1 "5"
Star@1..2 "*"
LParen@2..3 "("
BinaryExpr@3..6
Number@3..4 "2"
Plus@4..5 "+"
Number@5..6 "1"
RParen@6..7 ")""#]],
);
}
}
Once again we need to add another case to the match
statement at the top of expr_binding_power
:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => {
// snip
}
Some(SyntaxKind::LParen) => todo!(),
_ => {}
}
// snip
}
The first thing we should do is bump
the left parenthesis onto the current branch:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => {
// snip
}
Some(SyntaxKind::LParen) => {
p.bump();
todo!();
}
_ => {}
}
// snip
}
If you think about it, we can parse the contents of the parenthesised expression by calling expr_binding_power
with minimum_binding_power
set to 0
, since parenthesising something ‘resets’ precedence inside it. All we need to do after writing that is add another p.bump()
to account for the closing parenthesis:
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let checkpoint = p.checkpoint();
match p.peek() {
Some(SyntaxKind::Number) | Some(SyntaxKind::Ident) => p.bump(),
Some(SyntaxKind::Minus) => {
// snip
}
Some(SyntaxKind::LParen) => {
p.bump();
expr_binding_power(p, 0);
assert_eq!(p.peek(), Some(SyntaxKind::RParen));
p.bump();
}
_ => {}
}
// snip
}
$ cargo t -q
running 27 tests
...........................
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Next time
In the next part we’ll restructure how the parser works internally so Eldiro can support a seemingly-basic feature: whitespace.
It’s more appropriate to use
PrefixExpr
instead ofUnaryExpr
because a unary operator could also be postfix, meaning that you add it after the operand. For example,?
in Rust is a postfix operator. It makes sense to differentiate the two because we might have to treat them differently in future. ↩︎