Refactoring expr_binding_power
The right-hand side of the let lhs = ...
in expr_binding_power
is getting quite long, so let’s extract it to its own function:
// expr.rs
use super::marker::CompletedMarker;
use super::Parser;
use crate::lexer::SyntaxKind;
// snip
fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
let mut lhs = if let Some(lhs) = lhs(p) {
lhs
} else {
return; // we’ll handle errors later.
};
// snip
}
fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
let cm = match p.peek() {
Some(SyntaxKind::Number) => {
// snip
}
// all the other arms are also here, unchanged
_ => return None,
};
Some(cm)
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s also extract the parsers for each of the match
arms in lhs
:
fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
let cm = match p.peek() {
Some(SyntaxKind::Number) => literal(p),
Some(SyntaxKind::Ident) => variable_ref(p),
Some(SyntaxKind::Minus) => prefix_expr(p),
Some(SyntaxKind::LParen) => paren_expr(p),
_ => return None,
};
Some(cm)
}
// snip
fn literal(p: &mut Parser) -> CompletedMarker {
assert_eq!(p.peek(), Some(SyntaxKind::Number));
let m = p.start();
p.bump();
m.complete(p, SyntaxKind::Literal)
}
fn variable_ref(p: &mut Parser) -> CompletedMarker {
assert_eq!(p.peek(), Some(SyntaxKind::Ident));
let m = p.start();
p.bump();
m.complete(p, SyntaxKind::VariableRef)
}
fn prefix_expr(p: &mut Parser) -> CompletedMarker {
assert_eq!(p.peek(), Some(SyntaxKind::Minus));
let m = p.start();
let op = PrefixOp::Neg;
let ((), right_binding_power) = op.binding_power();
// Eat the operator’s token.
p.bump();
expr_binding_power(p, right_binding_power);
m.complete(p, SyntaxKind::PrefixExpr)
}
fn paren_expr(p: &mut Parser) -> CompletedMarker {
assert_eq!(p.peek(), Some(SyntaxKind::LParen));
let m = p.start();
p.bump();
expr_binding_power(p, 0);
assert_eq!(p.peek(), Some(SyntaxKind::RParen));
p.bump();
m.complete(p, SyntaxKind::ParenExpr)
}
Writing out that assert_eq!
at the start of each subparser is getting annoying. Let’s write a helper method on Parser
to make this easier:
// parser.rs
impl<'l, 'input> Parser<'l, 'input> {
// snip
fn at(&mut self, kind: SyntaxKind) -> bool {
self.peek() == Some(kind)
}
// snip
}
We can now update all those assert_eq!
s:
// expr.rs
fn literal(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::Number));
// snip
}
fn variable_ref(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::Ident));
// snip
}
fn prefix_expr(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::Minus));
// snip
}
fn paren_expr(p: &mut Parser) -> CompletedMarker {
assert!(p.at(SyntaxKind::LParen));
// snip
assert!(p.at(SyntaxKind::RParen));
p.bump();
// snip
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
‘Lexeme’ versus ‘token’
I was under the impression that a ‘token’ is an identifier of a given chunk of the input (in our case a SyntaxKind
), and that a ‘lexeme’ is a token plus the text that token applies to. Once again it seems that my assumptions about terminology are incorrect: although this StackOverflow answer agrees with my definition of ‘token’, it explains that the word ‘token’ can also be used to describe what I’ve been calling a ‘lexeme’. My definition of ‘lexeme’ is incorrect; the answer says that ‘lexeme’ refers to the text a token applies to, and that only.
Use your editor’s project-wide search and replace to rename Lexeme
to Token
, lexeme
to token
and 'l
to 't
. Make sure to run cargo fmt
afterwards to maintain formatting.
See the commit where this change is made if you have any trouble.
Removing unneeded fields from Event
When we want to add a token to the current branch, we use this event:
Event::AddToken {
kind: SyntaxKind::Foo,
text: "bar".into(),
}
We get the SyntaxKind
and text of this token from the source, which stores an array of tokens internally. Once parsing is complete, the sink goes through each event and processes it. The sink also stores the same array of tokens.
This means that the the parser doesn’t need to tell the sink the SyntaxKind
and text of each token it’s adding, since the sink already has the data to work this out. Let’s remove the unnecessary fields from Event::AddToken
:
// event.rs
use crate::lexer::SyntaxKind;
#[derive(Debug, Clone, PartialEq)]
pub(super) enum Event {
StartNode {
kind: SyntaxKind,
forward_parent: Option<usize>,
},
AddToken,
FinishNode,
Placeholder,
}
This has reduced the size of Event
from 32 bytes to 24 bytes. Let’s update Parser::bump
to match:
// parser.rs
impl<'t, 'input> Parser<'t, 'input> {
// snip
fn bump(&mut self) {
self.source.next_token().unwrap();
self.events.push(Event::AddToken);
}
// snip
}
We also need to change Sink
:
// sink.rs
use super::event::Event;
use crate::lexer::Token;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::mem;
// snip
impl<'t, 'input> Sink<'t, 'input> {
// snip
pub(super) fn finish(mut self) -> GreenNode {
for idx in 0..self.events.len() {
match mem::replace(&mut self.events[idx], Event::Placeholder) {
// snip
Event::AddToken => self.token(),
// snip
}
self.eat_trivia();
}
self.builder.finish()
}
fn eat_trivia(&mut self) {
while let Some(token) = self.tokens.get(self.cursor) {
if !token.kind.is_trivia() {
break;
}
self.token();
}
}
fn token(&mut self) {
let Token { kind, text } = self.tokens[self.cursor];
self.builder
.token(EldiroLanguage::kind_to_raw(kind), text.into());
self.cursor += 1;
}
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Binary/infix and unary/prefix
I’ve used these terms inconsistently throughout the series, and thought it might be a good time to clean this up.
- Binary operator means ‘an operator with two operands’
- Infix operator means ‘an operator placed between its operands’
- Unary operator means ‘an operator with one operand’
- Prefix operator means ‘an operator placed before its operands’
I’ve used the ‘binary’ terminology in SyntaxKind
, but used ‘infix’ for the type representing these binary operators in expr.rs
. I’ve used the ‘prefix’ terminology in both SyntaxKind
and expr.rs
.
To stay consistent we should name things after the position of the operator, or the number of operands. The parser has to worry about the position, but future parts of Eldiro (such as an interpreter) have to worry about the number of operands. To convey this we’ll rename SyntaxKind::BinaryExpr
to SyntaxKind::InfixExpr
, InfixOp
to BinaryOp
and PrefixOp
to UnaryOp
. Make sure you change the tests too! If you get stuck feel free to look at the relevant commit.