The simple but annoying approach
We could check for the presence of whitespace after every token:
// do some stuff
if p.peek() == Some(SyntaxKind::Whitespace) {
p.bump();
}
// do some more stuff
It gets annoying to type that out each time, so we could define an eat_whitespace
method:
// Only hypothetical; don’t write this!
impl Parser {
pub(crate) fn eat_whitespace(&mut self) {
if self.peek() == Some(SyntaxKind::Whitespace) {
self.bump();
}
}
}
Nevertheless, being forced to write out a p.eat_whitespace()
call every single time we add a token to a branch is annoying.
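To make that concrete, here’s a purely hypothetical sketch of what a single branch of expr.rs could end up looking like, with eat_whitespace sprinkled around every peek and bump:
// Hypothetical; this is the clutter we’re trying to avoid.
p.eat_whitespace();
if p.peek() == Some(SyntaxKind::Number) {
    p.bump();
    p.eat_whitespace();
}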
The complex but invisible approach
Instead of manually handling whitespace, we can create our own abstraction over Rowan in the form of events – every time Parser::start_node is called, for example, we’ll push a variant of an Event enum to a Vec. The same goes for all the other methods on Parser that interact with Rowan – each of them gets its own Event variant. After we’re done with parsing, we can construct a tree by applying these events to a GreenNodeBuilder. Crucially, it is at this step that we locate the whitespace tokens missing from the parser’s events, allowing us to add them to the syntax tree as necessary. This way we can handle whitespace in one place, divorcing it from the parser.
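To get a feel for the idea, here’s a rough, purely illustrative sketch (ignoring checkpoints for the moment) of the events that parsing 1+2 records, using the Event variants we’ll define in a moment:
// Illustration only – not real code:
//
//   StartNode { kind: Root }
//   StartNode { kind: BinaryExpr }
//   AddToken  { kind: Number, text: "1" }
//   AddToken  { kind: Plus,   text: "+" }
//   AddToken  { kind: Number, text: "2" }
//   FinishNode                            // closes BinaryExpr
//   FinishNode                            // closes Root
The sink then walks this list and translates each event into the corresponding GreenNodeBuilder call.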
Events can make our parser faster if we have a previous parse tree lying around; rather than constructing a new parse tree every time, we can patch an existing one with the events of the new parse. Another benefit that events bring is their decoupling of the parser from the syntax tree implementation. This makes both the parser and especially the syntax tree easier to modify in future. It also gives us the opportunity to share the parser between multiple syntax tree implementations.
Making our parser event-based
Let’s start by defining what an event is:
// parser.rs
mod event;
mod expr;
// src/parser/event.rs
#[derive(Debug)]
pub(super) enum Event {}
The only things our parser does that could become events are:
- starting nodes
- starting nodes at a given checkpoint
- adding tokens
- finishing nodes
Let’s add variants to Event
for each of these:
use crate::lexer::SyntaxKind;
use rowan::SmolStr;
#[derive(Debug)]
pub(super) enum Event {
StartNode { kind: SyntaxKind },
StartNodeAt { kind: SyntaxKind, checkpoint: usize },
AddToken { kind: SyntaxKind, text: SmolStr },
FinishNode,
}
Let’s now adjust the helper methods on our parser to add events instead of taking action directly. We won’t need the builder
field on Parser
anymore either:
// parser.rs
mod event;
mod expr;
use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use std::iter::Peekable;
pub struct Parser<'a> {
lexer: Peekable<Lexer<'a>>,
events: Vec<Event>,
}
impl<'a> Parser<'a> {
pub fn new(input: &'a str) -> Self {
Self {
lexer: Lexer::new(input).peekable(),
events: Vec::new(),
}
}
// snip
fn start_node(&mut self, kind: SyntaxKind) {
self.events.push(Event::StartNode { kind });
}
fn start_node_at(&mut self, checkpoint: usize, kind: SyntaxKind) {
self.events.push(Event::StartNodeAt { kind, checkpoint });
}
fn finish_node(&mut self) {
self.events.push(Event::FinishNode);
}
fn bump(&mut self) {
let (kind, text) = self.lexer.next().unwrap();
self.events.push(Event::AddToken {
kind,
text: text.into(),
});
}
fn checkpoint(&self) -> usize {
self.events.len()
}
// snip
}
Now that our parser is constructing events instead of a parse tree, we need a way to get a parse tree out of that Vec<Event>
:
mod event;
mod expr;
mod sink;
use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
use std::iter::Peekable;
// snip
impl<'a> Parser<'a> {
// snip
pub fn parse(mut self) -> Parse {
self.start_node(SyntaxKind::Root);
expr(&mut self);
self.finish_node();
let sink = Sink::new(self.events);
Parse {
green_node: sink.finish(),
}
}
// snip
}
// src/parser/sink.rs
use super::event::Event;
use crate::syntax::EldiroLanguage;
use rowan::{Checkpoint, GreenNode, GreenNodeBuilder, Language};
pub(super) struct Sink {
builder: GreenNodeBuilder<'static>,
events: Vec<Event>,
}
impl Sink {
pub(super) fn new(events: Vec<Event>) -> Self {
Self {
builder: GreenNodeBuilder::new(),
events,
}
}
pub(super) fn finish(mut self) -> GreenNode {
for event in self.events {
match event {
Event::StartNode { kind } => {
self.builder.start_node(EldiroLanguage::kind_to_raw(kind))
}
Event::StartNodeAt { kind, checkpoint } => self
.builder
.start_node_at(Checkpoint(checkpoint), EldiroLanguage::kind_to_raw(kind)),
Event::AddToken { kind, text } => {
self.builder.token(EldiroLanguage::kind_to_raw(kind), text)
}
Event::FinishNode => self.builder.finish_node(),
}
}
self.builder.finish()
}
}
What we’ve done here is recreate what all those helper methods on Parser
did before we modified them to use events; for example, when we encounter the FinishNode
event we call self.builder.finish_node()
, which is what we did earlier in Parser::finish_node
. The code doesn’t compile, though, because the usize that rowan::Checkpoint contains isn’t public, so there’s no way for us to create a Checkpoint ourselves; we’ll have to adjust our code to avoid using GreenNodeBuilder::start_node_at.
Currently we rely on Rowan to wrap all the nodes added since the last call to checkpoint
; these nodes are wrapped in the SyntaxKind
the event has specified. Rather than doing this, we can instead ‘rewrite history’ by converting each StartNodeAt event into a plain StartNode and moving it back to the position in events where checkpoint was called:
// event.rs
#[derive(Debug, Clone)]
pub(super) enum Event {
// snip
}
// sink.rs
use super::event::Event;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};
// snip
impl Sink {
// snip
pub(super) fn finish(mut self) -> GreenNode {
let mut reordered_events = self.events.clone();
for (idx, event) in self.events.into_iter().enumerate() {
if let Event::StartNodeAt { kind, checkpoint } = event {
reordered_events.remove(idx);
reordered_events.insert(checkpoint, Event::StartNode { kind });
}
}
for event in reordered_events {
match event {
Event::StartNode { kind } => {
self.builder.start_node(EldiroLanguage::kind_to_raw(kind))
}
Event::StartNodeAt { .. } => unreachable!(),
Event::AddToken { kind, text } => {
self.builder.token(EldiroLanguage::kind_to_raw(kind), text)
}
Event::FinishNode => self.builder.finish_node(),
}
}
self.builder.finish()
}
}
Note how we modify a clone of the original events
because we can’t mutate something while we’re iterating over it (you can’t obtain a unique borrow to something while a shared borrow is outstanding).
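To make the reordering concrete, here’s a rough illustration for 1+2, assuming (as in the previous part) that expr takes a checkpoint before parsing the left-hand side of a binary expression:
// Events as recorded by the parser (checkpoint() returned 1,
// since only StartNode { Root } had been pushed at that point):
//
//   0: StartNode   { kind: Root }
//   1: AddToken    { kind: Number, text: "1" }
//   2: StartNodeAt { kind: BinaryExpr, checkpoint: 1 }
//   3: AddToken    { kind: Plus, text: "+" }
//   4: AddToken    { kind: Number, text: "2" }
//   5: FinishNode
//   6: FinishNode
//
// After reordering, the StartNodeAt is gone and a plain StartNode
// has been inserted at index 1, where checkpoint() was called:
//
//   0: StartNode { kind: Root }
//   1: StartNode { kind: BinaryExpr }
//   2: AddToken  { kind: Number, text: "1" }
//   3: AddToken  { kind: Plus, text: "+" }
//   4: AddToken  { kind: Number, text: "2" }
//   5: FinishNode
//   6: FinishNode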
Preparation
Let’s now get ready to implement automatic consumption of whitespace. The source of lexemes in our parser is currently a Peekable<Lexer<'_>>
; our sink also needs access to the lexemes so that it can work out where the whitespace missing from events is located. Creating the Peekable<Lexer<'_>>
twice would not be ideal, since it would mean unnecessarily lexing the input twice. Instead, we should lex the input just once and collect the results, giving both the parser and the sink a reference to the resulting Vec<(SyntaxKind, &str)>
.
The first modification we should make is to create a separate parse function that handles lexing, as well as the creation of both the parser and the sink:
// parser.rs
pub fn parse(input: &str) -> Parse {
let lexemes: Vec<_> = Lexer::new(input).collect();
let parser = Parser::new(&lexemes);
let events = parser.parse();
let sink = Sink::new(&lexemes, events);
Parse {
green_node: sink.finish(),
}
}
Now Parser
doesn’t need to be pub
, since it’s abstracted over by parse
:
struct Parser<'a> {
lexer: Peekable<Lexer<'a>>,
events: Vec<Event>,
}
While we’re here we may as well replace lexer
with a lexemes
field and a cursor
to represent what point we’re up to:
struct Parser<'l, 'input> {
lexemes: &'l [(SyntaxKind, &'input str)],
cursor: usize,
events: Vec<Event>,
}
We have to update some of Parser
’s helper functions, as well as its constructor:
impl<'l, 'input> Parser<'l, 'input> {
fn new(lexemes: &'l [(SyntaxKind, &'input str)]) -> Self {
Self {
lexemes,
cursor: 0,
events: Vec::new(),
}
}
fn parse(mut self) -> Vec<Event> {
self.start_node(SyntaxKind::Root);
expr(&mut self);
self.finish_node();
self.events
}
// snip
fn bump(&mut self) {
let (kind, text) = self.lexemes[self.cursor];
self.cursor += 1;
self.events.push(Event::AddToken {
kind,
text: text.into(),
});
}
// snip
fn peek(&self) -> Option<SyntaxKind> {
self.lexemes.get(self.cursor).map(|(kind, _)| *kind)
}
}
Next we have to update Sink
to hold its own copy of the lexemes:
// sink.rs
use super::event::Event;
use crate::lexer::SyntaxKind;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};
pub(super) struct Sink<'l, 'input> {
builder: GreenNodeBuilder<'static>,
lexemes: &'l [(SyntaxKind, &'input str)],
events: Vec<Event>,
}
impl<'l, 'input> Sink<'l, 'input> {
pub(super) fn new(lexemes: &'l [(SyntaxKind, &'input str)], events: Vec<Event>) -> Self {
Self {
builder: GreenNodeBuilder::new(),
lexemes,
events,
}
}
// snip
}
Our CLI isn’t compiling because it refers to Parser
, which we made private earlier. Let’s replace this reference with our new parse
function:
// crates/eldiro-cli/src/main.rs
use eldiro::parser::parse;
use std::io::{self, Write};
fn main() -> io::Result<()> {
let stdin = io::stdin();
let mut stdout = io::stdout();
let mut input = String::new();
loop {
write!(stdout, "→ ")?;
stdout.flush()?;
stdin.read_line(&mut input)?;
let parse = parse(&input);
println!("{}", parse.debug_tree());
input.clear();
}
}
We now have an unused import, std::iter::Peekable, in crates/eldiro/src/parser.rs, since Parser no longer stores a Peekable<Lexer<'_>>; let’s remove it:
mod event;
mod expr;
mod sink;
use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
Finally, the last error remaining is the check
function, which we need to update to use parse
:
#[cfg(test)]
fn check(input: &str, expected_tree: expect_test::Expect) {
let parse = parse(input);
expected_tree.assert_eq(&parse.debug_tree());
}
$ cargo t -q
running 27 tests
...........................
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s abstract away the combination of SyntaxKind
and &str
into its own structure, since the two are frequently being passed together (something called a data clump):
// lexer.rs
impl<'a> Iterator for Lexer<'a> {
type Item = Lexeme<'a>;
fn next(&mut self) -> Option<Self::Item> {
let kind = self.inner.next()?;
let text = self.inner.slice();
Some(Self::Item { kind, text })
}
}
#[derive(Debug, PartialEq)]
pub(crate) struct Lexeme<'a> {
pub(crate) kind: SyntaxKind,
pub(crate) text: &'a str,
}
// snip
#[cfg(test)]
mod tests {
// snip
fn check(input: &str, kind: SyntaxKind) {
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Lexeme { kind, text: input }));
}
// snip
}
// parser.rs
use crate::lexer::{Lexeme, Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
// snip
struct Parser<'l, 'input> {
lexemes: &'l [Lexeme<'input>],
cursor: usize,
events: Vec<Event>,
}
impl<'l, 'input> Parser<'l, 'input> {
fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
Self {
lexemes,
cursor: 0,
events: Vec::new(),
}
}
// snip
fn bump(&mut self) {
let Lexeme { kind, text } = self.lexemes[self.cursor];
self.cursor += 1;
self.events.push(Event::AddToken {
kind,
text: text.into(),
});
}
// snip
fn peek(&self) -> Option<SyntaxKind> {
self.lexemes
.get(self.cursor)
.map(|Lexeme { kind, .. }| *kind)
}
}
// sink.rs
use super::event::Event;
use crate::lexer::Lexeme;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};
pub(super) struct Sink<'l, 'input> {
builder: GreenNodeBuilder<'static>,
lexemes: &'l [Lexeme<'input>],
events: Vec<Event>,
}
impl<'l, 'input> Sink<'l, 'input> {
pub(super) fn new(lexemes: &'l [Lexeme<'input>], events: Vec<Event>) -> Self {
Self {
builder: GreenNodeBuilder::new(),
lexemes,
events,
}
}
// snip
}
Implementation of whitespace-skipping
Now the machinery is in place to have the parser automatically skip whitespace it encounters. Let’s write a test so we know when we get it working:
// parser.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_whitespace() {
check(
"   ",
expect![[r#"
Root@0..3
Whitespace@0..3 "   ""#]],
);
}
}
All the functions that interact with lexemes need to skip past any whitespace they find. Note that we don’t add events for whitespace; instead, we just ignore it and let the sink take care of it. Since skipping whitespace advances the cursor, peek now has to take &mut self:
impl<'l, 'input> Parser<'l, 'input> {
// snip
fn bump(&mut self) {
self.eat_whitespace();
let Lexeme { kind, text } = self.lexemes[self.cursor];
self.cursor += 1;
self.events.push(Event::AddToken {
kind,
text: text.into(),
});
}
// snip
fn peek(&mut self) -> Option<SyntaxKind> {
self.eat_whitespace();
self.peek_raw()
}
fn eat_whitespace(&mut self) {
while self.peek_raw() == Some(SyntaxKind::Whitespace) {
self.cursor += 1;
}
}
fn peek_raw(&self) -> Option<SyntaxKind> {
self.lexemes
.get(self.cursor)
.map(|Lexeme { kind, .. }| *kind)
}
}
To make this work we need to update the sink to automatically add any whitespace it encounters. Let’s start by adding a cursor to the sink:
// sink.rs
pub(super) struct Sink<'l, 'input> {
builder: GreenNodeBuilder<'static>,
lexemes: &'l [Lexeme<'input>],
cursor: usize,
events: Vec<Event>,
}
impl<'l, 'input> Sink<'l, 'input> {
pub(super) fn new(lexemes: &'l [Lexeme<'input>], events: Vec<Event>) -> Self {
Self {
builder: GreenNodeBuilder::new(),
lexemes,
cursor: 0,
events,
}
}
// snip
}
Let’s bump this cursor every time a token is added:
use super::event::Event;
use crate::lexer::{Lexeme, SyntaxKind};
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language, SmolStr};
// snip
impl<'l, 'input> Sink<'l, 'input> {
// snip
pub(super) fn finish(mut self) -> GreenNode {
// snip
for event in reordered_events {
match event {
// snip
Event::AddToken { kind, text } => self.token(kind, text),
// snip
}
}
// snip
}
fn token(&mut self, kind: SyntaxKind, text: SmolStr) {
self.builder.token(EldiroLanguage::kind_to_raw(kind), text);
self.cursor += 1;
}
}
This gives us a compiler error:
error[E0382]: borrow of partially moved value: `self`
--> crates/eldiro/src/parser/sink.rs:39:51
|
26 | for (idx, event) in self.events.into_iter().enumerate() {
| ----------- `self.events` partially moved due to this method call
...
39 | Event::AddToken { kind, text } => self.token(kind, text),
| ^^^^ value borrowed here after partial move
|
note: this function consumes the receiver `self` by taking ownership of it, which moves `self.events`
--> /home/me/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/collect.rs:232:18
|
232 | fn into_iter(self) -> Self::IntoIter;
| ^^^^
= note: partial move occurs because `self.events` has type `Vec<Event>`, which does not implement the `Copy` trait
We can appease the borrow checker by replacing the .into_iter()
call with .iter()
and sprinkling in a few *
s here and there:
impl<'l, 'input> Sink<'l, 'input> {
// snip
pub(super) fn finish(mut self) -> GreenNode {
// snip
for (idx, event) in self.events.iter().enumerate() {
if let Event::StartNodeAt { kind, checkpoint } = event {
reordered_events.remove(idx);
reordered_events.insert(*checkpoint, Event::StartNode { kind: *kind });
}
}
// snip
}
// snip
}
Now that Eldiro is compiling, we can finally get to the interesting part: whitespace skipping. Let’s implement an eat_whitespace
method and call it after processing each event:
impl<'l, 'input> Sink<'l, 'input> {
// snip
pub(super) fn finish(mut self) -> GreenNode {
// snip
for event in reordered_events {
match event {
// snip
}
self.eat_whitespace();
}
// snip
}
fn eat_whitespace(&mut self) {
while let Some(lexeme) = self.lexemes.get(self.cursor) {
if lexeme.kind != SyntaxKind::Whitespace {
break;
}
self.token(lexeme.kind, lexeme.text.into());
}
}
// snip
}
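To see how this makes parse_whitespace pass, here’s a rough trace for the input "   " (three spaces), assuming expr bails out when peek() returns None:
// Hypothetical trace:
//
//   events:  [StartNode { kind: Root }, FinishNode]
//   lexemes: [Lexeme { kind: Whitespace, text: "   " }]
//
// In Sink::finish:
//   StartNode { Root } -> builder.start_node(Root)
//   eat_whitespace     -> builder.token(Whitespace, "   "), cursor becomes 1
//   FinishNode         -> builder.finish_node()
//   eat_whitespace     -> nothing left to add
//
// which is exactly the tree the test expects:
//   Root@0..3
//     Whitespace@0..3 "   "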
$ cargo t -q
running 28 tests
............................
test result: ok. 28 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
We should write some tests for parsing numbers with whitespace on different sides, since the only whitespace-related test we currently have contains nothing but whitespace:
// expr.rs
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_number_preceded_by_whitespace() {
check(
"   9876",
expect![[r#"
Root@0..7
Whitespace@0..3 "   "
Number@3..7 "9876""#]],
);
}
#[test]
fn parse_number_followed_by_whitespace() {
check(
"999   ",
expect![[r#"
Root@0..6
Number@0..3 "999"
Whitespace@3..6 "   ""#]],
);
}
#[test]
fn parse_number_surrounded_by_whitespace() {
check(
" 123     ",
expect![[r#"
Root@0..9
Whitespace@0..1 " "
Number@1..4 "123"
Whitespace@4..9 "     ""#]],
);
}
// snip
}
$ cargo t -q
running 31 tests
...............................
test result: ok. 31 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Let’s write one more test to make sure whitespace in binary expressions is working:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_binary_expression_with_whitespace() {
check(
" 1 +   2* 3 ",
expect![[r#"
Root@0..12
Whitespace@0..1 " "
BinaryExpr@1..12
Number@1..2 "1"
Whitespace@2..3 " "
Plus@3..4 "+"
Whitespace@4..7 "   "
BinaryExpr@7..12
Number@7..8 "2"
Star@8..9 "*"
Whitespace@9..10 " "
Number@10..11 "3"
Whitespace@11..12 " ""#]],
);
}
// snip
}
$ cargo t -q
running 32 tests
................................
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Cool! And there you have it: we’ve managed to implement automatic whitespace skipping without having to touch the actual parsing code in expr.rs, only the parser’s helper methods and the sink.
Refactoring
It’s ugly that the parser has to manage a cursor and manually advance it through the input, skipping whitespace as necessary. Instead, we should extract this into another type; let’s call it Source to stay consistent with Sink:
// parser.rs
mod event;
mod expr;
mod sink;
mod source;
use crate::lexer::{Lexeme, Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
use source::Source;
// src/parser/source.rs
use crate::lexer::Lexeme;
pub(super) struct Source<'l, 'input> {
lexemes: &'l [Lexeme<'input>],
cursor: usize,
}
Let’s update Parser
to use Source
with the API we wish we had; this way we avoid the common pitfall of writing an API and only discovering that it needs to change once we try to use it:
// parser.rs
struct Parser<'l, 'input> {
source: Source<'l, 'input>,
events: Vec<Event>,
}
impl<'l, 'input> Parser<'l, 'input> {
fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
Self {
source: Source::new(lexemes),
events: Vec::new(),
}
}
// snip
fn bump(&mut self) {
let Lexeme { kind, text } = self.source.next_lexeme().unwrap();
self.events.push(Event::AddToken {
kind: *kind,
text: (*text).into(),
});
}
// snip
fn peek(&mut self) -> Option<SyntaxKind> {
self.source.peek_kind()
}
}
We’ve removed the lexemes
and cursor
fields and their references from Parser
. These have been replaced by calls to methods on Source
, plus a few minor changes that these modifications require. Let’s implement the missing methods on Source
:
// source.rs
use crate::lexer::{Lexeme, SyntaxKind};
// snip
impl<'l, 'input> Source<'l, 'input> {
pub(super) fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
Self { lexemes, cursor: 0 }
}
pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
self.eat_whitespace();
let lexeme = self.lexemes.get(self.cursor)?;
self.cursor += 1;
Some(lexeme)
}
pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
self.eat_whitespace();
self.peek_kind_raw()
}
fn eat_whitespace(&mut self) {
while self.peek_kind_raw() == Some(SyntaxKind::Whitespace) {
self.cursor += 1;
}
}
fn peek_kind_raw(&self) -> Option<SyntaxKind> {
self.lexemes
.get(self.cursor)
.map(|Lexeme { kind, .. }| *kind)
}
}
The code here is almost identical to that of Parser
’s helper methods prior to the change we made to them just a moment ago.
$ cargo t -q
running 32 tests
................................
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
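If you want to convince yourself that the whitespace handling survived the move into Source, a quick hypothetical test (not part of the suite above) might look something like this:
// source.rs
#[cfg(test)]
mod tests {
    use super::*;
    use crate::lexer::Lexer;

    #[test]
    fn source_skips_whitespace_when_peeking() {
        // Lex an input that starts with whitespace…
        let lexemes: Vec<_> = Lexer::new("   123").collect();
        let mut source = Source::new(&lexemes);

        // …and check that peek_kind skips straight past it to the number.
        assert_eq!(source.peek_kind(), Some(SyntaxKind::Number));
    }
}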
Thank you for reading to the end! In the next part we’ll add support for comments.