return to main site

Part Thirteen: Whitespace & Events

The simple but annoying approach

We could check for the presence of whitespace after every token:

// do some stuff

if p.peek() == Some(SyntaxKind::Whitespace) {
    p.bump();
}

// do some more stuff

It gets annoying to type that out each time, so we could define a eat_whitespace method:

// Only hypothetical; don’t write this!

impl Parser {
    pub(crate) fn eat_whitespace(&mut self) {
        if p.peek() == Some(SyntaxKind::Whitespace) {
            p.bump();
        }
    }
}

Nevertheless, being forced to write out a p.eat_whitespace() call every single time we add a token to a branch is annoying.

The complex but invisible approach

Instead of manually handling whitespace, we can create our own abstraction over Rowan in the form of events – every time Parser::start_node is called, for example, we’ll push to a Vec of Event enums. The same goes for all the other methods on Parser that interact with Rowan – each of them gets its own Event enum variant. After we’re done with parsing, we can construct a tree by applying these events to a GreenNodeBuilder. Crucially, it is at this step that we locate the whitespace tokens missing from the parser’s events, allowing us to add them to the syntax tree as necessary. This way we can handle whitespace in one place, divorcing it from the parser.

Events can make our parser faster if we have a previous parse tree lying around; rather than constructing a new parse tree every time, we can patch an existing one with the events of the new parse. Another benefit that events bring is their decoupling of the parser from the syntax tree implementation. This makes both the parser and especially the syntax tree easier to modify in future. It also gives us the opportunity to share the parser between multiple syntax tree implementations.

Making our parser event-based

Let’s start by defining what an event is:

// parser.rs

mod event;
mod expr;
// src/parser/event.rs

#[derive(Debug)]
pub(super) enum Event {}

The only things our parser does that could become events are:

Let’s add variants to Event for each of these:

use crate::lexer::SyntaxKind;
use rowan::SmolStr;

#[derive(Debug)]
pub(super) enum Event {
    StartNode { kind: SyntaxKind },
    StartNodeAt { kind: SyntaxKind, checkpoint: usize },
    AddToken { kind: SyntaxKind, text: SmolStr },
    FinishNode,
}

Let’s now adjust the helper methods on our parser to add events instead of taking action directly. We won’t need the builder field on Parser anymore either:

// parser.rs

mod event;
mod expr;

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use std::iter::Peekable;

pub struct Parser<'a> {
    lexer: Peekable<Lexer<'a>>,
    events: Vec<Event>,
}

impl<'a> Parser<'a> {
    pub fn new(input: &'a str) -> Self {
        Self {
            lexer: Lexer::new(input).peekable(),
            events: Vec::new(),
        }
    }

    // snip

    fn start_node(&mut self, kind: SyntaxKind) {
        self.events.push(Event::StartNode { kind });
    }

    fn start_node_at(&mut self, checkpoint: usize, kind: SyntaxKind) {
        self.events.push(Event::StartNodeAt { kind, checkpoint });
    }

    fn finish_node(&mut self) {
        self.events.push(Event::FinishNode);
    }

    fn bump(&mut self) {
        let (kind, text) = self.lexer.next().unwrap();

        self.events.push(Event::AddToken {
            kind,
            text: text.into(),
        });
    }

    fn checkpoint(&self) -> usize {
        self.events.len()
    }

    // snip
}

Now that our parser is constructing events instead of a parse tree, we need a way to get a parse tree out of that Vec<Event>:

mod event;
mod expr;
mod sink;

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
use std::iter::Peekable;

// snip

impl<'a> Parser<'a> {
    // snip

    pub fn parse(mut self) -> Parse {
        self.start_node(SyntaxKind::Root);
        expr(&mut self);
        self.finish_node();

        let sink = Sink::new(self.events);

        Parse {
            green_node: sink.finish(),
        }
    }

    // snip
}
// src/parser/sink.rs

use super::event::Event;
use crate::syntax::EldiroLanguage;
use rowan::{Checkpoint, GreenNode, GreenNodeBuilder, Language};

pub(super) struct Sink {
    builder: GreenNodeBuilder<'static>,
    events: Vec<Event>,
}

impl Sink {
    pub(super) fn new(events: Vec<Event>) -> Self {
        Self {
            builder: GreenNodeBuilder::new(),
            events,
        }
    }

    pub(super) fn finish(mut self) -> GreenNode {
        for event in self.events {
            match event {
                Event::StartNode { kind } => {
                    self.builder.start_node(EldiroLanguage::kind_to_raw(kind))
                }
                Event::StartNodeAt { kind, checkpoint } => self
                    .builder
                    .start_node_at(Checkpoint(checkpoint), EldiroLanguage::kind_to_raw(kind)),
                Event::AddToken { kind, text } => {
                    self.builder.token(EldiroLanguage::kind_to_raw(kind), text)
                }
                Event::FinishNode => self.builder.finish_node(),
            }
        }

        self.builder.finish()
    }
}

What we’ve done here is recreate what all those helper methods on Parser did before we modified them to use events; for example, when we encounter the FinishNode event we call self.builder.finish_node(), which is what we did earlier in Parser::finish_node. The code doesn’t compile, though, because the usize that rowan::Checkpoint contains isn’t public, so we can’t create an instance manually. There isn’t any way for us to create a Checkpoint, which means we’ll have to adjust our code to avoid using GreenNodeBuilder::start_node_at.

Currently we rely on Rowan to wrap all the nodes added since the last call to checkpoint; these nodes are wrapped in the SyntaxKind the event has specified. Rather than doing this, we can instead ‘rewrite history’ by moving the StartNodeAt event to earlier in events where checkpoint was called:

// event.rs

#[derive(Debug, Clone)]
pub(super) enum Event {
    // snip
}
// sink.rs

use super::event::Event;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};

// snip

impl Sink {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        let mut reordered_events = self.events.clone();

        for (idx, event) in self.events.into_iter().enumerate() {
            if let Event::StartNodeAt { kind, checkpoint } = event {
                reordered_events.remove(idx);
                reordered_events.insert(checkpoint, Event::StartNode { kind });
            }
        }

        for event in reordered_events {
            match event {
                Event::StartNode { kind } => {
                    self.builder.start_node(EldiroLanguage::kind_to_raw(kind))
                }
                Event::StartNodeAt { .. } => unreachable!(),
                Event::AddToken { kind, text } => {
                    self.builder.token(EldiroLanguage::kind_to_raw(kind), text)
                }
                Event::FinishNode => self.builder.finish_node(),
            }
        }

        self.builder.finish()
    }
}

Note how we modify a clone of the original events because we can’t mutate something while we’re iterating over it (you can’t obtain a unique borrow to something while a shared borrow is outstanding).

Preparation

Let’s now get ready to implement automatic consumption of whitespace. The source of lexemes in our parser is currently a Peekable<Lexer<'_>>; our sink also needs to have its own copy so that it can reconstruct where the whitespace that is missing from events is located. Creating the Peekable<Lexer<'_>> twice would not be ideal, since it would mean unnecessarily lexing the input twice. Instead, we should lex the input just once and collect the results, giving both the parser and the sink a reference to the resulting Vec<(SyntaxKind, &str)>.

The first modification we should make is to create a separate parse function that handles lexing, as well as the creation of the parser as well as the sink:

// parser.rs

pub fn parse(input: &str) -> Parse {
    let lexemes: Vec<_> = Lexer::new(input).collect();
    let parser = Parser::new(&lexemes);
    let events = parser.parse();
    let sink = Sink::new(&lexemes, events);

    Parse {
        green_node: sink.finish(),
    }
}

Now Parser doesn’t need to be pub, since it’s abstracted over by parse:

struct Parser<'a> {
    lexer: Peekable<Lexer<'a>>,
    events: Vec<Event>,
}

While we’re here we may as well replace lexer with a lexemes field and a cursor to represent what point we’re up to:

struct Parser<'l, 'input> {
    lexemes: &'l [(SyntaxKind, &'input str)],
    cursor: usize,
    events: Vec<Event>,
}

We have to update some of Parser’s helper functions, as well as its constructor:

impl<'l, 'input> Parser<'l, 'input> {
    fn new(lexemes: &'l [(SyntaxKind, &'input str)]) -> Self {
        Self {
            lexemes,
            cursor: 0,
            events: Vec::new(),
        }
    }

    fn parse(mut self) -> Vec<Event> {
        self.start_node(SyntaxKind::Root);
        expr(&mut self);
        self.finish_node();

        self.events
    }

    // snip

    fn bump(&mut self) {
        let (kind, text) = self.lexemes[self.cursor];

        self.cursor += 1;
        self.events.push(Event::AddToken {
            kind,
            text: text.into(),
        });
    }

    // snip

    fn peek(&self) -> Option<SyntaxKind> {
        self.lexemes.get(self.cursor).map(|(kind, _)| *kind)
    }
}

Next we have to update Sink to hold its own copy of the lexemes:

// sink.rs

use super::event::Event;
use crate::lexer::SyntaxKind;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};

pub(super) struct Sink<'l, 'input> {
    builder: GreenNodeBuilder<'static>,
    lexemes: &'l [(SyntaxKind, &'input str)],
    events: Vec<Event>,
}

impl<'l, 'input> Sink<'l, 'input> {
    pub(super) fn new(lexemes: &'l [(SyntaxKind, &'input str)], events: Vec<Event>) -> Self {
        Self {
            builder: GreenNodeBuilder::new(),
            lexemes,
            events,
        }
    }

    // snip
}

Our CLI isn’t compiling because it refers to Parser, which we made private earlier. Let’s replace this reference with our new parse function:

// crates/eldiro-cli/src/main.rs

use eldiro::parser::parse;
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut stdout = io::stdout();

    let mut input = String::new();

    loop {
        write!(stdout, "→ ")?;
        stdout.flush()?;

        stdin.read_line(&mut input)?;

        let parse = parse(&input);
        println!("{}", parse.debug_tree());

        input.clear();
    }
}

We have an unused import in crates/eldiro/src/parser.rs:

mod event;
mod expr;
mod sink;

use crate::lexer::{Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;

Finally, the last error remaining is the check function, which we need to update to use parse:

#[cfg(test)]
fn check(input: &str, expected_tree: expect_test::Expect) {
    let parse = parse(input);
    expected_tree.assert_eq(&parse.debug_tree());
}
$ cargo t -q
running 27 tests
...........................
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s abstract away the combination of SyntaxKind and &str into its own structure, since the two are frequently being passed together (something called a data clump):

// lexer.rs

impl<'a> Iterator for Lexer<'a> {
    type Item = Lexeme<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let kind = self.inner.next()?;
        let text = self.inner.slice();

        Some(Self::Item { kind, text })
    }
}

#[derive(Debug, PartialEq)]
pub(crate) struct Lexeme<'a> {
    pub(crate) kind: SyntaxKind,
    pub(crate) text: &'a str,
}

// snip

#[cfg(test)]
mod tests {
    // snip

    fn check(input: &str, kind: SyntaxKind) {
        let mut lexer = Lexer::new(input);
        assert_eq!(lexer.next(), Some(Lexeme { kind, text: input }));
    }

    // snip
}
// parser.rs

use crate::lexer::{Lexeme, Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;

// snip

struct Parser<'l, 'input> {
    lexemes: &'l [Lexeme<'input>],
    cursor: usize,
    events: Vec<Event>,
}

impl<'l, 'input> Parser<'l, 'input> {
    fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
        Self {
            lexemes,
            cursor: 0,
            events: Vec::new(),
        }
    }

    // snip

    fn bump(&mut self) {
        let Lexeme { kind, text } = self.lexemes[self.cursor];

        self.cursor += 1;
        self.events.push(Event::AddToken {
            kind,
            text: text.into(),
        });
    }

    // snip

    fn peek(&self) -> Option<SyntaxKind> {
        self.lexemes
            .get(self.cursor)
            .map(|Lexeme { kind, .. }| *kind)
    }
}
// sink.rs

use super::event::Event;
use crate::lexer::Lexeme;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};

pub(super) struct Sink<'l, 'input> {
    builder: GreenNodeBuilder<'static>,
    lexemes: &'l [Lexeme<'input>],
    events: Vec<Event>,
}

impl<'l, 'input> Sink<'l, 'input> {
    pub(super) fn new(lexemes: &'l [Lexeme<'input>], events: Vec<Event>) -> Self {
        Self {
            builder: GreenNodeBuilder::new(),
            lexemes,
            events,
        }
    }

    // snip
}

Implementation of whitespace-skipping

Now the machinery is in place to have the parser automatically skip whitespace it encounters. Let’s write a test so we know when we get it working:

// parser.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_whitespace() {
        check(
            "   ",
            expect![[r#"
Root@0..3
  Whitespace@0..3 "   ""#]],
        );
    }
}

All the functions that interact with lexemes need to skip past any whitespace they find. Note that we don’t add events for whitespace; instead, we just ignore it and let the sink take care of it:

impl<'l, 'input> Parser<'l, 'input> {
    // snip

    fn bump(&mut self) {
        self.eat_whitespace();

        let Lexeme { kind, text } = self.lexemes[self.cursor];

        self.cursor += 1;
        self.events.push(Event::AddToken {
            kind,
            text: text.into(),
        });
    }

    // snip

    fn peek(&mut self) -> Option<SyntaxKind> {
        self.eat_whitespace();
        self.peek_raw()
    }

    fn eat_whitespace(&mut self) {
        while self.peek_raw() == Some(SyntaxKind::Whitespace) {
            self.cursor += 1;
        }
    }

    fn peek_raw(&self) -> Option<SyntaxKind> {
        self.lexemes
            .get(self.cursor)
            .map(|Lexeme { kind, .. }| *kind)
    }
}

To make this work we need to update the sink to automatically add any whitespace it encounters. Let’s start by adding a cursor to the sink:

// sink.rs

pub(super) struct Sink<'l, 'input> {
    builder: GreenNodeBuilder<'static>,
    lexemes: &'l [Lexeme<'input>],
    cursor: usize,
    events: Vec<Event>,
}

impl<'l, 'input> Sink<'l, 'input> {
    pub(super) fn new(lexemes: &'l [Lexeme<'input>], events: Vec<Event>) -> Self {
        Self {
            builder: GreenNodeBuilder::new(),
            lexemes,
            cursor: 0,
            events,
        }
    }

    // snip
}

Let’s bump this cursor every time a token is added:

use super::event::Event;
use crate::lexer::{Lexeme, SyntaxKind};
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language, SmolStr};

// snip

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for event in reordered_events {
            match event {
                // snip
                Event::AddToken { kind, text } => self.token(kind, text),
                // snip
            }
        }

        // snip
    }

    fn token(&mut self, kind: SyntaxKind, text: SmolStr) {
        self.builder.token(EldiroLanguage::kind_to_raw(kind), text);
        self.cursor += 1;
    }
}

This gives us a compiler error:

error[E0382]: borrow of partially moved value: `self`
   --> crates/eldiro/src/parser/sink.rs:39:51
    |
26  |         for (idx, event) in self.events.into_iter().enumerate() {
    |                                         ----------- `self.events` partially moved due to this method call
...
39  |                 Event::AddToken { kind, text } => self.token(kind, text),
    |                                                   ^^^^ value borrowed here after partial move
    |
note: this function consumes the receiver `self` by taking ownership of it, which moves `self.events`
   --> /home/me/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/collect.rs:232:18
    |
232 |     fn into_iter(self) -> Self::IntoIter;
    |                  ^^^^
    = note: partial move occurs because `self.events` has type `Vec<Event>`, which does not implement the `Copy` trait

We can appease the borrow checker by replacing the .into_iter() call with .iter() and sprinkling in a few *s here and there:

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for (idx, event) in self.events.iter().enumerate() {
            if let Event::StartNodeAt { kind, checkpoint } = event {
                reordered_events.remove(idx);
                reordered_events.insert(*checkpoint, Event::StartNode { kind: *kind });
            }
        }

        // snip
    }

    // snip
}

Now that Eldiro is compiling, we can finally get to the interesting part: whitespace skipping. Let’s implement an eat_whitespace method and call it after processing each event:

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for event in reordered_events {
            match event {
                // snip
            }

            self.eat_whitespace();
        }

        // snip
    }

    fn eat_whitespace(&mut self) {
        while let Some(lexeme) = self.lexemes.get(self.cursor) {
            if lexeme.kind != SyntaxKind::Whitespace {
                break;
            }

            self.token(lexeme.kind, lexeme.text.into());
        }
    }

    // snip
}
$ cargo t -q
running 28 tests
............................
test result: ok. 28 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

We should write some tests for parsing numbers surrounded by whitespace on different sides, since the only test we currently have related to whitespace only contains whitespace, and nothing else:

// expr.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_number_preceded_by_whitespace() {
        check(
            "   9876",
            expect![[r#"
Root@0..7
  Whitespace@0..3 "   "
  Number@3..7 "9876""#]],
        );
    }

    #[test]
    fn parse_number_followed_by_whitespace() {
        check(
            "999   ",
            expect![[r#"
Root@0..6
  Number@0..3 "999"
  Whitespace@3..6 "   ""#]],
        );
    }

    #[test]
    fn parse_number_surrounded_by_whitespace() {
        check(
            " 123     ",
            expect![[r#"
Root@0..9
  Whitespace@0..1 " "
  Number@1..4 "123"
  Whitespace@4..9 "     ""#]],
        );
    }

    // snip
}
$ cargo t -q
running 31 tests
...............................
test result: ok. 31 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s write one more test to make sure whitespace in binary expressions is working:

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_binary_expression_with_whitespace() {
        check(
            " 1 +   2* 3 ",
            expect![[r#"
Root@0..12
  Whitespace@0..1 " "
  BinaryExpr@1..12
    Number@1..2 "1"
    Whitespace@2..3 " "
    Plus@3..4 "+"
    Whitespace@4..7 "   "
    BinaryExpr@7..12
      Number@7..8 "2"
      Star@8..9 "*"
      Whitespace@9..10 " "
      Number@10..11 "3"
      Whitespace@11..12 " ""#]],
        );
    }

    // snip
}
$ cargo t -q
running 32 tests
................................
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Cool! And there you have it: we’ve managed to implement automatic whitespace skipping without having to touch the actual parser’s code, only the helper methods and sink.

Refactoring

It’s ugly how the parser has to manage a cursor that it has to manually advance through the input, skipping whitespace as necessary. Instead, we should extract this into another type; let’s call it Source to stay consistent with Sink:

// parser.rs

mod event;
mod expr;
mod sink;
mod source;

use crate::lexer::{Lexeme, Lexer, SyntaxKind};
use crate::syntax::SyntaxNode;
use event::Event;
use expr::expr;
use rowan::GreenNode;
use sink::Sink;
use source::Source;
// src/parser/source.rs

use crate::lexer::Lexeme;

pub(super) struct Source<'l, 'input> {
    lexemes: &'l [Lexeme<'input>],
    cursor: usize,
}

Let’s update Parser to use Source with the API we wish we had; this way we avoid the common pitfall of writing an API, trying to use it, only to find that it needs to be changed:

// parser.rs

struct Parser<'l, 'input> {
    source: Source<'l, 'input>,
    events: Vec<Event>,
}

impl<'l, 'input> Parser<'l, 'input> {
    fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
        Self {
            source: Source::new(lexemes),
            events: Vec::new(),
        }
    }

    // snip

    fn bump(&mut self) {
        let Lexeme { kind, text } = self.source.next_lexeme().unwrap();

        self.events.push(Event::AddToken {
            kind: *kind,
            text: (*text).into(),
        });
    }

    // snip

    fn peek(&mut self) -> Option<SyntaxKind> {
        self.source.peek_kind()
    }
}

We’ve removed the lexemes and cursor fields and their references from Parser. These have been replaced by calls to methods on Source, plus a few minor changes that these modifications require. Let’s implement the missing methods on Source:

// source.rs

use crate::lexer::{Lexeme, SyntaxKind};

// snip

impl<'l, 'input> Source<'l, 'input> {
    pub(super) fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
        Self { lexemes, cursor: 0 }
    }

    pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
        self.eat_whitespace();

        let lexeme = self.lexemes.get(self.cursor)?;
        self.cursor += 1;

        Some(lexeme)
    }

    pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_whitespace();
        self.peek_kind_raw()
    }

    fn eat_whitespace(&mut self) {
        while self.peek_kind_raw() == Some(SyntaxKind::Whitespace) {
            self.cursor += 1;
        }
    }

    fn peek_kind_raw(&self) -> Option<SyntaxKind> {
        self.lexemes
            .get(self.cursor)
            .map(|Lexeme { kind, .. }| *kind)
    }
}

The code here is almost identical to that of Parser’s helper methods prior to the change we made to them just a moment ago.

$ cargo t -q
running 32 tests
................................
test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Thank you for reading to the end! In the next part we’ll add support for comments.