Part Sixteen: Refactoring

Refactoring expr_binding_power

The right-hand side of the let lhs = ... in expr_binding_power is getting quite long, so let’s extract it to its own function:

// expr.rs

use super::marker::CompletedMarker;
use super::Parser;
use crate::lexer::SyntaxKind;

// snip

fn expr_binding_power(p: &mut Parser, minimum_binding_power: u8) {
    let mut lhs = if let Some(lhs) = lhs(p) {
        lhs
    } else {
        return; // we’ll handle errors later.
    };

    // snip
}

fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
    let cm = match p.peek() {
        Some(SyntaxKind::Number) => {
            // snip
        }
        // all the other arms are also here, unchanged
        _ => return None,
    };

    Some(cm)
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s also extract the parsers for each of the match arms in lhs:

fn lhs(p: &mut Parser) -> Option<CompletedMarker> {
    let cm = match p.peek() {
        Some(SyntaxKind::Number) => literal(p),
        Some(SyntaxKind::Ident) => variable_ref(p),
        Some(SyntaxKind::Minus) => prefix_expr(p),
        Some(SyntaxKind::LParen) => paren_expr(p),
        _ => return None,
    };

    Some(cm)
}

// snip

fn literal(p: &mut Parser) -> CompletedMarker {
    assert_eq!(p.peek(), Some(SyntaxKind::Number));

    let m = p.start();
    p.bump();
    m.complete(p, SyntaxKind::Literal)
}

fn variable_ref(p: &mut Parser) -> CompletedMarker {
    assert_eq!(p.peek(), Some(SyntaxKind::Ident));

    let m = p.start();
    p.bump();
    m.complete(p, SyntaxKind::VariableRef)
}

fn prefix_expr(p: &mut Parser) -> CompletedMarker {
    assert_eq!(p.peek(), Some(SyntaxKind::Minus));

    let m = p.start();

    let op = PrefixOp::Neg;
    let ((), right_binding_power) = op.binding_power();

    // Eat the operator’s token.
    p.bump();

    expr_binding_power(p, right_binding_power);

    m.complete(p, SyntaxKind::PrefixExpr)
}

fn paren_expr(p: &mut Parser) -> CompletedMarker {
    assert_eq!(p.peek(), Some(SyntaxKind::LParen));

    let m = p.start();

    p.bump();
    expr_binding_power(p, 0);

    assert_eq!(p.peek(), Some(SyntaxKind::RParen));
    p.bump();

    m.complete(p, SyntaxKind::ParenExpr)
}

Writing out that assert_eq! at the start of each subparser is getting annoying. Let’s write a helper method on Parser to make this easier:

// parser.rs

impl<'l, 'input> Parser<'l, 'input> {
    // snip

    fn at(&mut self, kind: SyntaxKind) -> bool {
        self.peek() == Some(kind)
    }

    // snip
}

We can now update all those assert_eq!s:

// expr.rs

fn literal(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::Number));

    // snip
}

fn variable_ref(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::Ident));

    // snip
}

fn prefix_expr(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::Minus));

    // snip
}

fn paren_expr(p: &mut Parser) -> CompletedMarker {
    assert!(p.at(SyntaxKind::LParen));

    // snip

    assert!(p.at(SyntaxKind::RParen));
    p.bump();

    // snip
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

‘Lexeme’ versus ‘token’

I was under the impression that a ‘token’ is an identifier of a given chunk of the input (in our case a SyntaxKind), and that a ‘lexeme’ is a token plus the text that token applies to. Once again it seems my assumptions about terminology are incorrect: although this StackOverflow answer agrees with my definition of ‘token’, it explains that ‘token’ can also be used to describe what I’ve been calling a ‘lexeme’. My definition of ‘lexeme’ is wrong too; the answer says that ‘lexeme’ refers only to the text a token applies to.

Use your editor’s project-wide search and replace to rename Lexeme to Token, lexeme to token, and 'l to 't. Make sure to run cargo fmt afterwards to maintain formatting.
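After the rename, the type in lexer.rs should end up looking something like this (a sketch from memory — your derives and field visibility may differ):

// lexer.rs

#[derive(Debug, Copy, Clone, PartialEq)]
pub(crate) struct Token<'input> {
    pub(crate) kind: SyntaxKind,
    pub(crate) text: &'input str,
}

The Copy derive matters, by the way: later in this part, Sink::token destructures a Token out of self.tokens[self.cursor] by value, which only compiles because Token is Copy.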

See the commit where this change is made if you have any trouble.

Removing unneeded fields from Event

When we want to add a token to the current branch, we use this event:

Event::AddToken {
    kind: SyntaxKind::Foo,
    text: "bar".into(),
}

We get the SyntaxKind and text of this token from the source, which stores an array of tokens internally. Once parsing is complete, the sink goes through each event and processes it. The sink also stores the same array of tokens.
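Concretely, both ends of the pipeline hold a reference to the same slice of tokens. Here’s a rough sketch of the two structs as they stand (reproduced from memory, so surrounding details may differ in your code):

// source.rs

pub(super) struct Source<'t, 'input> {
    tokens: &'t [Token<'input>],
    cursor: usize,
}

// sink.rs

pub(super) struct Sink<'t, 'input> {
    builder: GreenNodeBuilder<'static>,
    tokens: &'t [Token<'input>],
    cursor: usize,
    events: Vec<Event>,
}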

This means that the parser doesn’t need to tell the sink the SyntaxKind and text of each token it’s adding, since the sink already has the data to work this out. Let’s remove the unnecessary fields from Event::AddToken:

// event.rs

use crate::lexer::SyntaxKind;

#[derive(Debug, Clone, PartialEq)]
pub(super) enum Event {
    StartNode {
        kind: SyntaxKind,
        forward_parent: Option<usize>,
    },
    AddToken,
    FinishNode,
    Placeholder,
}

This has reduced the size of Event from 32 bytes to 24 bytes.
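If you want to verify those sizes yourself, a throwaway test does the job; note that this test isn’t part of the series’ codebase, and the exact figure assumes a typical 64-bit target:

// event.rs

#[cfg(test)]
mod tests {
    use super::*;
    use std::mem;

    #[test]
    fn event_is_24_bytes() {
        assert_eq!(mem::size_of::<Event>(), 24);
    }
}

Let’s update Parser::bump to match the new Event definition: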

// parser.rs

impl<'t, 'input> Parser<'t, 'input> {
    // snip

    fn bump(&mut self) {
        self.source.next_token().unwrap();
        self.events.push(Event::AddToken);
    }

    // snip
}
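We still call next_token even though we now ignore its return value: the call is what advances the source past the token (and any leading trivia). For reference, next_token looks roughly like this (unchanged in this part; reproduced from memory):

// source.rs

impl<'t, 'input> Source<'t, 'input> {
    pub(super) fn next_token(&mut self) -> Option<&'t Token<'input>> {
        self.eat_trivia();

        let token = self.tokens.get(self.cursor)?;
        self.cursor += 1;

        Some(token)
    }

    // snip
}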

We also need to change Sink:

// sink.rs

use super::event::Event;
use crate::lexer::Token;
use crate::syntax::EldiroLanguage;
use rowan::{GreenNode, GreenNodeBuilder, Language};
use std::mem;

// snip

impl<'t, 'input> Sink<'t, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        for idx in 0..self.events.len() {
            match mem::replace(&mut self.events[idx], Event::Placeholder) {
                // snip
                Event::AddToken => self.token(),
                // snip
            }

            self.eat_trivia();
        }

        self.builder.finish()
    }

    fn eat_trivia(&mut self) {
        while let Some(token) = self.tokens.get(self.cursor) {
            if !token.kind.is_trivia() {
                break;
            }

            self.token();
        }
    }

    fn token(&mut self) {
        let Token { kind, text } = self.tokens[self.cursor];

        self.builder
            .token(EldiroLanguage::kind_to_raw(kind), text.into());

        self.cursor += 1;
    }
}
$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Binary/infix and unary/prefix

I’ve used these terms inconsistently throughout the series, and thought it might be a good time to clean this up.

I’ve used the ‘binary’ terminology in SyntaxKind, but used ‘infix’ for the type representing these binary operators in expr.rs. I’ve used the ‘prefix’ terminology in both SyntaxKind and expr.rs.

To stay consistent, each name should reflect either the position of the operator or the number of operands. The parser has to worry about the position, but future parts of Eldiro (such as an interpreter) have to worry about the number of operands. To convey this we’ll rename SyntaxKind::BinaryExpr to SyntaxKind::InfixExpr, InfixOp to BinaryOp, and PrefixOp to UnaryOp. Make sure you change the tests too! If you get stuck, feel free to look at the relevant commit.
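After the rename, the two operator enums in expr.rs end up looking roughly like this (a sketch — only the names change in this part, and the binding powers shown are the ones from earlier parts, so adjust if yours differ):

// expr.rs

enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

impl BinaryOp {
    fn binding_power(&self) -> (u8, u8) {
        match self {
            Self::Add | Self::Sub => (1, 2),
            Self::Mul | Self::Div => (3, 4),
        }
    }
}

enum UnaryOp {
    Neg,
}

impl UnaryOp {
    fn binding_power(&self) -> ((), u8) {
        match self {
            Self::Neg => ((), 5),
        }
    }
}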