So far we have only used two different data structures to represent code:
- abstract syntax trees
- concrete syntax trees
An abstract syntax tree is a data structure that represents the bare minimum about a chunk of code. We used ASTs in the early parts of this series prior to the Rowan rewrite. For example, here’s an AST that represents variable definitions:
struct VariableDef {
name: String,
value: Expr,
}
enum Expr {
// snip
}
Note how it doesn’t include the let
token or the =
, and ignores trivia like whitespace and comments; it’s abstract, and only contains the information necessary to implement, say, a compiler. Although this makes them convenient to work with, it also prevents us from implementing some IDE features on top of them.
Automatic refactorings are a good example. How are meant to implement a feature that relies on knowing and being able to modify source code when we don’t know the exact text of the code being analysed?
Concrete syntax trees, on the other hand, include the text of the source code in full. We’ve used CSTs whenever we’ve used Rowan, as the trees it produces can reproduce the text they were created from. Here’s a CST (in this case from Rowan) that represents a binary expression:
Root@0..5
InfixExpr@0..5
Literal@0..2
Number@0..1 "1"
Whitespace@1..2 " "
Plus@2..3 "+"
Whitespace@3..4 " "
Literal@4..5
Number@4..5 "1"
It includes the entire input, whitespace and all. An interesting feature of Rowan’s CSTs is that they are untyped, meaning that nodes of differing kinds (e.g. VariableDef
and InfixExpr
) have the same type – SyntaxNode
. Although this means we can implement features that don’t depend on node kinds more easily than we could otherwise,1 it also means that more complex analysis on Rowan CSTs is painful, since we need to call .kind()
whenever we want to know a node’s kind. We wouldn’t have to do this if nodes of different kinds had different types.
Programming language implementations that use Rowan usually include a typed layer on top of the CST for analysis purposes. Confusingly, this layer is called an AST in spite of its knowledge of the full input text, since it uses an unmodified Rowan CST under the hood. Here’s an example of what one for binary expressions might look like:
struct BinaryExpr(SyntaxNode);
impl BinaryExpr {
fn lhs(&self) -> Option<Expr> {
// do stuff to extract the left-hand side
}
fn rhs(&self) -> Option<Expr> {
// do stuff to extract the right-hand side
}
fn op(&self) -> Option<BinaryOp> {
// do stuff to extract the operator
}
}
Note how all these methods return Option
s – this is because the parser supports error recovery and allows incomplete trees; as such, these methods cannot be sure that the thing they are trying to extract is actually present.
Note also how the API provided by this is similar to the simple struct
from earlier, with the distinction that components are accessed through methods that return Option
, rather than non-optional fields.
Finally, programming language implementations with Rowan usually include one more, higher-level representation: the high-level intermediate representation, or HIR. These are equivalent to the classical ASTs that we used before the rewrite. For comparison, here’s what a HIR data structure that represents binary expressions could look like:
struct BinaryExpr {
lhs: Option<Expr>,
rhs: Option<Expr>,
op: Option<BinaryOp>,
}
HIRs can make it easier to implement the parts of a programming language that don’t have to worry about original source code text, because this very property – the ability to differ from the original input – allows for the use of lowering.
Lowering is the process of changing syntactic sugar, or shorthand, into its lower-level form. For example, a while
loop is syntactic sugar for a loop
with an if
statement:
while !should_exit() {
handle_connection();
}
// is sugar for
loop {
if should_exit() {
break;
}
handle_connection();
}
Imagine you’re implementing a Rust type checker – if you don’t lower while
loops, you need to worry about type-checking both loop
statements and while
loops. However, if you lower while
loops into loop
statements, you’d only have to implement type-checking for loop
. The same principle can be applied to lots of other components of programming language tooling.
Lowering isn’t possible with a Rowan AST, as it has to represent the input text losslessly. It is possible with a HIR, though, since they don’t have to correspond with the input text.
Eldiro will use three different code representations:
- CST (Rowan), constructed during parsing
- AST (typed layer on top of untyped CST) for anything that needs to deal with source code directly (e.g. refactorings)
- HIR (abstract, lowered representation) for everything else
Implementing an AST
Although we could stick this in an existing crate, it doesn’t really belong in any of them. Let’s create a new one:
$ cargo new --lib crates/ast
Created library `crates/ast` package
This crate will need to use SyntaxNode
from the syntax
crate, so we’ll add that as a dependency:
# crates/ast/Cargo.toml
[dependencies]
syntax = {path = "../syntax"}
Let’s start by writing the skeleton of a variable definition’s AST node:
// lib.rs
use syntax::SyntaxNode;
#[derive(Debug)]
pub struct VariableDef(SyntaxNode);
impl VariableDef {
pub fn name(&self) -> Option<?> {
todo!()
}
pub fn value(&self) -> Option<?> {
todo!()
}
}
Let’s look at a typical VariableDef
CST to remind us of its structure:
Root@0..13
VariableDef@0..13
LetKw@0..3 "let"
Whitespace@3..4 " "
Ident@4..7 "foo"
Whitespace@7..8 " "
Equals@8..9 "="
Whitespace@9..10 " "
VariableRef@10..13
Ident@10..13 "bar"
We would want VariableDef::name
to return that Ident
, a token, not a node, and VariableDef::value
should return the VariableRef
node. We have a type for nodes from the Rowan syntax tree – SyntaxNode
. We don’t have a type for tokens from the Rowan CST though; let’s create that now:
// crates/syntax/src/lib.rs
pub type SyntaxNode = rowan::SyntaxNode<EldiroLanguage>;
pub type SyntaxToken = rowan::SyntaxToken<EldiroLanguage>;
We can now fill in those missing types from the AST:
// crates/ast/src/lib.rs
use syntax::{SyntaxNode, SyntaxToken};
// snip
impl VariableDef {
pub fn name(&self) -> Option<SyntaxToken> {
todo!()
}
pub fn value(&self) -> Option<SyntaxNode> {
todo!()
}
}
To determine a VariableDef
’s name, we would want to iterate over its child tokens, stoping at the first one whose kind is Ident
. Rowan doesn’t give us a way to iterate over only the child tokens of a node, so we’ll iterate over the tokens and nodes (known to Rowan as SyntaxElement
s), filtering out the nodes:
use syntax::{SyntaxKind, SyntaxNode, SyntaxToken};
// snip
impl VariableDef {
pub fn name(&self) -> Option<SyntaxToken> {
self.0
.children_with_tokens()
.filter_map(SyntaxElement::into_token)
.find(|token| token.kind() == SyntaxKind::Ident)
}
// snip
}
We need to define our own version of SyntaxElement
:
// crates/syntax/src/lib.rs
pub type SyntaxNode = rowan::SyntaxNode<EldiroLanguage>;
pub type SyntaxToken = rowan::SyntaxToken<EldiroLanguage>;
pub type SyntaxElement = rowan::SyntaxElement<EldiroLanguage>;
And import it from ast
:
// crates/ast/src/lib.rs
use syntax::{SyntaxElement, SyntaxKind, SyntaxNode, SyntaxToken};
Let’s define VariableDef::value
by finding the first child node that is an expression. For this we can use the non-existent Expr::cast
function:
impl VariableDef {
// snip
pub fn value(&self) -> Option<Expr> {
self.0.children().find_map(Expr::cast)
}
}
To continue we have to define Expr
:
#[derive(Debug)]
pub enum Expr {
BinaryExpr(BinaryExpr),
Literal(Literal),
ParenExpr(ParenExpr),
UnaryExpr(UnaryExpr),
VariableRef(VariableRef),
}
Let’s define each of its variant types:
#[derive(Debug)]
pub struct BinaryExpr(SyntaxNode);
#[derive(Debug)]
pub struct Literal(SyntaxNode);
#[derive(Debug)]
pub struct ParenExpr(SyntaxNode);
#[derive(Debug)]
pub struct UnaryExpr(SyntaxNode);
#[derive(Debug)]
pub struct VariableRef(SyntaxNode);
The only remaining compile error is Expr::cast
, which we can define:
impl Expr {
pub fn cast(node: SyntaxNode) -> Option<Self> {
let result = match node.kind() {
SyntaxKind::InfixExpr => Self::BinaryExpr(BinaryExpr(node)),
SyntaxKind::Literal => Self::Literal(Literal(node)),
SyntaxKind::ParenExpr => Self::ParenExpr(ParenExpr(node)),
SyntaxKind::PrefixExpr => Self::UnaryExpr(UnaryExpr(node)),
SyntaxKind::VariableRef => Self::VariableRef(VariableRef(node)),
_ => return None,
};
Some(result)
}
}
We’re still missing an AST node, namely Stmt
. Its definition is similar to Expr
’s:
#[derive(Debug)]
pub enum Stmt {
VariableDef(VariableDef),
Expr(Expr),
}
impl Stmt {
pub fn cast(node: SyntaxNode) -> Option<Self> {
let result = match node.kind() {
SyntaxKind::VariableDef => Self::VariableDef(VariableDef(node)),
_ => Self::Expr(Expr::cast(node)?),
};
Some(result)
}
}
Next, let’s add methods to BinaryExpr
to extract the left-hand side, right-hand side and operator:
impl BinaryExpr {
pub fn lhs(&self) -> Option<Expr> {
self.0.children().find_map(Expr::cast)
}
pub fn rhs(&self) -> Option<Expr> {
self.0.children().filter_map(Expr::cast).nth(1)
}
pub fn op(&self) -> Option<SyntaxToken> {
self.0
.children_with_tokens()
.filter_map(SyntaxElement::into_token)
.find(|token| {
matches!(
token.kind(),
SyntaxKind::Plus | SyntaxKind::Minus | SyntaxKind::Star | SyntaxKind::Slash,
)
})
}
}
UnaryExpr
also needs equivalents of these:
impl UnaryExpr {
pub fn expr(&self) -> Option<Expr> {
self.0.children().find_map(Expr::cast)
}
pub fn op(&self) -> Option<SyntaxToken> {
self.0
.children_with_tokens()
.filter_map(SyntaxElement::into_token)
.find(|token| token.kind() == SyntaxKind::Minus)
}
}
And ParenExpr
could use a method that extracts the contained expression:
impl ParenExpr {
pub fn expr(&self) -> Option<Expr> {
self.0.children().find_map(Expr::cast)
}
}
VariableRef
and Literal
too:
use smol_str::SmolStr;
use syntax::{SyntaxElement, SyntaxKind, SyntaxNode, SyntaxToken};
impl Literal {
pub fn parse(&self) -> u64 {
self.0.first_token().unwrap().text().parse().unwrap()
}
}
// snip
impl VariableRef {
pub fn name(&self) -> SmolStr {
self.0.first_token().unwrap().text().clone()
}
}
Let’s add the missing smol_str
dependency:
# Cargo.toml
[dependencies]
smol_str = "0.1.17"
syntax = {path = "../syntax"}
Finally, we need to define the root of the AST:
#[derive(Debug)]
pub struct Root(SyntaxNode);
impl Root {
pub fn cast(node: SyntaxNode) -> Option<Self> {
if node.kind() == SyntaxKind::Root {
Some(Self(node))
} else {
None
}
}
}
For any other code to be able to interact with the contents of Root
, it needs to have methods that make its contents accessible. Since the parser currently consumes multiple statements, so we can make those available:
impl Root {
// snip
pub fn stmts(&self) -> impl Iterator<Item = Stmt> {
self.0.children().filter_map(Stmt::cast)
}
}
Let’s try this code out by printing the value of any variable definition we might see in the syntax tree when in the REPL:
// crates/parser/src/lib.rs
impl Parse {
pub fn debug_tree(&self) -> String {
let mut s = String::new();
let tree = format!("{:#?}", self.syntax());
// snip
}
pub fn syntax(&self) -> SyntaxNode {
SyntaxNode::new_root(self.green_node.clone())
}
}
// crates/eldiro/src/main.rs
fn main() -> io::Result<()> {
// snip
loop {
// snip
let parse = parse(&input);
println!("{}", parse.debug_tree());
let root = ast::Root::cast(parse.syntax()).unwrap();
dbg!(root
.stmts()
.filter_map(|stmt| if let ast::Stmt::VariableDef(var_def) = stmt {
Some(var_def.value())
} else {
None
})
.collect::<Vec<_>>());
// snip
}
}
The REPL now has to depend on ast
:
# Cargo.toml
[dependencies]
ast = {path = "../ast"}
parser = {path = "../parser"}
We can now run the REPL:
$ cargo r -q
→ let one = 1 let two = one + one
# snip
[
Some(
Literal(
Literal(
Literal@10..12
Number@10..11 "1"
Whitespace@11..12 " "
,
),
),
),
Some(
BinaryExpr(
BinaryExpr(
InfixExpr@22..32
VariableRef@22..26
Ident@22..25 "one"
Whitespace@25..26 " "
Plus@26..27 "+"
Whitespace@27..28 " "
VariableRef@28..32
Ident@28..31 "one"
Whitespace@31..32 "\n"
,
),
),
),
]
It’s extracted that 1
literal from the first variable, as well as the binary expression from the second variable.
Note how each of the two expressions our REPL is printing out is wrapped in a Some
– this is because it’s possible for a variable definition to not have a value. Although this isn’t allowed by the language, it still happens because the parser recovers from the error and marches on. Let’s see if we get a None
when we type in a variable definition that’s missing a value:
→ let a = let b = 1 let c =
# snip
[
None,
Some(
Literal(
Literal(
Literal@16..18
Number@16..17 "1"
Whitespace@17..18 " "
,
),
),
),
None,
]
As expected, since the first and last variable definitions are missing values, we get None
, Some
and None
.
Implementing a HIR
Now that we have completed the implementation of a lossless AST, we can move on to implementing a HIR. In case you’ve forgotten, a HIR is a lowered, abstract representation of code that doesn’t include syntactic details and doesn’t have to match the input text exactly. As with the AST, we’ll start by letting Cargo generate a new crate for us:
$ cargo new --lib crates/hir
Created library `crates/hir` package
Let’s start with statements and variable definitions:
// crates/hir/src/lib.rs
use smol_str::SmolStr;
#[derive(Debug)]
pub enum Stmt {
VariableDef { name: SmolStr, value: Expr },
Expr(Expr),
}
# Cargo.toml
[dependencies]
smol_str = "0.1.17"
Next we’ll define Expr
:
#[derive(Debug)]
pub enum Expr {
Binary { op: BinaryOp, lhs: Self, rhs: Self },
Literal { n: u64 },
Unary { op: UnaryOp, expr: Self },
VariableRef { var: SmolStr },
}
#[derive(Debug)]
pub enum BinaryOp {
Add,
Sub,
Mul,
Div,
}
#[derive(Debug)]
pub enum UnaryOp {
Neg,
}
We now have the same problem as we had in Part Eight; namely, we have a type that contains itself, and thus has an infinite size. We can fix this by wrapping all recursive usages of Expr
in Box
, which has a fixed size. This eliminates the recursion:
#[derive(Debug)]
pub enum Expr {
Binary {
op: BinaryOp,
lhs: Box<Self>,
rhs: Box<Self>,
},
Literal {
n: u64,
},
Unary {
op: UnaryOp,
expr: Box<Self>,
},
VariableRef {
var: SmolStr,
},
}
This HIR is missing a critical component of our Rowan-based AST: the ability to represent incorrect parser inputs. We could represent the possibility of, say, a binary expression missing its right-hand side by wrapping rhs
in an Option
. We’d have to do this at every usage of Expr
, though. Instead, we’ll add a Missing
variant to Expr
, so all usages of Expr
are effectively optional:
#[derive(Debug)]
pub enum Expr {
Missing,
// snip
}
There’s only one other part of the HIR that can be missing from the AST: VariableDef
’s name
field. A VariableDef
node can be missing a name when the user has only typed the let
keyword, since let
is enough to trigger the creation of a VariableDef
. In spite of this, we shouldn’t make name
optional, because the HIR is used for features like type inference and IDE indexing that rely on the presence of names. If we made names optional, we’d have to handle the case where they don’t exist again and again.
Now that the HIR’s basic data structures are in place, we can implement lowering:
# Cargo.toml
[dependencies]
ast = {path = "../ast"}
smol_str = "0.1.17"
syntax = {path = "../syntax"}
// lib.rs
use smol_str::SmolStr;
use syntax::SyntaxKind;
// snip
impl Stmt {
fn lower(ast: ast::Stmt) -> Option<Self> {
let result = match ast {
ast::Stmt::VariableDef(ast) => Self::VariableDef {
name: ast.name()?.text().clone(),
value: Expr::lower(ast.value()),
},
ast::Stmt::Expr(ast) => Self::Expr(Expr::lower(Some(ast))),
};
Some(result)
}
}
// snip
impl Expr {
fn lower(ast: Option<ast::Expr>) -> Self {
if let Some(ast) = ast {
match ast {
ast::Expr::BinaryExpr(ast) => Self::lower_binary(ast),
ast::Expr::Literal(ast) => Self::Literal { n: ast.parse() },
ast::Expr::ParenExpr(ast) => Expr::lower(ast.expr()),
ast::Expr::UnaryExpr(ast) => Self::lower_unary(ast),
ast::Expr::VariableRef(ast) => Self::VariableRef { var: ast.name() },
}
} else {
Self::Missing
}
}
fn lower_binary(ast: ast::BinaryExpr) -> Self {
let op = match ast.op().unwrap().kind() {
SyntaxKind::Plus => BinaryOp::Add,
SyntaxKind::Minus => BinaryOp::Sub,
SyntaxKind::Star => BinaryOp::Mul,
SyntaxKind::Slash => BinaryOp::Div,
_ => unreachable!(),
};
Self::Binary {
op,
lhs: Box::new(Expr::lower(ast.lhs())),
rhs: Box::new(Expr::lower(ast.rhs())),
}
}
fn lower_unary(ast: ast::UnaryExpr) -> Self {
let op = match ast.op().unwrap().kind() {
SyntaxKind::Minus => UnaryOp::Neg,
_ => unreachable!(),
};
Self::Unary {
op,
expr: Box::new(Expr::lower(ast.expr())),
}
}
}
All that’s left to do for lowering is to write a lower
function that takes the root of an AST, and lowers it to a Vec<Stmt>
:
pub fn lower(ast: ast::Root) -> impl Iterator<Item = Stmt> {
ast.stmts().filter_map(Stmt::lower)
}
Let’s try lowering out by making the REPL print all HIR elements it can find in the input:
# crates/eldiro/Cargo.toml
[dependencies]
ast = {path = "../ast"}
hir = {path = "../hir"}
parser = {path = "../parser"}
// crates/eldiro/src/main.rs
fn main() -> io::Result<()> {
// snip
loop {
// snip
let root = ast::Root::cast(parse.syntax()).unwrap();
dbg!(/* snip */);
dbg!(hir::lower(root).collect::<Vec<_>>());
// snip
}
}
$ cargo r -q
→ let a = 10
# snip
[
VariableDef {
name: "a",
value: Literal {
n: 10,
},
},
]
→ 5 + a * 100
# snip
[
Expr(
Binary {
op: Add,
lhs: Literal {
n: 5,
},
rhs: Binary {
op: Mul,
lhs: VariableRef {
var: "a",
},
rhs: Literal {
n: 100,
},
},
},
),
]
Migrating to an arena
We currently use Box
to add indirection where Expr
recurses. Let’s take a look at the layout of Expr
in memory:
Each of those arrows is a pointer to another location in memory, one that could be far away; there is a significant chance that the HIR elements are spread out across the heap. This is bad for performance, because it means the CPU can’t keep them all in the cache at the same time.
We can instead store each Expr
in a Vec
, and index into this Vec
instead of using Box
es to reference the subtrees. Here’s a visualisation of the same Expr
’s memory layout with this setup:
Since Vec
’s storage is contiguous, we know that each Expr
will be right next to all the others, encouraging caching and increasing performance. A secondary benefit of this approach is that the indices into the Vec
don’t necessarily have to be usize
s – we could use u32
s instead to save memory, since it’s unlikely we’ll have more than 232 – 1 = four billion Expr
subtrees.
A Vec
gives us too much control over the subtrees; all we want is to allocate another subtree, and get its index. Luckily, there is a data structure that does this job: the arena. There are lots of arena implementations in Rust. Each one that I have seen has more complexity than the basic Vec
-like structure I described earlier. In some cases this is to allow the allocation of more than usize::MAX
elements; in other cases the complexity allows removing elements one by one without invalidating existing indexes; in others still, it allows customisation of the index type (e.g. usize
).
Let’s write our own, as-simple-as-possible arena implementation. We’ll start with a new crate:
$ cargo new --lib crates/arena
Created library `crates/arena` package
// crates/arena/src/lib.rs
#[derive(Debug)]
pub struct Arena<T> {
data: Vec<T>,
}
We need methods to create new arenas and to allocate values on the arena:
impl<T> Default for Arena<T> {
fn default() -> Self {
Self::new()
}
}
impl<T> Arena<T> {
pub fn new() -> Self {
Self { data: Vec::new() }
}
pub fn alloc(&mut self, t: T) -> usize {
let idx = self.next_idx();
self.data.push(t);
idx
}
fn next_idx(&self) -> usize {
self.data.len()
}
}
We also need to allow indexing into the arena to retrieve values:
use std::ops::Index;
// snip
impl<T> Index<usize> for Arena<T> {
type Output = T;
fn index(&self, idx: usize) -> &Self::Output {
&self.data[idx]
}
}
Let’s employ that index size optimisation from earlier, where we use u32
s for indexes instead of usize
s:
impl<T> Arena<T> {
// snip
pub fn alloc(&mut self, t: T) -> u32 {
let idx = self.next_idx();
self.data.push(t);
idx
}
fn next_idx(&self) -> u32 {
self.data.len() as u32
}
}
impl<T> Index<u32> for Arena<T> {
type Output = T;
fn index(&self, idx: u32) -> &Self::Output {
&self.data[idx as usize]
}
}
We’ve created a new problem for ourselves by doing this, though: when reading code, usize
s are often used as indexes, so we can be reasonably sure what their purpose is. However, now that we’re using u32
s, it can get confusing – if you see
// This is just an example; don’t write this!
struct BinaryExpr {
lhs: u32,
rhs: u32,
op: BinaryOp,
}
in a codebase, it isn’t clear what the usage of that u32
looks like. We can alleviate this problem by wrapping the u32
in a newtype:
#[derive(Debug)]
pub struct Idx {
raw: u32,
}
We can replace all usages of u32
with Idx
and add a little bit of glue:
impl<T> Arena<T> {
// snip
pub fn alloc(&mut self, t: T) -> Idx {
let idx = self.next_idx();
self.data.push(t);
idx
}
fn next_idx(&self) -> Idx {
Idx {
raw: self.data.len() as u32,
}
}
}
impl<T> Index<Idx> for Arena<T> {
type Output = T;
fn index(&self, idx: Idx) -> &Self::Output {
&self.data[idx.raw as usize]
}
}
Let’s take a look at that example from earlier again with this change:
// This is just an example; don’t write this!
struct BinaryExpr {
lhs: Idx,
rhs: Idx,
op: BinaryOp,
}
It’s better, but it’s still unclear. What type do you get as a result of indexing? Let’s add a type parameter to Idx
to improve this further:
#[derive(Debug)]
pub struct Idx<T> {
raw: u32,
_phantom: PhantomData<fn() -> T>,
}
PhantomData
is a way of telling the Rust compiler that, although this type may not actually use the T
type parameter, it still should have the parameter and shouldn’t have an ‘unused type parameter’ error. We don’t use PhantomData<T>
in this case, because that communicates to the compiler that Idx
owns a T
, which can affect drop check. PhantomData<fn() -> T>
avoids this.
Let’s update all of Idx
’s usages:
impl<T> Arena<T> {
// snip
pub fn alloc(&mut self, t: T) -> Idx<T> {
let idx = self.next_idx();
self.data.push(t);
idx
}
fn next_idx(&self) -> Idx<T> {
Idx {
raw: self.data.len() as u32,
_phantom: PhantomData,
}
}
}
impl<T> Index<Idx<T>> for Arena<T> {
type Output = T;
fn index(&self, idx: Idx<T>) -> &Self::Output {
&self.data[idx.raw as usize]
}
}
Now that our arena implementation is complete, we can make use of it in the HIR:
# crates/hir/Cargo.toml
[dependencies]
arena = {path = "../arena"}
ast = {path = "../ast"}
smol_str = "0.1.17"
syntax = {path = "../syntax"}
The first thing we need is a type to store the arena:
// lib.rs
mod database;
pub use database::Database;
// crates/hir/src/database.rs
use crate::Expr;
use arena::Arena;
#[derive(Debug, Default)]
pub struct Database {
exprs: Arena<Expr>,
}
Let’s replace Box<Expr>
with Idx<Expr>
:
// lib.rs
use arena::Idx;
use smol_str::SmolStr;
use syntax::SyntaxKind;
// snip
#[derive(Debug)]
pub enum Expr {
Missing,
Binary {
op: BinaryOp,
lhs: Idx<Self>,
rhs: Idx<Self>,
},
Literal {
n: u64,
},
Unary {
op: UnaryOp,
expr: Idx<Self>,
},
VariableRef {
var: SmolStr,
},
}
We can make this a little cleaner by creating a type alias for Idx<Expr>
:
type ExprIdx = Idx<Expr>;
// snip
#[derive(Debug)]
pub enum Expr {
Missing,
Binary {
op: BinaryOp,
lhs: ExprIdx,
rhs: ExprIdx,
},
Literal {
n: u64,
},
Unary {
op: UnaryOp,
expr: ExprIdx,
},
VariableRef {
var: SmolStr,
},
}
All the lowering methods need to take a &mut Database
as a parameter:
impl Stmt {
fn lower(ast: ast::Stmt, db: &mut Database) -> Option<Self> {
let result = match ast {
ast::Stmt::VariableDef(ast) => Self::VariableDef {
name: ast.name()?.text().clone(),
value: Expr::lower(ast.value(), db),
},
ast::Stmt::Expr(ast) => Self::Expr(Expr::lower(Some(ast), db)),
};
Some(result)
}
}
impl Expr {
fn lower(ast: Option<ast::Expr>, db: &mut Database) -> Self {
if let Some(ast) = ast {
match ast {
ast::Expr::BinaryExpr(ast) => Self::lower_binary(ast, db),
// snip
ast::Expr::ParenExpr(ast) => Expr::lower(ast.expr(), db),
ast::Expr::UnaryExpr(ast) => Self::lower_unary(ast, db),
// snip
}
} else {
// snip
}
}
fn lower_binary(ast: ast::BinaryExpr, db: &mut Database) -> Self {
// snip
Self::Binary {
op,
lhs: db.exprs.alloc(Expr::lower(ast.lhs(), db)),
rhs: db.exprs.alloc(Expr::lower(ast.rhs(), db)),
}
}
fn lower_unary(ast: ast::UnaryExpr, db: &mut Database) -> Self {
// snip
Self::Unary {
op,
expr: db.exprs.alloc(Expr::lower(ast.expr(), db)),
}
}
}
We’ve got a code smell here: we’re threading the same value through each and every method. This is a sign that these methods should be impl
ed on the type of that value, which in this case is &mut Database
. Let’s change the lowering code to be methods on Database
:
// database.rs
use crate::{BinaryOp, Expr, Stmt, UnaryOp};
use arena::Arena;
use syntax::SyntaxKind;
// snip
impl Database {
pub(crate) fn lower_stmt(&mut self, ast: ast::Stmt) -> Option<Stmt> {
let result = match ast {
ast::Stmt::VariableDef(ast) => Stmt::VariableDef {
name: ast.name()?.text().clone(),
value: self.lower_expr(ast.value()),
},
ast::Stmt::Expr(ast) => Stmt::Expr(self.lower_expr(Some(ast))),
};
Some(result)
}
pub(crate) fn lower_expr(&mut self, ast: Option<ast::Expr>) -> Expr {
if let Some(ast) = ast {
match ast {
ast::Expr::BinaryExpr(ast) => self.lower_binary(ast),
ast::Expr::Literal(ast) => Expr::Literal { n: ast.parse() },
ast::Expr::ParenExpr(ast) => self.lower_expr(ast.expr()),
ast::Expr::UnaryExpr(ast) => self.lower_unary(ast),
ast::Expr::VariableRef(ast) => Expr::VariableRef { var: ast.name() },
}
} else {
Expr::Missing
}
}
fn lower_binary(&mut self, ast: ast::BinaryExpr) -> Expr {
let op = match ast.op().unwrap().kind() {
SyntaxKind::Plus => BinaryOp::Add,
SyntaxKind::Minus => BinaryOp::Sub,
SyntaxKind::Star => BinaryOp::Mul,
SyntaxKind::Slash => BinaryOp::Div,
_ => unreachable!(),
};
Expr::Binary {
op,
lhs: self.exprs.alloc(self.lower_expr(ast.lhs())),
rhs: self.exprs.alloc(self.lower_expr(ast.rhs())),
}
}
fn lower_unary(&mut self, ast: ast::UnaryExpr) -> Expr {
let op = match ast.op().unwrap().kind() {
SyntaxKind::Minus => UnaryOp::Neg,
_ => unreachable!(),
};
Expr::Unary {
op,
expr: self.exprs.alloc(self.lower_expr(ast.expr())),
}
}
}
Finally, we need to update hir::lower
to call Database::lower_stmt
:
// lib.rs
pub fn lower(ast: ast::Root) -> (Database, impl Iterator<Item = Stmt>) {
let mut db = Database::default();
(db, ast.stmts().filter_map(|stmt| db.lower_stmt(stmt)))
}
Note how we return the database so that the user of the function can use the otherwise-opaque arena indexes. This function doesn’t compile though, because returning the iterator means that the closure passed to filter_map
won’t be run until the caller decides to consume the iterator. Rust can’t verify that db
will still be alive when the closure is run, so it gives us an error. We can fix this by collecting the iterator to a Vec
:
pub fn lower(ast: ast::Root) -> (Database, Vec<Stmt>) {
let mut db = Database::default();
let stmts = ast.stmts().filter_map(|stmt| db.lower_stmt(stmt)).collect();
(db, stmts)
}
Next, we have several errors caused by borrowing mutably more than once at a time. These can be fixed by extracting the first mutable borrows to temporary variables:
// database.rs
impl Database {
// snip
fn lower_binary(&mut self, ast: ast::BinaryExpr) -> Expr {
// snip
let lhs = self.lower_expr(ast.lhs());
let rhs = self.lower_expr(ast.rhs());
Expr::Binary {
op,
lhs: self.exprs.alloc(lhs),
rhs: self.exprs.alloc(rhs),
}
}
fn lower_unary(&mut self, ast: ast::UnaryExpr) -> Expr {
// snip
let expr = self.lower_expr(ast.expr());
Expr::Unary {
op,
expr: self.exprs.alloc(expr),
}
}
}
The last remaining error stems from us trying to collect
the result of hir::lower
in the REPL. Let’s remove this:
// crates/eldiro/src/main.rs
fn main() -> io::Result<()> {
// snip
loop {
// snip
dbg!(hir::lower(root));
input.clear();
}
}
We also have an unused import that we should delete:
// crates/hir/src/lib.rs
mod database;
pub use database::Database;
use arena::Idx;
use smol_str::SmolStr;
The arena we’ve created is very similar to la-arena, the arena used in rust-analyzer, which is published on crates.io. In fact, la-arena’s implementation is essentially the same as ours, with a few more bells and whistles.2 Let’s delete our implementation, and use la-arena instead:
$ rm -r crates/arena
# Cargo.toml
[dependencies]
ast = {path = "../ast"}
la-arena = "0.2.0"
smol_str = "0.1.17"
syntax = {path = "../syntax"}
// lib.rs
mod database;
pub use database::Database;
use la_arena::Idx;
use smol_str::SmolStr;
// database.rs
use crate::{BinaryOp, Expr, Stmt, UnaryOp};
use la_arena::Arena;
use syntax::SyntaxKind;
Conclusion
I hope you’re enjoying the series; the next part will be a shorter one focusing on tests.