The most fundamental part of any language is the parser1 – a piece of software whose purpose is to take a flat structure (usually text in some form) and convert it into a tree structure. In this post, we’ll make a parser for mathematical expressions that don’t contain nesting. For example, 1 + 1
is allowed, but 2 * 3 + 4
isn’t (because that’s shorthand for (2 * 3) + 4
, and thus contains nesting). I’ve chosen to keep it this simple so that we don’t need to worry about order of operations.2
Let’s think about the problem: the only two elements that are allowed in our expressions are numbers and operators. If we can make a parser for numbers and a parser for operators, we’ll be well on our way to finishing the first incarnation of our parser.
A parser for numbers
As we’re good test-driven developers, we’ll start by writing a test. First, we open src/lib.rs
and are greeted by this:
#[cfg(test)]
mod tests {
#[test]
fn it_works() {
assert_eq!(2 + 2, 4);
}
}
Let’s remove the it_works
test, and change it to a test that checks if we can parse numbers:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_number() {
assert_eq!(Number::new("123"), Number(123));
}
}
Let’s follow TDD some more: what’s the shortest thing we can write that makes this pass? Well, first, we need to make it compile.
#[derive(Debug, PartialEq)]
pub struct Number(pub i32);
impl Number {
pub fn new(s: &str) -> Self {}
}
This still doesn’t compile, because new
doesn’t return anything. Well, the easiest thing we can do is use the parse
method on str
:
impl Number {
pub fn new(s: &str) -> Self {
Self(s.parse().unwrap())
}
}
Now it compiles! Let’s see if the tests pass:
$ cargo t
Compiling eldiro v0.1.0 (/home/me/src/eldiro)
Finished test [unoptimized + debuginfo] target(s) in 0.14s
Running /home/me/.cache/cargo-target/debug/deps/eldiro-4b82c3c57a78933f
running 1 test
test tests::parse_number ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests eldiro
running 0 tests
Hooray, it works!
A parser for operators
We’ll start by writing tests for addition, subtraction, multiplication and division:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_add_op() {
assert_eq!(Op::new("+"), Op::Add);
}
#[test]
fn parse_sub_op() {
assert_eq!(Op::new("-"), Op::Sub);
}
#[test]
fn parse_mul_op() {
assert_eq!(Op::new("*"), Op::Mul);
}
#[test]
fn parse_div_op() {
assert_eq!(Op::new("/"), Op::Div);
}
}
Op
is undefined, so let’s declare that:
#[derive(Debug, PartialEq)]
pub enum Op {
Add,
Sub,
Mul,
Div,
}
Now the only remaining compile error is that Op::new
doesn’t exist:
impl Op {
pub fn new(s: &str) -> Self {
match s {
"+" => Self::Add,
"-" => Self::Sub,
"*" => Self::Mul,
"/" => Self::Div,
_ => panic!("bad operator"),
}
}
}
Instead of handling the case where we’ve been given a non-existent operator, we just crash the program. Obviously this is far from ideal, but we’re keeping it really simple at the moment.
$ cargo t
// snip
running 5 tests
test tests::parse_add_op ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test tests::parse_sub_op ... ok
test tests::parse_number ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
// doc-test stuff that we’re not interested in
Nice! We’ve now got the ingredients we need to parse our ultra-simplified mathematical expressions.
Putting it all together
As usual, we’ll write a test:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_one_plus_two() {
assert_eq!(
Expr::new("1+2"),
Expr {
lhs: Number(1),
rhs: Number(2),
op: Op::Add,
},
);
}
}
In case you’re not familiar with the jargon, Expr
is a common shorthand for Expression
, lhs
is short for left-hand side, and, shockingly, rhs
stands for right-hand side.
Expr
will be the type that holds the structure of what an expression actually is in Eldiro. Since we don’t support nesting, Expr
’s definition is straightforward:
#[derive(Debug, PartialEq)]
pub struct Expr {
pub lhs: Number,
pub rhs: Number,
pub op: Op,
}
Let’s tentatively define the new
method:
impl Expr {
pub fn new(s: &str) -> Self {
let lhs = Number::new(s);
let rhs = Number::new(s);
let op = Op::new(s);
Self { lhs, rhs, op }
}
}
This won’t work, though, because we’re trying to parse the left-hand side (as well as the right-hand side and the operator) from the entire input, when in reality parsing the left-hand side only needs to look at a small portion at the start of the input. The same goes for the right-hand side and the operator, too. Let’s run the tests to see if we’re correct:
$ cargo t
running 6 tests
test tests::parse_add_op ... ok
test tests::parse_div_op ... ok
test tests::parse_number ... ok
test tests::parse_mul_op ... ok
test tests::parse_sub_op ... ok
test tests::parse_one_plus_two ... FAILED
failures:
---- tests::parse_one_plus_two stdout ----
thread 'tests::parse_one_plus_two' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', src/lib.rs:6:24
And sure enough, the test fails. Hmm …
So, what we need to do is … extract the numbers at the start of the input? Something like this:
assert_eq!(extractor_thing("1+2"), "1");
Well, if we’ve now got the left-hand side as its own &str
, we need some way to get the operator. I mean, if we just scanned past the left-hand side, why don’t we continue from where we left off? To do that, extractor_thing
will have to return a pair of the part of the input it extracted, as well as the leftovers:
// extracted portion ╮
// leftover ╮ │
assert_eq!(extractor_thing("1+2"), ("+2", "1"));
Let’s get started on writing this extractor_thing
in a new module called utils
. Create the file src/utils.rs
and add mod utils;
to the top of lib.rs
.
// utils.rs
pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
(&s[1..], &s[0..1])
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn extract_one_digit() {
assert_eq!(extract_digits("1+2"), ("+2", "1"));
}
}
Well, the tests pass, but we’re not thinking ahead; what if the number is more than one digit long?
pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
let mut digits_end = 0;
for (idx, c) in s.char_indices() {
if c.is_ascii_digit() {
digits_end = idx + 1;
} else {
break;
}
}
let digits = &s[..digits_end];
let remainder = &s[digits_end..];
(remainder, digits)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn extract_one_digit() {
assert_eq!(extract_digits("1+2"), ("+2", "1"));
}
#[test]
fn extract_multiple_digits() {
assert_eq!(extract_digits("10-20"), ("-20", "10"));
}
}
This works, but isn’t very idiomatic, with that for
loop and mut
ation. Let’s rewrite it to use iterator adaptors:
pub(crate) fn extract_digits(s: &str) -> (&str, &str) {
let digits_end = s
.char_indices()
.find_map(|(idx, c)| if c.is_ascii_digit() { None } else { Some(idx) })
.unwrap_or_else(|| s.len());
let digits = &s[..digits_end];
let remainder = &s[digits_end..];
(remainder, digits)
}
What we’re doing here is looking at all the characters (and their respective indices) in s
– if the character is an ASCII digit (i.e. anywhere from 0
to 9
) we move onto the next one. However, if the character isn’t an ASCII digit, we stop there and return the index of that character. On the off-chance that we never encounter a non-digit character, the index where the digits end is the length of the input (as the entire input consists of digits). We should probably add some tests to verify that:
#[cfg(test)]
mod tests {
// snip
#[test]
fn do_not_extract_anything_from_empty_input() {
assert_eq!(extract_digits(""), ("", ""));
}
#[test]
fn extract_digits_with_no_remainder() {
assert_eq!(extract_digits("100"), ("", "100"));
}
}
And, indeed, that does work:
$ cargo t
running 9 tests
test tests::parse_add_op ... ok
test tests::parse_number ... ok
test tests::parse_sub_op ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test utils::tests::do_not_extract_anything_from_empty_input ... ok
test utils::tests::extract_multiple_digits ... ok
test utils::tests::extract_one_digit ... ok
test utils::tests::extract_digits_with_no_remainder ... ok
test tests::parse_one_plus_two ... FAILED
Huh, what’s that failed test there? Oh, right, we were trying to parse "1+2"
.
Remember this?
// lib.rs
impl Expr {
pub fn new(s: &str) -> Self {
let lhs = Number::new(s);
let rhs = Number::new(s);
let op = Op::new(s);
Self { lhs, rhs, op }
}
}
Let’s convert it to use our new extract_digits
function:
impl Expr {
pub fn new(s: &str) -> Self {
let (s, lhs) = utils::extract_digits(s);
let lhs = Number::new(lhs);
let (s, rhs) = utils::extract_digits(s);
let rhs = Number::new(rhs);
// Uhhh ...
let op = Op::new(s);
Self { lhs, rhs, op }
}
}
This won’t do, this won’t do at all! Not only are we parsing the operator after the right-hand side (even though it should come before), but we also haven’t applied the extraction logic we used for numbers to parsing operators. I guess we should head back to utils.rs
and write up another function.
pub(crate) fn extract_op(s: &str) -> (&str, &str) {
match &s[0..1] {
"+" | "-" | "*" | "/" => {}
_ => panic!("bad operator"),
}
(&s[1..], &s[0..1])
}
#[cfg(test)]
mod tests {
// snip
#[test]
fn extract_plus() {
assert_eq!(extract_op("+2"), ("2", "+"));
}
#[test]
fn extract_minus() {
assert_eq!(extract_op("-10"), ("10", "-"));
}
#[test]
fn extract_star() {
assert_eq!(extract_op("*3"), ("3", "*"));
}
#[test]
fn extract_slash() {
assert_eq!(extract_op("/4"), ("4", "/"));
}
}
Let’s apply this in Expr::new
:
impl Expr {
pub fn new(s: &str) -> Self {
let (s, lhs) = utils::extract_digits(s);
let lhs = Number::new(lhs);
let (s, op) = utils::extract_op(s);
let op = Op::new(op);
let (s, rhs) = utils::extract_digits(s);
let rhs = Number::new(rhs);
Self { lhs, rhs, op }
}
}
Every time we extract something from the input, we declare a new binding with the same name as the initial input s
so that everything we do with s
afterwards doesn’t include the part we ‘consumed’.
Really, though, shouldn’t Number::new
and Op::new
handle extracting the relevant text from the input themselves? Let’s update the tests:
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_number() {
assert_eq!(Number::new("123"), ("", Number(123)));
}
#[test]
fn parse_add_op() {
assert_eq!(Op::new("+"), ("", Op::Add));
}
#[test]
fn parse_sub_op() {
assert_eq!(Op::new("-"), ("", Op::Sub));
}
#[test]
fn parse_mul_op() {
assert_eq!(Op::new("*"), ("", Op::Mul));
}
#[test]
fn parse_div_op() {
assert_eq!(Op::new("/"), ("", Op::Div));
}
}
Since all our tests completely consume their inputs, there’s never any leftover input. To represent this, we’ve wrapped the expected output of each parser in a tuple where the first item is an empty string literal.
Let’s update all our parsers to independently extract text from the input:
impl Number {
pub fn new(s: &str) -> (&str, Self) {
let (s, number) = utils::extract_digits(s);
(s, Self(number.parse().unwrap()))
}
}
// snip
impl Op {
pub fn new(s: &str) -> (&str, Self) {
let (s, op) = utils::extract_op(s);
let op = match op {
"+" => Self::Add,
"-" => Self::Sub,
"*" => Self::Mul,
"/" => Self::Div,
_ => unreachable!(),
};
(s, op)
}
}
// snip
impl Expr {
pub fn new(s: &str) -> Self {
let (s, lhs) = Number::new(s);
let (s, op) = Op::new(s);
let (s, rhs) = Number::new(s);
Self { lhs, rhs, op }
}
}
Look at how clean Expr::new
is now! But we still have one problem: what if Expr::new
doesn’t fully consume its input? Let’s make Expr::new
return a (&str, Self)
like all the other parsers:
impl Expr {
pub fn new(s: &str) -> (&str, Self) {
let (s, lhs) = Number::new(s);
let (s, op) = Op::new(s);
let (s, rhs) = Number::new(s);
(s, Self { lhs, rhs, op })
}
}
// snip
#[cfg(test)]
mod tests {
// snip
#[test]
fn parse_one_plus_two() {
assert_eq!(
Expr::new("1+2"),
(
"",
Expr {
lhs: Number(1),
rhs: Number(2),
op: Op::Add,
},
),
);
}
}
Let’s finish off this instalment of Make A Language by basking in the glory of a passing test suite.
$ cargo t
Compiling eldiro v0.1.0 (/home/me/src/eldiro)
Finished test [unoptimized + debuginfo] target(s) in 0.46s
Running /home/me/.cache/cargo-target/debug/deps/eldiro-4b82c3c57a78933f
running 14 tests
test tests::parse_add_op ... ok
test tests::parse_number ... ok
test tests::parse_div_op ... ok
test tests::parse_mul_op ... ok
test utils::tests::extract_minus ... ok
test utils::tests::extract_multiple_digits ... ok
test tests::parse_sub_op ... ok
test utils::tests::extract_one_digit ... ok
test tests::parse_one_plus_two ... ok
test utils::tests::extract_digits_with_no_remainder ... ok
test utils::tests::do_not_extract_anything_from_empty_input ... ok
test utils::tests::extract_plus ... ok
test utils::tests::extract_slash ... ok
test utils::tests::extract_star ... ok
test result: ok. 14 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests eldiro
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Beautiful! Next time, we’ll add support for whitespace.
For most languages it’s actually the lexer, but we’ll worry about that when we improve the parser in a future post. ↩︎
If you really want to learn about operator precedence and how you can handle it in a parser, take a look at this great article. Yet again, I will only cover this topic once we’ve rewritten the parser. ↩︎