Building a Toy Language Compiler in Go


A follow-up to my previous post about building a toy language interpreter in Go. Throughout this post, I'll be referencing my repo.

We'll go over a compiler which translates source code to machine code, and then a VM to execute this machine code.

What is a compiler? 

A compiler takes source code and translates it to executable machine code. (See my previous post for more details.) The figure below shows the high-level architecture of a simple compiler.


Like the interpreter we built previously, our compiler has a lexer and a parser which turn our source code into an AST. This is our compiler's frontend. However, that's where a compiler diverges. 

The compiler may translate the AST into at least one internal representation (IR). The IR(s) may be represented using any data structure (even another AST!), but they should follow two rules: (1) be easy to produce, (2) be easy to translate into the target machine code. The IR can also be optimized, most commonly for faster performance. There are lots of techniques for improving runtime while only slowing down compilation time a little. 

Finally, the code generator (our compiler's backend) takes the optimized IR and translates it to the target machine code. One of the most important responsibilities of the code generator is assigning registers to hold variables.

To run our toy compiler, use the command go build -o toy && ./toy -engine=vm.

Bytecode

We start by defining our bytecode instructions. This is what our compiler produces, and what our VM will execute later on. Each bytecode instruction is a sequence of bytes. It starts with an opcode (which is a unique value that's 1 byte long),  that represents the operation. The opcode is followed by 0 or more operands, which are each 1 or 2 bytes long in our toy compiler. We're going to use Big Endian representation, which just means that the most significant byte comes first.

We'll be using a stack machine VM later on, which means our instructions will be executing values on a stack. For now, all we have to know is that our bytecode instructions should be doing stack arithmetic.

The most important method in our bytecode.go file is bytecode.Make(), which creates an instruction with the first parameter as the operator, and the subsequent parameters as the operands. 

Compiling a simple instruction  

Our compiler walks the AST and does two things (in its Compile()function):
  • Adds constants to a constant pool (the function addConstant()does this). Our constant pool is just an array of objects, and our VM will retrieve constants in instructions by their indices. 
  • Emits instructions (the function emit()does this). In our case, emit just means "generating an instruction and adding it to our collection in memory."
For example, our simple AST for "1 + 2" will generate the sequence of instructions:
  • bytecode.Make(bytecode.OpConstant, 0)
  • bytecode.Make(bytecode.OpConstant, 1)
  • bytecode.Make(bytecode.OpAdd)
  • bytecode.Make(bytecode.OpPop)
The first instruction refers to our constant "1", which is at the 0th index in our constant pool. The second instruction refers to our constant "2", which is at the 1st index in our constant pool. The third instruction is our "add" operation. The last instruction is because we want to pop values after expressions, so later on, our VM's stack won't fill up too fast.

Running a simple instruction 

Our VM now takes our instructions, and actually executes them using a stack. For example, when it encounters "1+2"'s instructions...


It does this through the run() method, which is just a fetch-decode-execute cycle (also known as an instruction cycle). Our VM simulates what an actual computer does:

  • CPU fetches instruction from memory. (The PC tells CPU where to find the next instruction.) 
  • CPU decodes instruction to figure out what operation to execute.
  • CPU executes instruction. 
Simple enough! We've compiled and executed our first instruction! 

Conditionals: Using Backpatching 

We're now going to go over our first tricky concept: how to execute if and else statements. Let's go over an example:

if (true) {
    10;
} else {
    20;
}

We will make use of two operations:

  • OpJumpNotTruthy: jump to the instruction specified by the first operand if the condition is not truthy (e.g. the condition evaluates to false or null)
  • OpJump: jump, no matter what, to the instruction specified by the first operand 

Our instructions will look like this (where the indices are to the left of the instructions):

The if and else case both work out:

  • If
    • OpTrue is executed
    • OpJumpNotTruthy is executed, but the VM won't jump because the condition evaluated to true
    • OpConstant 0 is executed
    • OpJump 13 is executed, and the VM will jump in this case, because OpJump tells the VM to jump no matter what
  • Else
    • OpFalse is executed (let's say for the sake of this example)
    • OpJumpNotTruthy is executed, and the VM jumps because the condition evaluated to false
    • OpConstant 1 is executed 
This is all good so far, but one question remains: how does the compiler know to put 10 and 13 for the operands of OpJumpNotTruthy and OpJump? It can't look ahead into the future...

This is where backpatching happens. When the compiler first emits OpJumpNotTruthy and OpJump, it actually put a placeholder operand value (a junk value, in our case, 9999). It saves the indices of the OpJumpNotTruthy and OpJump instructions, however, so we can fix their operands later on. Once the compiler compiles the consequence and the alternative block of code, it goes back and replaces the operands of the OpJumpNotTruthy and OpJump instructions. 

Global and Local Bindings: Using a Symbol Table

For bindings to work, we add a symbol table to our compiler to store information about each identifier we hit. For each symbol in our symbol table, we store the name of our identifier, its index (so we can retrieve it), and its scope (e.g. the variable "foo" is in an inner scope if it's defined inside a function body, and in a global scope if it's defined outside a function body.) Including a scope lets us implement both global and local bindings.

We also implement Define() and Resolve() methods, so we can add an identifier to our symbol table every time our compiler encounters a let statement, and get the value of an identifier ever time our compiler encounters an identifier. Just like our constant pool, we have a unique index for each of our symbols. Depending on the identifier's scope, we work with either our outer "global" symbol table, or our inner "local" symbol table.

Compiling Functions: Using Scopes

We need to add some more operations to get compiling functions to work:
  • OpCall: when the VM sees this, it'll execute the compiled function object on top of the stack
  • OpReturnValue: the VM returns the value on top of the stack
  • OpReturnNothing: the VM returns NULL 
We add a stack of scopes for our compiler. We enter a scope (and push it onto our stack) every time we compile a function body. While compiling, we only emit instructions to modify our scope. When we finish compiling our function, we leave our scope (and pop it off the stack), and push our compiled function object. 

Running Functions: Using Frames

There are a couple of things we need to keep in mind when figuring out how to execute functions. 
  • When we call a function, we need to change our instructions and instruction pointer.
  • When we finish executing a function, we need to restore the original instructions and instruction pointer.
This is where call frames come in to our VM. A frame points to a compiled function, the instruction pointer of the compiled function, and a base pointer, which is the location of the bottom of the stack frame. Our VM will contain a stack of frames that we push to and pop from when we encounter functions.

Acknowledgments  

The code is based on the book https://compilerbook.com/. I also reference the Dragon Book. Both books are highly recommended if you find this blog post interesting!