website/content/blog/stack-vm-part-4.md
James Harton af54dd8216
All checks were successful
continuous-integration/drone/push Build is passing
feat: add ash-related posts.
2023-08-16 17:18:23 +12:00

181 lines
7 KiB
Markdown

---
title: Building a stack-based virtual machine, part 4 - code
published: true
description: The fourth part in my exploration of stack based VMs and how they work
date: 2018-09-14
taxonomies:
tags:
- rust
- language
- theory
- computerscience
---
> This post is mirrored from it's original location on [dev.to](https://dev.to/jimsy/building-a-stack-based-virtual-machine-part-4---code-3lmi).
Welcome back.
Now that we have our [instructions](@/blog/stack-vm-part-3.md) and our [stack](@/blog/stack-vm-part-2.md), we need a program that we can run. I've called this `Code`.
So what is code?
Let's look at a really simple assembly program to get some ideas:
```text
@0 = 123
@1 = 456
.main:
push @0
push @1
add
```
This is not a _real_ assembly program, but a simplified version [I came up with](https://gitlab.com/jimsy/stack-vm/blob/master/src/code/debug.rs) for the `stack-vm` debugging output which is loosely based on the [LLVM IR text format](https://llvm.org/docs/LangRef.html). It shows us some important things about a program. A program needs:
- _Data_ - here the constants `123` and `456` are assigned to the names `@0` and `@1`.
- _Labels_ - in this case the label `main` points to the first instruction (instruction pointer = 0). In order to have functions or flow control we need to be able to label sections of code so that we can easily jump to them. We don't jump directly to instruction pointers because we'd probably make a mistake (remember [GOTO](http://www.u.arizona.edu/~rubinson/copyright_violations/Go_To_Considered_Harmful.html)?).
- _Instructions_ - the meat and veg of our program; here we use the instruction `push` twice with one argument each time and then we call the `add` instruction.
So what does this boil down to? Our `Code` struct needs to contain:
- A "data section" - or a list of constant operands.
- A "program section" - or a list of instruction op codes and indexes into the data section.
- A mapping of labels to instruction pointers (or indexes into the program section).
- A list of instruction names (symbols) stored by op code for easier debugging.
With all that in mind here's our `Code` type:
```rust
pub struct Code<T> {
pub symbols: Vec<(usize, String)>,
pub code: Vec<usize>,
pub data: Vec<T>,
pub labels: Vec<(usize, String)>
}
```
We could write our software by just inserting values directly into these vectors but it's not very common to write assembly code by hand. It's much more common to generate code programmatically as an artifact of compiling a higher level language. In order to do that we're going to create our own builder (sometimes called a "code generator" or "assembler").
So what does our builder need to know in order to function?
- An instruction table as outlined in [part 3](@/blog/stack-vm-part-3.md). This is so that we can have access to labels, opcodes and instruction arities.
- _Data_, _Labels_ and _Instructions_ much like our Code type.
> Why separate `Code` and `Builder` if they're mostly the same? I'm glad you asked.
> We _could_ conflate the two use cases together but the main reason they're separated is because the builder needs access to the instruction table for verification when inserting instructions, but code assumes that this process has already been done and can be used to serialising and deserialising to disk or for execution.
Here's our `Builder`:
```rust
pub struct Builder<'a, T: 'a> {
pub instruction_table: &'a InstructionTable<T>,
pub instructions: Vec<usize>,
pub labels: WriteOnceTable<usize>,
pub data: Vec<T>,
}
impl<'a, T> Builder<'a, T> {
pub fn new(instruction_table: &'a InstructionTable<T>) -> Builder<T> {
let mut labels = WriteOnceTable::new();
labels.insert("main", 0);
Builder {
instruction_table: &instruction_table,
instructions: vec![],
labels: labels,
data: vec![],
}
}
}
```
> Don't worry about `WriteOnceTable` - it's just a wrapper around `HashMap` that only lets you set a value once.
So what behaviour does our builder need?
- Add instructions and their arguments into the program.
- Insert labels.
So let's cover off adding instructions first because it's by far the more complex part of the code.
```rust
impl<'a, T> Builder<'a, T> {
pub fn push(&mut self, name: &str, args: Vec<T>) {
let instr = self
.instruction_table
.by_name(name)
.expect(&format!("Unable to find instruction with name {:?}", name));
if args.len() != instr.arity {
panic!("Instruction {} has arity of {}, but you provided {} arguments.", instr.name, instr.arity, args.len())
}
self.instructions.push(instr.op_code);
self.instructions.push(instr.arity);
for arg in args {
let pos = self.push_data(arg);
self.instructions.push(pos);
}
}
}
```
So given the name of the instruction and a collection of operands we take the following steps:
- Look up the instruction in the instruction table.
- Verify that the number of arguments we've been given and the instruction arity match.
- Push the instruction's opcode into the program.
- Push the instruction's arity into the program. This might seem like it's not necessary but is needed for the `Code` type to be independent of the `InstructionTable`.
- Push each argument into the data section and it's index into the program.
Now on to labels:
```rust
impl<'a, T> Builder<'a, T> {
pub fn label(&mut self, name: &str) {
let idx = self.instructions.len();
self.labels.insert(name, idx);
}
}
```
This is pretty straight forward; we simply look up the number of instructions currently in the program and store the name pointing to it.
The last bit of the job is being able to convert a `Builder` into a `Code`. This is a relatively straight forward task:
- Grab all the instruction names (symbols) from the instruction table.
- Grab the program section.
- Grab the data section.
- Grab the labels and sort them by their IP.
- Stuff them all in the Code object.
```rust
impl<'a, T> From<Builder<'a, T>> for Code<T> {
fn from(builder: Builder<T>) -> Code<T> {
let symbols = builder.instruction_table.symbols();
let code = builder.instructions;
let data = builder.data;
let mut labels = vec![];
for key in builder.labels.keys() {
let idx = builder.labels.get(key).unwrap();
labels.push((*idx, key.clone()));
}
labels.sort_by(|lhs, rhs| {
lhs.0.cmp(&rhs.0)
});
Code {
symbols: symbols,
code: code,
data: data,
labels: labels
}
}
}
```
Phew! That's a pretty dense episode for you to watch. I haven't included the tests, but believe me [they're all there](https://gitlab.com/jimsy/stack-vm/blob/master/src/code/mod.rs#L187). In the [next episode](@/blog/stack-vm-part-5.md) we'll wire everything up into a machine and actually run a program!