Scifi books: Grand Cru 2018

I read a bunch of sci-fi this year. From light short stories to larger epopea. One common thing between all those books is that they manage to dream about all the possibilities the future can offer us:

  • spaceships? I want.
  • Rogue AI? Amazing.
  • Rejuvenation? Downloading your thoughts to your brain? That is the real killer.

Maybe the fact that most of them offer a solution to human death is what really keep me reading them. Anyway, enough of this introspective nonsense. Here is a selection of books I really liked in 2018 (I suck at describing books, so quotes are from goodreads and the comments below are mine).

The murderbot diaries

On a distant planet, a team of scientists are conducting surface tests, shadowed by their Company-supplied ‘droid — a self-aware SecUnit that has hacked its own governor module, and refers to itself (though never out loud) as “Murderbot.” Scornful of humans, all it really wants is to be left alone long enough to figure out who it is.

Follows a rogue robot, who despite his acerbic comments and infinite cynicism, makes a nicer human than most.

Bobiverse

Bob Johansson has just sold his software company and is looking forward to a life of leisure. There are places to go, books to read, and movies to watch. So it’s a little unfair when he gets himself killed crossing the street. Bob wakes up a century later to find that corpsicles have been declared to be without rights, and he is now the property of the state. He has been uploaded into computer hardware and is slated to be the controlling AI in an interstellar probe looking for habitable planets.

This book is the ultimate nerd novel. Prepare to hear a lot about Dyson spheres and von Neumann probe.

Imperial Radch

On a remote, icy planet, the soldier known as Breq is drawing closer to completing her quest. Once, she was the Justice of Toren - a colossal starship with an artificial intelligence linking thousands of soldiers in the service of the Radch, the empire that conquered the galaxy. Now, an act of treachery has ripped it all away, leaving her with one fragile human body, unanswered questions, and a burning desire for vengeance.

Again a story about a rogue UI. The pace is a bit strange, with the second book in the series focusing on the politics of a single space station.

Homo Deus: A brief history of Tomorrow

Homo Deus explores the projects, dreams and nightmares that will shape the twenty-first century—from overcoming death to creating artificial life. It asks the fundamental questions: Where do we go from here? And how will we protect this fragile world from our own destructive powers? This is the next stage of evolution. This is Homo Deus.

Well, it is not real science fiction. The author presents what he believes will be the major problems of our society of tomorrow. It is the duty of historian and philosophers to remind people of the danger of technology he argues, because engineers focus mostly on the creation of technology and not on the consequences.

House of Suns

Six million years ago, at the very dawn of the starfaring era, Abigail Gentian fractured herself into a thousand male and female clones: the shatterlings. Sent out into the galaxy, these shatterlings have stood aloof as they document the rise and fall of countless human empires. They meet every two hundred thousand years, to exchange news and memories of their travels with their siblings. Campion and Purslane are not only late for their thirty-second reunion, but they have brought along an amnesiac golden robot for a guest. But the wayward shatterlings get more than the scolding they expect: they face the discovery that someone has a very serious grudge against the Gentian line, and there is a very real possibility of traitors in their midst. The surviving shatterlings have to dodge exotic weapons while they regroup to try to solve the mystery of who is persecuting them, and why - before their ancient line is wiped out of existence, forever.

I read The Prefect from Reynolds before, but while it was a nice read, it didn’t gave me the irresistible need to read more of his work. House of Suns however, was full of surprises and epic mysteries. A must read.

Ethereum Virtual Machine in Rust - Part 3: Looping and Functioning

Looping and calling Functions in the EVM!

Last time, we saw how to add two numbers and how to execute a If-Else statement using the stack in the EVM. This time, I want to continue learning about how the stack works by taking a look at loops. I’ll follow the same methodology as before:

  • First, compile and read the binary from solc
  • Implement an interpreter in Rust

Once you’ve built a mental model about how the stack works, reading binary data is not that complicated. It takes some effort to follow the evolution of the stack but everything is well documented.

Compiling a loop

The smart contract is super simple:

pragma solidity ^0.4.0;

contract TestingStuff {

	uint public x;

    function add() public {
        uint sum = 0;
        for (uint i = 0; i < 10; i++) {
            sum += i;
        }
        x = sum;
    }
}

Maybe at the end of the serie I’ll try something useful.

By the way, a for loop can be written as a while loop easily.

{
    uint i = 0;
    while (i < 10) {
        sum += i
        i++;
    }
}

Just by looking at the code and thinking the basic elements of a IF statement, we can guess how this will be compiled.

  • First, we need to initialize i. Push 0 on the stack.
  • Then, we need to compare with 0xA.
  • Using the JUMPI instruction, we can either execute the loop body or skip it.
  • We execute the loop body
  • We increase i at the end of the loop body
  • we jump back to the condition

The missing block here is the JUMP back.

Indeed, when compiling, there is not so many surprises:

0x91    PUSH1   Place 1-byte item on the stack 0x0
0x93    DUP1    Duplicate 1st stack item.
0x94    PUSH1   Place 1-byte item on the stack 0x0
0x96    SWAP2   Swap 1st and 3rd stack items.
0x97    POP     Remove item from stack
0x98    PUSH1   Place 1-byte item on the stack 0x0
0x9a    SWAP1   Swap 1st and 2nd stack items.
0x9b    POP     Remove item from stack
0x9c    JUMPDES  Destination
0x9d    PUSH1   Place 1-byte item on the stack 0xa
0x9f    DUP2    Duplicate 2nd stack item.
0xa0    LT      Less-than comparison
0xa1    ISZERO  Simple not operator.
0xa2    PUSH1   Place 1-byte item on the stack 0xb5
0xa4    JUMPI   Conditional jump to destination
0xa5    DUP1    Duplicate 1st stack item.
0xa6    DUP3    Duplicate 3rd stack item.
0xa7    ADD     Addition operation
0xa8    SWAP2   Swap 1st and 3rd stack items.
0xa9    POP     Remove item from stack
0xaa    DUP1    Duplicate 1st stack item.
0xab    DUP1    Duplicate 1st stack item.
0xac    PUSH1   Place 1-byte item on the stack 0x1
0xae    ADD     Addition operation
0xaf    SWAP2   Swap 1st and 3rd stack items.
0xb0    POP     Remove item from stack
0xb1    POP     Remove item from stack
0xb2    PUSH1   Place 1-byte item on the stack 0x9c
0xb4    JUMP     Jump to destination
0xb5    JUMPDES  Destination
0xb6    DUP2    Duplicate 2nd stack item.
0xb7    PUSH1   Place 1-byte item on the stack 0x0
0xb9    DUP2    Duplicate 2nd stack item.
0xba    SWAP1   Swap 1st and 2nd stack items.
0xbb    SSTORE  Save word to storage
  • 0x91 to 0x9b: i is initialized.
  • 0x9c: Set an anchor for a jump.
  • 0x9d to 0xA2: This is a condition statement, similar to if.
  • 0xa2: Address for after the loop
  • 0xa4: JUMPI which will check if we need to loop again
  • 0xa5 to 0xb1: Body of the loop
  • 0xb2: Push the loop condition address to the stack
  • 0xb4: Jump to the address located in the first item of the stack
  • 0xb5 to 0xbb: After the loop.

Testing in the VM

The only missing part in our Rust VM is the implementation of JUMP. This is even more simple than JUMPI as we just need to pop the next value of pc from the stack.

Opcode::JUMP(_addr) => {
    let jump_location = self.stack.pop().unwrap();
    self.pc = jump_location.as_u64() as usize;
},

Creating binary code by hand is beginning to be a bit of work so I’ll just implement some missing opcodes such as DUP, SWAP and ISZERO and I’ll use the code compiled by solc after cleaning it up a bit (need to replace the JUMP address by the real ones).

5b60008060009150600090505b600a81101560255780820191508080600101915050600c565b81600081905550505000

Interlude

Code begins to be more and more complex so I took some time to create a step by step debugger.

First, add this to the Vm implementation. It can be completed later with more debugging statements.

    // see part 2 for print_stack
    pub fn print_stack(&self) {
        self.stack
            .iter()
            .enumerate()
            .rev()
            .for_each(|(i, x)| {
                let mut bytes = vec![0;32];
                x.to_big_endian(&mut bytes);
                println!("|{}:\t{:?}|", i, bytes)
            });
    }

    pub fn print_debug(&self) {
        println!("pc:{}\n", self.pc);
        println!("Stack:");
        self.print_stack();
    }

Then, main.rs will be modified to execute this:

use std::io;

fn debug(vm: &mut Vm) {

    loop {
        if vm.at_end {
            break;
        }

        // Debugger.
        // c to continue
        // s to print stack
        // q to quit
        let mut input = String::new();
        io::stdin().read_line(&mut input).ok().expect("Couldn't read line");

        match &*input {
            "c\n" => vm.interpret(),
            "s\n" => vm.print_debug(),
            "q\n" => break,
            _ => println!("Please type either c, s or q"), 
        }
    }

}

Using this function, I can execute the bytecode step by step and take a look at the current VM state.

Running it!

Mmh, there are a few surprises here (I forgot to implement POP and had a bug in my JUMPI). When decomposed, a for loop is quite simple to implement using the EVM opcodes.

Pure functions

Just taking a look at for loops makes a light blog post so I’ll also talk about pure functions. According to solidity documentation,

Functions can be declared pure, in which case they promise not to read from or modify the state.

Looks like a simple first step into understand functions in the compiled code.

pragma solidity ^0.4.0;

contract TestingStuff {

	uint public x;

    function f(uint a, uint b) public pure returns (uint) {
        return a * (b + 42);
    }
    
    function add() public {
        x = f(1, 4);
    }
}

What could go wrong.

A quick peak at the compile code

What I want to find in the compile code is:

  • How to store the parameters of a function
  • How to call the function
  • How to return a value So let’s try to find that out.
0xef	JUMPDES	 Destination
0xf0	PUSH1	Place 1-byte item on the stack 0x0
0xf2	PUSH1	Place 1-byte item on the stack 0x2a
0xf4	DUP3	Duplicate 3rd stack item.
0xf5	ADD	Addition operation
0xf6	DUP4	Duplicate 4th stack item.
0xf7	MUL	Multiplication operation
0xf8	SWAP1	Swap 1st and 2nd stack items.
0xf9	POP	Remove item from stack
0xfa	SWAP3	Swap 1st and 4th stack items.
0xfb	SWAP2	Swap 1st and 3rd stack items.
0xfc	POP	Remove item from stack
0xfd	POP	Remove item from stack
0xfe	JUMP	 Jump to destination
0xff	JUMPDES	 Destination
0x100	PUSH2	Place 2-bytes item on the stack 0x1 0xb
0x103	PUSH1	Place 1-byte item on the stack 0x1
0x105	PUSH1	Place 1-byte item on the stack 0x4
0x107	PUSH2	Place 2-bytes item on the stack 0x0 0xef
0x10a	JUMP	 Jump to destination
0x10b	JUMPDES	 Destination
0x10c	PUSH1	Place 1-byte item on the stack 0x0
0x10e	DUP2	Duplicate 2nd stack item.
0x10f	SWAP1	Swap 1st and 2nd stack items.
0x110	SSTORE	Save word to storage
0x111	POP	Remove item from stack
0x112	JUMP	 Jump to destination
0x113	STOP	Halts execution

No biggies here. The function starts at address 0x100. Let’s take a look at the stack during the execution.

[0x10b]
[0x10b, 0x01]
[0x10b, 0x01, 0x04]
[0x10b, 0x01, 0x04, 0xef]

This is the stack before 0x10a (JUMP). The address of where to continue after calling the function is first put on the stack. Then the two function parameters are added, then the address of the code for the function is pushed on the stack so that we can call jump.

[0x10b, 0x01, 0x04] // Call jump

At this point, we are at address 0xef
[0x10b, 0x01, 0x04, 0x0]
[0x10b, 0x01, 0x04, 0x0, 0x2a]
[0x10b, 0x01, 0x04, 0x0, 0x2a, 0x4]
[0x10b, 0x01, 0x04, 0x0, 0x2e]
[0x10b, 0x01, 0x04, 0x0, 0x2e, 0x01]
[0x10b, 0x01, 0x04, 0x0, 0x2e]
[0x10b, 0x01, 0x04, 0x2e, 0x0]
[0x10b, 0x01, 0x04, 0x2e]
[0x2e, 0x01, 0x04, 0x10b] // swap 1st and 4th
[0x2e, 0x10b, 0x04, 0x01] // swap 1st and 3rd 
[0x2e, 0x10b, 0x04] 
[0x2e, 0x10b] 

A lot of swap, dup is going on but essentially the calculation is made. Then, by swapping element and popping the intermediate variables, the stack end up with the return value in second position and the address of where to return in the first position.

After jumping, it will look like:

[0x2e]
[0x2e, 0x0]
[0x2e, 0x0, 0x2e]
[0x2e, 0x2e, 0x0]

Then, the store instruction is executed as expected.

The funny thing here is that I do not have to add anything to my Rust code in order to make it work for these kind of functions. A pure function is just a clever combinaison of JUMPs, PUSH and SWAP:

  • Return location is pushed
  • Arguments are pushed in order
  • Function location is pushed then we jump to it
  • Body of the function is executed
  • Return location is put at the top of the stack, return value is right next behind
  • We jump to the return location. Value at the top of the stack is the function return value. Easy peasy.

Chip8 emulation, 3 take-aways!

Chip 8 Emulator

Nostalgia is big in the gaming scene so it is not a wonder that emulators of all sort would see the light. Gameboy, PS1, Nintendo 64… How did they work exactly was a bit a mystery to me before. When I found this article http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/ , I couldn’t help but coding my own Chip8 emulator.

I won’t describe the code here as the original article does already a great job. However, there are a few points that I learnt and will definitively apply to my next emulator project (Gameboy/NES of course ! I need to play Zelda).

DRY

Emulation involves a lot of repetitive work. Opcodes sometimes are really similar (add constant to register, add two registers together). Also, the program counter needs to be increased most of the time after executing an opcode. All the common operations (fetching information from the opcode, increasing the opcode) should be factored in common functions in order to avoid bugs when writing the same thing a lot of times. I know, this is coding 101 but it does not hurt to repeat especially in that case where we can find a lot of code duplication.

Unit test that thing

Not everything can be testing in emulation. For example, graphics and sound might be tricky. Testing the opcodes individually however is easy and should be done to validate the correct behaviour.

From my emulator:

TEST(opcode, op_2nnn) {
    Chip8FreeAccess chip8;

    // 0x2204 - execute subroutine at index 204.
    std::vector<uint8_t> source {0x22, 0x04, 0xA2, 0xF0, 0xA2, 0xFF};
    chip8.load_from_buffer(source);
    chip8.emulateCycle();

    // stack should be one. pc would be stored as 0x200. Current pc is 0x204.
    ASSERT_EQ(1, chip8.sp());
    ASSERT_EQ(0x204, chip8.pc());
    ASSERT_EQ(0x200, chip8.stacks()[0]);

    // to confirm execute the next cycle
    chip8.emulateCycle();
    auto i = chip8.I();
    ASSERT_EQ(0x2FF, i);
}

This is testing that the subroutine opcode does the correct thing. I have similar tests for most opcodes (and should have for all of them…).

Testing all combination of opcodes is not possible but fortunately the emulation scene provides some ROM which are only used for testing emulators.

Create tools ASAP

Using gdb is great to debug programs but I also created some debugging tools to dump info in the emulator window at the same time as the game is running. If the code is stuck because of a bug, I can see immediately on my screen.

Also, I have some code to only print the opcodes in a binary file, not execute them. That way I can take my time to look at the code without running the game.

Especially when reverse-engineering (see Ethereum articles), taking a good look at the binary format saves a LOT of time so implementing tooling will be something I’ll do again next time.

I put my emulator on Github. It should work with most of the games in the repo although I didn’t test it extensively. My next project is a NES emulator, which is a tad more complication but I cannot wait to start. Cheers!

Ethereum Virtual Machine in Rust - Part 2: The stack

Memory model of Ethereum - The stack

Welcome to the second post of my Ethereum VM series. Part 1 was dedicated to the instruction set of the Ethereum VM and showed a simple way of printing instructions from a smart contract binary file. Now, I’ll talked more about the memory model of the VM and how to understand the code compiled by solc.

I’ll try to answer those questions:

  • How are additions, multiplications and subtractions implemented?
  • How to find the compiled instructions for a block of solidity code?
  • How are IF condition implemented?
  • Where is the smart contract data stored?

There is a lot to cover here. In this post, I’ll show what kind of operations are possible using the stack. There is a lot to cover so maybe this post will be split in two…

VM specifications

The VM is described in the yellow paper, part 9: Yellow paper

The word size of the machine (and thus size of stack items0 is 256-bits.

32 bytes word size is quite large. Yellow paper mentioned it is to facilitate the Keccak-256 hash scheme and elliptic-curve computations. Hum. Let’s see later if we can find out what they mean.

The EVM is a simple stack-based architecture.

Stack-based architectures carry out their operation by pushing and popping items on a stack. A stack is just a LIFO (last in, first out) container. For example, a + b can be done by:

  • pushing a on the stack
  • pushing b on the stack
  • pop a and pop b
  • push a+b on the stack

The memory model is a simple word-addressed byte array.

Memory is a random-access byte array. We read a word at a time (32 bytes).

Unlike memory which is volatile, storage is non volatile and is maintained as part of the system state. This is where the fun happen. The data is persisted between transactions in this storage. It is stored under the contract’s account.

Fun with the stack

According to Rust official documentation, the best way to model a stack is just to use a vector. Vec has push and pop operations so it fits well with our use case. Unfortunately, the size of the stack items is 256 bits, which is not standard in Rust so I needed to use an external package (I don’t really want to implement 256-bits arithmetic myself…).

The bigint package looks pretty complete: BigInt documentation We’ll have to be careful with endianess here… (Ah, can’t avoid this problem forever)

Starting from part 1 implementation, our VM will become:

struct Vm {
    code: Vec<u8>, // This is the smart contract code.
    pc: usize, 
    stack: Vec<U256>,
}

And the factory function:

    fn new_from_file(filename: &str) -> Result<Vm, Box<Error>> {
        let mut f = File::open(filename)?;
        let mut buffer = String::new();
        f.read_to_string(&mut buffer)?;

        let code = decode(&buffer)?;
        Ok(Vm { code: code, pc: 0, stack: Vec::new()})
    }

Additions

As mentioned above, an addition implemented with a stack can be with the following operations:

  • PUSH left operand
  • PUSH right operand
  • POP 2 items from stack and add them
  • PUSH result

It can be easily translated to Ethereum instructions.

  • PUSH1 left
  • PUSH1 right
  • ADD

For example, if we want to add 2 and 5, the binary code would be: 0x60 0x02 0x60 0x05 0x01 PUSH 2 PUSH 5 ADD

Let’s save the string “6002600501” to a file and use it for our testing.

I’ll create an interpret function that will read the next instruction and apply it to the VM.


    fn interpret(&mut self) {
        let maybe_op = self.next();

        // just for debugging
        match &maybe_op {
            Some(x) => x.describe(),
            None => {}
        }

        // The real execution
        match &maybe_op {
            Some(x) => {
                match x {
                Opcode::PUSH1(addr, value) => {
                    // Value is u8, we need to push a u256 on the stack...
                    self.stack.push(U256::from(*value));
                },
                Opcode::ADD(addr) => {
                    // How to recover nicely? There is no meaning in recovering nicely here.
                    // Just burn and crash if cannot pop.
                    let v1 = self.stack.pop().unwrap();
                    let v2 = self.stack.pop().unwrap();
                    self.stack.push(v1 + v2);
                },
                _ => {
                    // not implemented, just chill
                }
                }
            },
            None => {}
        }
    }

For pushing a u8 integer on the stack, we can just convert it to U256 and use the built-in push method of vector. Popping is also built-in, but it returns an Option. If the option is None at this point, there is NO way to recover gracefully as something must have gone wrong with the compilation. Unwrap looks like a correct choice in that case.

By the way, I am printing the instruction description before execution. It’s always a good idea to add this kind of debug print during development. It is easier to find out when something goes wrong. In the same spirit, I’ll add a function to print the stack.


    fn print_stack(&self) {
        self.stack
            .iter()
            .enumerate()
            .rev()
            .for_each(|(i, x)| {
                let mut bytes = vec![0;32];
                x.to_big_endian(&mut bytes);
                println!("|{}:\t{:?}|", i, bytes)
            });
    }

Here again, chaining iterator methods really makes my day :).

My main.rs is growing quite a lot so I’ll separate the code in three files:

  • main.rs: Entry point. Can either print the instructions or execute the code
  • evm/vm.rs: VM code and execution
  • evm/opcode.rs: Opcode enumeration and debug print function

Main file will look like that:

mod evm;
use evm::vm::Vm;
use evm::opcode::Opcode;
use std::error::Error;
use std::env;

fn debug(vm: &mut Vm) {
    loop {
        match vm.next() {
            Opcode::EOF => break,
            x => x.describe(),
        }
    }
}

fn interpret(vm: &mut Vm) {
    while !vm.at_end {
        vm.interpret();
    }
    vm.print_stack();
}

fn run() -> Result<(), Box<Error>> {

    let args: Vec<String> = env::args().collect();
    let function = args[1].clone();
    let filename = args[2].clone();

    println!("In file {}", filename);
 
    let mut vm = Vm::new_from_file(&filename)?;
    println!("Correctly loaded VM");

    match &*function {
        "debug" => debug(&mut vm),
        "run" => interpret(&mut vm),
        _ => panic!("Expect either 'debug' or 'run' for first parameter")
    }
    Ok(())
}

fn main() {
    run().unwrap();
}

Also, I did a bit of refactoring for printing the Opcode. Instead of returning Option<Opcode> from the next method, I’ll return Opcode directly. I also add a new Opcode for debugging called UNKNOWN which contains the opcode that is not implemented yet.

A word on unit testing

Previously I was just struggling to get my program working. Now that I am a bit more confident in Rust (or is it an illusion), it is time to add some unit testing.

The convention in Rust is to put unit tests for a module into a submodule. Basically, it comes to adding the following at the end of each file.

#[cfg(test)]
mod test {
    // unit tests here
}

For my vm code, I expect the code to grow quite a lot considering the number of opcode to implement. I’ll move the tests to another file. That was actually more annoying to do than I initially though. If you have a better way to do this, please add a comment.

The Vm fields need to be public. The test file is in evm/vm_test.rs and the module is added in evm/mod.rs.


// evm/mod.rs
#[cfg(test)]
mod vm_test;


// evm/vm_test.rs
use super::vm::Vm;
#[test]
fn sometest() {

}

To test the addition, we can create a fake binary array (Vec), create a new VM and execute the code. After execution, we can access the state of the VM and verify that the number on the stack corresponds to our expectations.

use super::vm::Vm;
fn create_vm(binary: Vec<u8>) -> Vm {
    Vm { code: binary, pc: 0, stack: Vec::new(), at_end: false}
}

#[test]
fn addition() {
    // 1 + 5
    let binary = vec![0x60, 0x0f, 0x60, 0x01, 0x01, 0x00];
    let mut vm = create_vm(binary); //moved

    // execute three instructions.
    // push 0x0f
    vm.interpret();
    // push 0x01
    vm.interpret();
    // add
    vm.interpret();
    // halt
    vm.interpret();

    // Now make sure the stack size is 1 and contains 16.
    assert_eq!(1, vm.stack.len());
    assert_eq!(16, vm.stack[0].as_u32()); // this is panicking in case of overflow.
}

Find the addition in our compile Ethereum contract

Let’s compile this code:

pragma solidity ^0.4.0;

contract Addition{

	int public x;
    
    function add() public {
        int a = 5;
        int b = 6;
    	x = a + b;
   	}
}

We should see these instructions:

  • Initialize a and b
  • Add a and b in x
  • store x

Looking directly at the compiled output is hard. Solc can generate intermediate code by passing the –asm flag. So let’s run solc --bin-runtime --overwrite --asm --optimize -o . contract.sol and examine the .evm file that was created.

Solc is nice and annotate the .evm file. I can find these lines:

    tag_13:
        /* "contract.sol":170:175  a + b */
      0xb
        /* "contract.sol":119:124  int a */
      0x0
        /* "contract.sol":166:175  x = a + b */
      sstore
        /* "contract.sol":87:182  function add() public {... */
      jump      // out

What happens here?

  • 0xb is a shorthand for “PUSH 0xb” on the stack
  • 0x0 is a shorthand for “PUSH 0x0” on the stack
  • sstore will pop two values from the stack. The first one will be the key, the second one will be the value. It will then store Key=>Value in persistent storage.

This code is just storing the amount 11 at address 0 of the persistent storage. This is optimized code so it detected at compile time that a + b will always be 11. If you remove the unoptimized code, you’ll notice that the output is less pretty:

  tag_13:
        /* "contract.sol":119:124  int a */
      0x0
        /* "contract.sol":138:143  int b */
      dup1
        /* "contract.sol":127:128  5 */
      0x5
        /* "contract.sol":119:128  int a = 5 */
      swap2
      pop
        /* "contract.sol":146:147  6 */
      0x6
        /* "contract.sol":138:147  int b = 6 */
      swap1
      pop
        /* "contract.sol":174:175  b */
      dup1
        /* "contract.sol":170:171  a */
      dup3
        /* "contract.sol":170:175  a + b */
      add
        /* "contract.sol":166:167  x */
      0x0
        /* "contract.sol":166:175  x = a + b */
      dup2
      swap1
      sstore
      pop
        /* "contract.sol":87:182  function add() public {... */
      pop
      pop
      jump      // out

There are a lot of superfluous instructions here. Swap, pop and dup are just stack operations. If we follow the instructions and print the stack, it would look like the following:

[0] // int a
[0, 0] // int b
[0, 0, 5] // 5
[5, 0, 0] // a = 5
[5, 0] // pop
[5, 0, 6] // 6
[5, 6, 0] // b = 6
[5, 6] // pop
[5, 6, 6] // b - dup1
[5, 6, 6, 5] // a = dup3
[5, 6, 11] // a + b
[5, 6, 11, 0]
[5, 6, 11, 0, 11] 
[5, 6, 11, 11, 0]
[5, 6, 11] // store. here it will store 11 at address 0
[5, 6] // pop
[5] // pop
[] // pop

We can see a bit the logic behind the compiler. For example, to initialize an integer, it will first push 0 on the stack. When assigning the value, it will push the value then swap with the 0 that corresponds to this integer. It will then pop the extra 0.

Let’s find the corresponding opcodes for this block of intermediate code. I’ll first add the POP, DUP, SWAP and SSTORE opcodes to my Rust code. Then after compiling and running the debug program, I can find the following:

0xd9    JUMPDES  Destination
0xda    PUSH1   Place 1-byte item on the stack 0x0
0xdc    DUP1    Duplicate 1st stack item.
0xdd    PUSH1   Place 1-byte item on the stack 0x5
0xdf    SWAP2   Swap 1st and 3rd stack items.
0xe0    POP     Remove item from stack
0xe1    PUSH1   Place 1-byte item on the stack 0x6
0xe3    SWAP1   Swap 1st and 2nd stack items.
0xe4    POP     Remove item from stack
0xe5    DUP1    Duplicate 1st stack item.
0xe6    DUP3    Duplicate 3rd stack item.
0xe7    ADD     Addition operation
0xe8    PUSH1   Place 1-byte item on the stack 0x0
0xea    DUP2    Duplicate 2nd stack item.
0xeb    SWAP1   Swap 1st and 2nd stack items.
0xec    SSTORE  Save word to storage
0xed    POP     Remove item from stack
0xee    POP     Remove item from stack
0xef    POP     Remove item from stack
0xf0    JUMP     Jump to destination
0xf1    STOP    Halts execution

This map quite well to the .ecm file ;)

Flow control - If and Loops

At this point, I am not so sure about what happen when compiling an if statement. I assume the opcodes for this feature would be a combinaison of one of the logic operation (LT, GT, …) and JUMP opcode to go to change pc to the correct block of instructions. To be certain, I’ll change my previous solidity program to include a simple if-else statement and I’ll take a look at the diff between the two compiled programs.

pragma solidity ^0.4.0;

contract Addition{

	int public x;
    bool public large;
    function add() public {
        int a = 5;
        int b = 6;
    	x = a + b;

        if (x < 5) {
            x = 5;
        } else {
            x = 7;
        }
   	}
}

Again, this smart contract is completely useless. Anyway, the output of solc gives us this:

      0x0
        /* "contract.sol":154:163  x = a + b */
      dup2
      swap1
      sstore
      pop
        /* "contract.sol":182:183  5 */
      0x5
        /* "contract.sol":178:179  x */
      sload(0x0)
        /* "contract.sol":178:183  x < 5 */
      slt
        /* "contract.sol":174:251  if (x < 5) {... */
      iszero
      tag_15
      jumpi
        /* "contract.sol":203:204  5 */
      0x5
        /* "contract.sol":199:200  x */
      0x0
        /* "contract.sol":199:204  x = 5 */
      dup2
      swap1
      sstore
      pop
        /* "contract.sol":174:251  if (x < 5) {... */
      jump(tag_16)
    tag_15:
        /* "contract.sol":239:240  7 */
      0x7
        /* "contract.sol":235:236  x */
      0x0
        /* "contract.sol":235:240  x = 7 */
      dup2
      swap1
      sstore
      pop
        /* "contract.sol":174:251  if (x < 5) {... */
    tag_16:
        /* "contract.sol":87:257  function add() public {... */
      pop
      pop
      jump      // out

The opcode SLOAD, SLT, ISZERO and JUMPI are used for this IF statement.

  • sload(0x00) will just get the value stored at address 0x00 (it is x!)
  • slt will add 1 to the stack if first value of stack is strictly less than the second value of the stack
  • iszero pop a value and push 1 if the value is zero, 0 otherwise (simple not operator)
  • jumpi will jump to the address stored at second place in the stack in the first value of the stack is not 0.

In our case, the stack will look like (after SLT) [1] // SLT [0] // ISZERO [0, Address of tag 15] // tag_15 [] // just increment pc as usual. No jump here.

Then assign 5 to x. If SLT was false, then JUMPI would have set pc to tag_15 and we would have assigned 7 to x instead.

In opcode, this translates to:

0xfb    PUSH1   Place 1-byte item on the stack 0x5
0xfd    PUSH1   Place 1-byte item on the stack 0x0
0xff    SLOAD   Load word from storage
0x100   SLT     Signed less-than comparison
0x101   ISZERO  Simple not operator.
0x102   PUSH2   Place 2-bytes item on the stack 0x1 0x12
0x105   JUMPI   Conditional jump to destination
0x106   PUSH1   Place 1-byte item on the stack 0x5
0x108   PUSH1   Place 1-byte item on the stack 0x0
0x10a   DUP2    Duplicate 2nd stack item.
0x10b   SWAP1   Swap 1st and 2nd stack items.
0x10c   SSTORE  Save word to storage
0x10d   POP     Remove item from stack
0x10e   PUSH2   Place 2-bytes item on the stack 0x1 0x1b
0x111   JUMP     Jump to destination
0x112   JUMPDES  Destination
0x113   PUSH1   Place 1-byte item on the stack 0x7
0x115   PUSH1   Place 1-byte item on the stack 0x0
0x117   DUP2    Duplicate 2nd stack item.
0x118   SWAP1   Swap 1st and 2nd stack items.
0x119   SSTORE  Save word to storage
0x11a   POP     Remove item from stack
0x11b   JUMPDES  Destination
0x11c   POP     Remove item from stack
0x11d   POP     Remove item from stack
0x11e   JUMP     Jump to destination
0x11f   STOP    Halts execution

Note that the “tag_15” was actually replaced by pushing 2 bytes to the stack (0x11b). When we take a look at line 0x11b, it actually corresponds to a JUMPDEST.

Great! We made a great progression in our understanding of the IF statement. First, we need to set 0 or 1 on the stack depending on what condition we choose. Then, we need to push to address of the “then” block and use JUMPI. JUMPI will set the program counter to the address of the “then” block if the condition is 1. Otherwise, the program execution will continue as normal.

It should be noted that in the compile code, the “then” block in the binary file is located directly after the JUMPI instruction. In order to do so, an instruction ISZERO is added directly after the comparison. Why do it like that? One of the possible reason is code locality. The compiler assumes that the “then” block is executed more often than the “else” block so it makes sense to keep it closer in memory.

Intepreting a IF statement

Now that I have a clear image of how the If-else statement is implemented in the EVM, I’ll continue the Rust interpreter and implement the missing opcodes. The program to interpret will be:

0x60
0x07 // push 7
0x60
0x05 // push 5
0x12 // execute SLE (5 < 7)
0x60
0x0C // push 0x0C
0x57 // JumpI
0x60
0x01 // push 0x01
0xbb // My special instruction
0x00 // Halt.
0x5b // Jumpdest
0x60
0x02 // push 0x02
0xbb

0xbb is Benoit instruction which is my special instruction. It will pop and print an item from the stack. This code should print “1” as 5 < 7.

The opcodes’ implementation is the following:

// in vm.rs:interpret
            Opcode::SLT(_addr) => {
                let lhs = self.stack.pop().unwrap();
                let rhs = self.stack.pop().unwrap();
                if lhs < rhs {
                    self.stack.push(U256::from(0x01));
                } else {
                    self.stack.push(U256::from(0x00));
                }
            },
            Opcode::JUMPI(_addr) => {
                let then_addr = self.stack.pop().unwrap();
                let cond = self.stack.pop().unwrap();
                if !cond.is_zero() {
                    self.pc = then_addr.as_u64() as usize;
                } // else continue to next :)
            }
            Opcode::PRINT(_addr) => {
                // TODO this should be removed in release mode..
                let v = self.stack.pop().unwrap();
                let mut bytes = vec![0;32];
                v.to_big_endian(&mut bytes);
                println!("PRINT\t{:?}|", bytes)

            },

If you run the program, it should display “PRINT” followed by 1. You can also change the operand to see that the else block can be reached.

At this point, it would be worth it to implement some tooling. For example, a step by step debugger would be greatly useful.

Wrapping up

That’s a bunch of words already. We saw how the stack of the EVM was used to perform basic arithmetic and flow operations. We saw how to compile the code and map the solidity code to the evm code. We saw how to add unit tests to validate the behaviour of the VM. Phew!

There are still some missing flow operations that I’d like to explore. In particular, I want to review the loop and the functions. This is going to be the topic of the next part of this series.

Ethereum Virtual Machine in Rust - Part 1: Introduction

Exploring Ethereum Virtual Machine

Welcome to the great series - if everything goes well - about the Ethereum Virtual machine. Lately I have been growing fond of emulation and cryptocurrency so I decided to take a look at the inside of Ethereum: the virtual machine that is executing smart contracts.

And while implementing a fake EVM (Ethereum Virtual Machine) can be a challenge, why not do it in a language I have absolutely no experience in. After these few lines, I realize the probabilities I finish this serie of blog post is meager at best.

Anyway let’s get started. In this first part I’ll talk about compiling a solidity contract to binary file, and creating a small program to read each instruction.

Our Amazing smart contract

Take a look at this beauty.

pragma solidity ^0.4.0;

contract Addition{

	int public x;
    
    function add(int a, int b) public {
    	x = a + b;
   	}
}

It’s not doing much. Anybody can call add which will just store the addition of its argument in the ledger. Then anybody can read the value.

Then, run solc --bin-runtime --optimize -o . contract.sol to output the compiled contract binary code to the current directory.

On my computer, I get:

608060405260043610603e5763ffffffff7c0100000000000000000000000000000000000000000000000000000000600035041663a5f3c23b81146043575b600080fd5b348015604e57600080fd5b50605b600435602435605d565b005b016000555600a165627a7a723058204ff1427599e28990ab2413948c03501a48ab89d18888ac7d0205c12f443424070029

By the way, this is an hexadecimal string. This is going to be important when we read it from Rust (https://solidity.readthedocs.io/en/v0.4.21/using-the-compiler.html)

Reading the binary file

Rust makes it quite easy to read data from a file.

use std::fs::File;
use std::io::prelude::*;
use std::error::Error;

fn run() -> Result<(), Box<Error>> {
    let filename = "myfilename";
    
    let mut f = File::open(filename)?;
    let mut buffer = String::new();
    f.read_to_string(&mut buffer)?;
    
    println!("{}", buffer);
}

fn main() {
    run().unwrap();
}

This is going to print the content of our file. Right now, the whole code is loaded in the string. There might be better way to do it but ultimately the code is always loaded in the RAM of the EVM. Also, the smart contracts tend to be small so it should be no problem to do it like that.

Opcode and Virtual machine execution

A virtual machine, such as the Java Virtual Machine, will take a set of instructions and execute them. in Java the set of instruction is in the Java bytecode (https://en.wikipedia.org/wiki/Java_bytecode). In Ethereum, the set of instruction is the content of the binary file we just read. The instruction set is described in the Ethereum Yellow paper, in the appendix. https://ethereum.github.io/yellowpaper/paper.pdf

For example, the value 0x30 corresponds to the ADDRESS instruction (or opcode) and tells the ethereum VM to get the address of currently executed account. Let’s not worry about all the technical terms for now. Just keep in mind that our binary file is a set of instructions, and each instructions are 8 bits, or 1 byte, long.

Our file content is an hexadecimal string, so we can represent the first bytes as: 0x60 0x80 0x60 0x40 0x52 If you look at the specifications, you can convert to a more readable format:

  • PUSH one byte to stack: 0x80
  • PUSH one byte to stack: 0x40
  • STORE one word to memory

Next step in our program is to convert our string to a list of bytes (u8 in Rust).

use std::num::ParseIntError;

fn decode(s: &str) -> Result<Vec<u8>, ParseIntError> {
    (0..(s.len()-1))
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i+2], 16))
        .collect()
}

It will construct a vector of bytes. Function composition here makes the code very expressive:

  • We get an iterator from 0 to the length of our string (excluded)
  • We get another iterator that will yield the first iterator elements with a step of 2. (Yield 0, 2, 4, 6… and so on)
  • Then, for each index, we apply u8::from_str_radix to the slice [i..i+2] of our input string. This is going to convert the string to a u8 integer, base 16. This can panic in case the string does not represent a valid integer base 16 (‘P0’ would panic)
  • We consume the iterators with collect.

Iterators are lazy in Rust, so we need to call collect at the end to consume them. See here for more details: https://doc.rust-lang.org/book/second-edition/ch13-02-iterators.html

The Result trait implements FromIter, so instead of returning Vec<Result<u8, ParseIntError>>, we can return Result<Vec<u8>, ParseIntError>. See also here https://doc.rust-lang.org/std/result/enum.Result.html#method.from_iter

Now we can print our bytes from the files.


fn run() -> Result<(), Box<Error>> {
    let filename = "myfilename";
    
    let mut f = File::open(filename)?;
    let mut buffer = String::new();
    f.read_to_string(&mut buffer)?;
    
    let bytes = decode(&buffer)?;
    
    for b in &bytes {
        println!("0x{:x}", b) 
    }
    println!("{}", buffer);
}

Our simple EVM

Now that we can read the bytes of our compiled smart contract, let’s start the implementation of the VM. I am just going to show how to print debug information about the instructions in this part. We are going to iterate over the list of instructions and print what they mean. No stack or memory involved here :)

The most basic VM will hold the code in memory, and will iterate through it. We could use range-based loop to do the iteration, but you will see later that it won’t work nicely with our use case. For example, not all instructions are only one byte long. Some, such as PUSH3, are 4 bytes long (one byte for the instruction value, and 3 bytes for the value to push to the stack).

For that reason, I am going to keep an index of the current instruction in the code vector. This index is often called pc, for program counter.


struct Vm {
    code: Vec<u8>, // smart contract code
    pc: usize, // current instruction
}

impl Vm {
    fn new_from_file(filename: &str) -> Result<Vm, Box<Error>> {
        let mut f = File::open(filename)?;
        let mut buffer = String::new();
        f.read_to_string(&mut buffer)?;

        let code = decode(&buffer)?;
        Ok(Vm { code: code, pc: 0})
    }
}

new_from_file will initialize us a new VM. Then, we need a way to model the instruction. I found enumeration in Rust well suited for this as they are very flexible. I also really like the pattern matching with enumerations.

Let’s get started. There are more than an hundred instructions in the EVM instruction set so I’ll just show a few of them.


enum Opcode {

    STOP, // 0x00
    ADD, // 0x01
    MUL, // 0x02
    
    PUSH1(u8), // 0x60
    PUSH2(u8, u8), // 0x61
    
    PUSH32(u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8), // 0x7f 
}

PUSH1(u8) means that the instruction is made of 2 bytes. First byte is the instruction value, second byte is the value to push to the stack. We are going to store this value in the enumeration.

Now that we have our Opcode enumeration, we need to yield opcodes from the list of bytes. This is called decoding. The next function will return the Opcode at the current pc, and then advance pc to the next instruction.

impl Vm {
    fn next(&mut self) -> Option<Opcode> {
        match self.code[self.pc] {
            0x00 => {
                self.pc += 1;
                Some(Opcode::STOP)
            },
            0x01 => {
                self.pc += 1;
                Some(Opcode::ADD)
            },
            0x02 => {
                self.pc += 1;
                Some(Opcode::MUL)
            },
            0x60 => {
                let value = self.code[self.pc+1];
                self.pc += 2;
                Some(Opcode::PUSH1(value))
            },
            Ox61 => {
                let value0 = self.code[self.pc+1];
                let value1 = self.code[self.pc+2];
                self.pc += 3;
                Some(Opcode::PUSH2(value0, value1))
            }
            _ => { self.pc += 1; None}
        }
    }
}

Right now, I ignore the error cases (for example, buffer overflow). Rust will panic if I try to access values with indexes that are larger that the size of the vector. Also, the code is a bit repetitive so there might be a better way to do it.

What next is doing is basic, but at the heart of the instruction decoding:

  • Get the current byte
  • Match it with an opcode
  • If the opcode needs additional data from the instruction vector, extract them
  • Move pc to the next instruction
  • Return Opcode if found

For the PUSH instructions, pc is incremented multiple times.

Peeking inside the code

next will give us the next instruction to execute. For now, I am just going to print it to the console without extra logic. This will be useful in the future to debug our code.

The easiest way is to make our enum derive from Debug. Then we’ll be able to print the enum name with println!.


#[derive(Debug)]
enum Opcode {
...
}

fn run() -> {

    ...
    
    loop {
        match vm.next() {
            Some(x) => println!("{:?}", x),
            None => {}
        }
    }
}

Just by doing so, I have the following output.

ADD PUSH1(0) STOP MUL STOP thread ‘main’ panicked at ‘index out of bounds: the len is 143 but the index is 143’, libcore/slice/mod.rs:2046:10

This is good, but we can do much better. First, just display the enumeration name and content does not give enough information. I’d like a description of the opcode and the address of the opcode in the binary. Second, this panic should not be there, as our program executed as expected.

To avoid the panic, I’ll add a secret Opcode (sshh) that signifies our program ends. We could break on None in the loop but I still want to intepret the whole code even if there are codes I haven’t implemented yet. So let’s add EOF to our Opcode enumeration. In reality, 0x00 will take care of that.

Before the match in next:

 if self.pc >= self.code.len() {
            return Some(Opcode::EOF);
        }

And in the match of the main loop:

    loop {
        match vm.next() {
            Some(Opcode::EOF) => break,
            Some(x) => println!("{:?}", x),
            None => {}
        }
    }

Happy days!

At last, I want to print more information. so let’s add the instruction number and the description for each enum.


#[derive(Debug)]
enum Opcode {
    STOP(usize), // 0x00
    ADD(usize), // 0x01
    MUL(usize), // 0x02

    PUSH1(usize, u8), // 0x60
    PUSH2(usize, u8, u8), // 0x61
    PUSH32(usize, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8), // 0x7f 

    EOF,
}

And next will become:

    fn next(&mut self) -> Option<Opcode> {
        if self.pc >= self.code.len() {
            return Some(Opcode::EOF);
        }

        let addr = self.pc;
        match self.code[addr] {
             0x00 => {
                self.pc += 1;
                Some(Opcode::STOP(addr))
            },
            0x01 => {
                self.pc += 1;
                Some(Opcode::ADD(addr))
            },
            0x02 => {
                self.pc += 1;
                Some(Opcode::MUL(addr))
            },
            0x60 => {
                let value = self.code[self.pc+1];
                self.pc += 2;
                Some(Opcode::PUSH1(addr, value))
            },
            0x61 => {
                let value0 = self.code[self.pc+1];
                let value1 = self.code[self.pc+2];
                self.pc += 3;
                Some(Opcode::PUSH2(addr, value0, value1))
            },
            _ => { self.pc += 1;  None}
        }
    }

Now we can create a function that will describe the opcode.


impl Opcode {
    fn describe(&self) {
        match self {
        Opcode::STOP(line) => println!("0x{:x}\tSTOP\tHalts execution", line),
        Opcode::ADD(line) => println!("0x{:x}\tADD\tAddition operation", line),
        Opcode::MUL(line) => println!("0x{:x}\tMUL\tMultiplication operation", line),
        Opcode::PUSH1(line, x) => println!("0x{:x}\tPUSH1\tPlace 1-byte item on the stack 0x{:x}", line, x),
        Opcode::PUSH2(line, x0, x1) => println!("0x{:x}\tPUSH2\tPlace 2-bytes item on the stack 0x{:x} 0x{:x}", line, x0, x1),
        _ => println!("Unknown opcode")
 
        }
    }
}


// update run function
fn run() -> Result<(), Box<Error>> {

    let filename = "Addition.bin-runtime";
    
    println!("In file {}", filename);

    let mut vm = Vm::new_from_file(&filename)?;

    loop {
        match vm.next() {
            Some(Opcode::EOF) => break,
            Some(x) => x.describe(),
            None => {}
        }
    }

    Ok(())
}

A final word

I showed in this article how to create a tool to display the instructions from an ethereum smart contract binary. It is not complete: the describe and next method have to be populated with all the opcodes in order to be complete.

In the next article I will introduce the memory layout of the EVM. We’ll talk about concepts such as stack, memory and persistent storage.