20 Nov 2018
I read a bunch of sci-fi this year. From light short stories to larger epopea.
One common thing between all those books is that they manage to dream about
all the possibilities the future can offer us:
- spaceships? I want.
- Rogue AI? Amazing.
- Rejuvenation? Downloading your thoughts to your brain? That is the real killer.
Maybe the fact that most of them offer a solution to human death is what really
keep me reading them. Anyway, enough of this introspective nonsense. Here is a
selection of books I really liked in 2018 (I suck at describing books, so
quotes are from goodreads and the comments below are mine).
The murderbot diaries
On a distant planet, a team of scientists are conducting surface tests,
shadowed by their Company-supplied ‘droid — a self-aware SecUnit that has
hacked its own governor module, and refers to itself (though never out loud)
as “Murderbot.” Scornful of humans, all it really wants is to be left
alone long enough to figure out who it is.
Follows a rogue robot, who despite his acerbic comments and infinite cynicism, makes
a nicer human than most.
Bobiverse
Bob Johansson has just sold his software company and is looking forward to a
life of leisure. There are places to go, books to read, and movies to watch.
So it’s a little unfair when he gets himself killed crossing the street.
Bob wakes up a century later to find that corpsicles have been declared to
be without rights, and he is now the property of the state. He has been
uploaded into computer hardware and is slated to be the controlling AI in
an interstellar probe looking for habitable planets.
This book is the ultimate nerd novel. Prepare to hear a lot about Dyson spheres and von Neumann
probe.
Imperial Radch
On a remote, icy planet, the soldier known as Breq is drawing closer to
completing her quest. Once, she was the Justice of Toren - a colossal
starship with an artificial intelligence linking thousands of soldiers
in the service of the Radch, the empire that conquered the galaxy. Now,
an act of treachery has ripped it all away, leaving her with one fragile
human body, unanswered questions, and a burning desire for vengeance.
Again a story about a rogue UI. The pace is a bit strange, with the second book in the series
focusing on the politics of a single space station.
Homo Deus: A brief history of Tomorrow
Homo Deus explores the projects, dreams and nightmares that will shape
the twenty-first century—from overcoming death to creating artificial
life. It asks the fundamental questions: Where do we go from here? And
how will we protect this fragile world from our own destructive powers?
This is the next stage of evolution. This is Homo Deus.
Well, it is not real science fiction. The author presents what he believes will be the major
problems of our society of tomorrow. It is the duty of historian and philosophers to remind
people of the danger of technology he argues, because engineers focus mostly on the creation
of technology and not on the consequences.
House of Suns
Six million years ago, at the very dawn of the starfaring era, Abigail Gentian
fractured herself into a thousand male and female clones: the shatterlings. Sent
out into the galaxy, these shatterlings have stood aloof as they document the rise
and fall of countless human empires. They meet every two hundred thousand years,
to exchange news and memories of their travels with their siblings.
Campion and Purslane are not only late for their thirty-second reunion, but they
have brought along an amnesiac golden robot for a guest. But the wayward shatterlings
get more than the scolding they expect: they face the discovery that someone has a
very serious grudge against the Gentian line, and there is a very real possibility
of traitors in their midst. The surviving shatterlings have to dodge exotic weapons
while they regroup to try to solve the mystery of who is persecuting them,
and why - before their ancient line is wiped out of existence, forever.
I read The Prefect from Reynolds before, but while it was a nice read, it didn’t gave me
the irresistible need to read more of his work. House of Suns however, was full of surprises
and epic mysteries. A must read.
17 Nov 2018
Looping and calling Functions in the EVM!
Last time, we saw how to add two numbers and how to execute a If-Else statement using the stack
in the EVM. This time, I want to continue learning about how the stack works by taking a look
at loops. I’ll follow the same methodology as before:
- First, compile and read the binary from solc
- Implement an interpreter in Rust
Once you’ve built a mental model about how the stack works, reading binary data is not that complicated.
It takes some effort to follow the evolution of the stack but everything is well documented.
Compiling a loop
The smart contract is super simple:
pragma solidity ^0.4.0;
contract TestingStuff {
uint public x;
function add() public {
uint sum = 0;
for (uint i = 0; i < 10; i++) {
sum += i;
}
x = sum;
}
}
Maybe at the end of the serie I’ll try something useful.
By the way, a for loop can be written as a while loop easily.
{
uint i = 0;
while (i < 10) {
sum += i
i++;
}
}
Just by looking at the code and thinking the basic elements of a IF statement, we can guess
how this will be compiled.
- First, we need to initialize i. Push 0 on the stack.
- Then, we need to compare with 0xA.
- Using the JUMPI instruction, we can either execute the loop body or skip it.
- We execute the loop body
- We increase i at the end of the loop body
- we jump back to the condition
The missing block here is the JUMP back.
Indeed, when compiling, there is not so many surprises:
0x91 PUSH1 Place 1-byte item on the stack 0x0
0x93 DUP1 Duplicate 1st stack item.
0x94 PUSH1 Place 1-byte item on the stack 0x0
0x96 SWAP2 Swap 1st and 3rd stack items.
0x97 POP Remove item from stack
0x98 PUSH1 Place 1-byte item on the stack 0x0
0x9a SWAP1 Swap 1st and 2nd stack items.
0x9b POP Remove item from stack
0x9c JUMPDES Destination
0x9d PUSH1 Place 1-byte item on the stack 0xa
0x9f DUP2 Duplicate 2nd stack item.
0xa0 LT Less-than comparison
0xa1 ISZERO Simple not operator.
0xa2 PUSH1 Place 1-byte item on the stack 0xb5
0xa4 JUMPI Conditional jump to destination
0xa5 DUP1 Duplicate 1st stack item.
0xa6 DUP3 Duplicate 3rd stack item.
0xa7 ADD Addition operation
0xa8 SWAP2 Swap 1st and 3rd stack items.
0xa9 POP Remove item from stack
0xaa DUP1 Duplicate 1st stack item.
0xab DUP1 Duplicate 1st stack item.
0xac PUSH1 Place 1-byte item on the stack 0x1
0xae ADD Addition operation
0xaf SWAP2 Swap 1st and 3rd stack items.
0xb0 POP Remove item from stack
0xb1 POP Remove item from stack
0xb2 PUSH1 Place 1-byte item on the stack 0x9c
0xb4 JUMP Jump to destination
0xb5 JUMPDES Destination
0xb6 DUP2 Duplicate 2nd stack item.
0xb7 PUSH1 Place 1-byte item on the stack 0x0
0xb9 DUP2 Duplicate 2nd stack item.
0xba SWAP1 Swap 1st and 2nd stack items.
0xbb SSTORE Save word to storage
- 0x91 to 0x9b:
i
is initialized.
- 0x9c: Set an anchor for a jump.
- 0x9d to 0xA2: This is a condition statement, similar to if.
- 0xa2: Address for after the loop
- 0xa4: JUMPI which will check if we need to loop again
- 0xa5 to 0xb1: Body of the loop
- 0xb2: Push the loop condition address to the stack
- 0xb4: Jump to the address located in the first item of the stack
- 0xb5 to 0xbb: After the loop.
Testing in the VM
The only missing part in our Rust VM is the implementation of JUMP. This is even more simple
than JUMPI as we just need to pop the next value of pc from the stack.
Opcode::JUMP(_addr) => {
let jump_location = self.stack.pop().unwrap();
self.pc = jump_location.as_u64() as usize;
},
Creating binary code by hand is beginning to be a bit of work so I’ll just implement some missing opcodes such as DUP, SWAP and ISZERO and I’ll use the code compiled by solc after
cleaning it up a bit (need to replace the JUMP address by the real ones).
5b60008060009150600090505b600a81101560255780820191508080600101915050600c565b81600081905550505000
Interlude
Code begins to be more and more complex so I took some time to create a step by step debugger.
First, add this to the Vm implementation. It can be completed later with more debugging statements.
// see part 2 for print_stack
pub fn print_stack(&self) {
self.stack
.iter()
.enumerate()
.rev()
.for_each(|(i, x)| {
let mut bytes = vec![0;32];
x.to_big_endian(&mut bytes);
println!("|{}:\t{:?}|", i, bytes)
});
}
pub fn print_debug(&self) {
println!("pc:{}\n", self.pc);
println!("Stack:");
self.print_stack();
}
Then, main.rs will be modified to execute this:
use std::io;
fn debug(vm: &mut Vm) {
loop {
if vm.at_end {
break;
}
// Debugger.
// c to continue
// s to print stack
// q to quit
let mut input = String::new();
io::stdin().read_line(&mut input).ok().expect("Couldn't read line");
match &*input {
"c\n" => vm.interpret(),
"s\n" => vm.print_debug(),
"q\n" => break,
_ => println!("Please type either c, s or q"),
}
}
}
Using this function, I can execute the bytecode step by step and take a look at the current VM
state.
Running it!
Mmh, there are a few surprises here (I forgot to implement POP and had a bug in my JUMPI).
When decomposed, a for loop is quite simple to implement using the EVM opcodes.
Pure functions
Just taking a look at for loops makes a light blog post so I’ll also talk about pure functions.
According to solidity documentation,
Functions can be declared pure, in which case they promise not to read from or modify the state.
Looks like a simple first step into understand functions in the compiled code.
pragma solidity ^0.4.0;
contract TestingStuff {
uint public x;
function f(uint a, uint b) public pure returns (uint) {
return a * (b + 42);
}
function add() public {
x = f(1, 4);
}
}
What could go wrong.
A quick peak at the compile code
What I want to find in the compile code is:
- How to store the parameters of a function
- How to call the function
- How to return a value
So let’s try to find that out.
0xef JUMPDES Destination
0xf0 PUSH1 Place 1-byte item on the stack 0x0
0xf2 PUSH1 Place 1-byte item on the stack 0x2a
0xf4 DUP3 Duplicate 3rd stack item.
0xf5 ADD Addition operation
0xf6 DUP4 Duplicate 4th stack item.
0xf7 MUL Multiplication operation
0xf8 SWAP1 Swap 1st and 2nd stack items.
0xf9 POP Remove item from stack
0xfa SWAP3 Swap 1st and 4th stack items.
0xfb SWAP2 Swap 1st and 3rd stack items.
0xfc POP Remove item from stack
0xfd POP Remove item from stack
0xfe JUMP Jump to destination
0xff JUMPDES Destination
0x100 PUSH2 Place 2-bytes item on the stack 0x1 0xb
0x103 PUSH1 Place 1-byte item on the stack 0x1
0x105 PUSH1 Place 1-byte item on the stack 0x4
0x107 PUSH2 Place 2-bytes item on the stack 0x0 0xef
0x10a JUMP Jump to destination
0x10b JUMPDES Destination
0x10c PUSH1 Place 1-byte item on the stack 0x0
0x10e DUP2 Duplicate 2nd stack item.
0x10f SWAP1 Swap 1st and 2nd stack items.
0x110 SSTORE Save word to storage
0x111 POP Remove item from stack
0x112 JUMP Jump to destination
0x113 STOP Halts execution
No biggies here. The function starts at address 0x100. Let’s take a look at the stack during the execution.
[0x10b]
[0x10b, 0x01]
[0x10b, 0x01, 0x04]
[0x10b, 0x01, 0x04, 0xef]
This is the stack before 0x10a (JUMP). The address of where to continue after calling the function is
first put on the stack. Then the two function parameters are added, then the address of the code
for the function is pushed on the stack so that we can call jump.
[0x10b, 0x01, 0x04] // Call jump
At this point, we are at address 0xef
[0x10b, 0x01, 0x04, 0x0]
[0x10b, 0x01, 0x04, 0x0, 0x2a]
[0x10b, 0x01, 0x04, 0x0, 0x2a, 0x4]
[0x10b, 0x01, 0x04, 0x0, 0x2e]
[0x10b, 0x01, 0x04, 0x0, 0x2e, 0x01]
[0x10b, 0x01, 0x04, 0x0, 0x2e]
[0x10b, 0x01, 0x04, 0x2e, 0x0]
[0x10b, 0x01, 0x04, 0x2e]
[0x2e, 0x01, 0x04, 0x10b] // swap 1st and 4th
[0x2e, 0x10b, 0x04, 0x01] // swap 1st and 3rd
[0x2e, 0x10b, 0x04]
[0x2e, 0x10b]
A lot of swap, dup is going on but essentially the calculation is made. Then, by swapping element and popping
the intermediate variables, the stack end up with the return value in second position and the address of where
to return in the first position.
After jumping, it will look like:
[0x2e]
[0x2e, 0x0]
[0x2e, 0x0, 0x2e]
[0x2e, 0x2e, 0x0]
Then, the store instruction is executed as expected.
The funny thing here is that I do not have to add anything to my Rust code in order to make it
work for these kind of functions. A pure function is just a clever combinaison of JUMPs, PUSH and
SWAP:
- Return location is pushed
- Arguments are pushed in order
- Function location is pushed then we jump to it
- Body of the function is executed
- Return location is put at the top of the stack, return value is right next behind
- We jump to the return location. Value at the top of the stack is the function return value.
Easy peasy.
15 Nov 2018
Chip 8 Emulator
Nostalgia is big in the gaming scene so it is not a wonder that emulators
of all sort would see the light. Gameboy, PS1, Nintendo 64… How did they
work exactly was a bit a mystery to me before. When I found this article
http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/
, I couldn’t help but coding my own Chip8 emulator.
I won’t describe the code here as the original article does already a great
job. However, there are a few points that I learnt and will definitively
apply to my next emulator project (Gameboy/NES of course ! I need to play Zelda).
DRY
Emulation involves a lot of repetitive work. Opcodes sometimes are really similar
(add constant to register, add two registers together). Also, the program counter
needs to be increased most of the time after executing an opcode. All the common operations (fetching information from the opcode, increasing the opcode) should be factored in common functions in order to avoid bugs when writing the same thing a lot of times. I know, this is coding 101 but it does not hurt to repeat especially in that case where we can find a lot of code duplication.
Unit test that thing
Not everything can be testing in emulation. For example, graphics and sound might be
tricky. Testing the opcodes individually however is easy and should be done to validate the correct
behaviour.
From my emulator:
TEST(opcode, op_2nnn) {
Chip8FreeAccess chip8;
// 0x2204 - execute subroutine at index 204.
std::vector<uint8_t> source {0x22, 0x04, 0xA2, 0xF0, 0xA2, 0xFF};
chip8.load_from_buffer(source);
chip8.emulateCycle();
// stack should be one. pc would be stored as 0x200. Current pc is 0x204.
ASSERT_EQ(1, chip8.sp());
ASSERT_EQ(0x204, chip8.pc());
ASSERT_EQ(0x200, chip8.stacks()[0]);
// to confirm execute the next cycle
chip8.emulateCycle();
auto i = chip8.I();
ASSERT_EQ(0x2FF, i);
}
This is testing that the subroutine opcode does the correct thing. I have similar tests
for most opcodes (and should have for all of them…).
Testing all combination of opcodes is not possible but fortunately the emulation scene provides
some ROM which are only used for testing emulators.
Using gdb is great to debug programs but I also created some debugging tools to dump
info in the emulator window at the same time as the game is running. If the code is
stuck because of a bug, I can see immediately on my screen.
Also, I have some code to only print the opcodes in a binary file, not execute them. That
way I can take my time to look at the code without running the game.
Especially when reverse-engineering (see Ethereum articles), taking a good look at
the binary format saves a LOT of time so implementing tooling will be something I’ll
do again next time.
I put my emulator on Github. It should work with most of the games in the repo although I didn’t test it extensively.
My next project is a NES emulator, which is a tad more complication but I cannot wait to start. Cheers!
14 Nov 2018
Memory model of Ethereum - The stack
Welcome to the second post of my Ethereum VM series. Part 1 was dedicated to the instruction set
of the Ethereum VM and showed a simple way of printing instructions from a smart contract
binary file. Now, I’ll talked more about the memory model of the VM and how to understand the
code compiled by solc.
I’ll try to answer those questions:
- How are additions, multiplications and subtractions implemented?
- How to find the compiled instructions for a block of solidity code?
- How are IF condition implemented?
- Where is the smart contract data stored?
There is a lot to cover here. In this post, I’ll show what kind of operations are possible using the stack. There is a lot to cover so maybe this post will be split in two…
VM specifications
The VM is described in the yellow paper, part 9: Yellow paper
The word size of the machine (and thus size of stack items0 is 256-bits.
32 bytes word size is quite large. Yellow paper mentioned it is to facilitate the Keccak-256 hash scheme and elliptic-curve computations. Hum. Let’s see later if we can find out what they mean.
The EVM is a simple stack-based architecture.
Stack-based architectures carry out their operation by pushing and popping items on a stack. A
stack is just a LIFO (last in, first out) container. For example, a + b can be done by:
- pushing a on the stack
- pushing b on the stack
- pop a and pop b
- push a+b on the stack
The memory model is a simple word-addressed byte array.
Memory is a random-access byte array. We read a word at a time (32 bytes).
Unlike memory which is volatile, storage is non volatile and is maintained as part of the system state.
This is where the fun happen. The data is persisted between transactions in this storage. It
is stored under the contract’s account.
Fun with the stack
According to Rust official documentation, the best way to model a stack is just to use a vector.
Vec has push
and pop
operations so it fits well with our use case. Unfortunately, the size
of the stack items is 256 bits, which is not standard in Rust so I needed to use an external
package (I don’t really want to implement 256-bits arithmetic myself…).
The bigint package looks pretty complete: BigInt documentation
We’ll have to be careful with endianess here… (Ah, can’t avoid this problem forever)
Starting from part 1 implementation, our VM will become:
struct Vm {
code: Vec<u8>, // This is the smart contract code.
pc: usize,
stack: Vec<U256>,
}
And the factory function:
fn new_from_file(filename: &str) -> Result<Vm, Box<Error>> {
let mut f = File::open(filename)?;
let mut buffer = String::new();
f.read_to_string(&mut buffer)?;
let code = decode(&buffer)?;
Ok(Vm { code: code, pc: 0, stack: Vec::new()})
}
Additions
As mentioned above, an addition implemented with a stack can be with the following operations:
- PUSH left operand
- PUSH right operand
- POP 2 items from stack and add them
- PUSH result
It can be easily translated to Ethereum instructions.
- PUSH1 left
- PUSH1 right
- ADD
For example, if we want to add 2 and 5, the binary code would be:
0x60 0x02 0x60 0x05 0x01
PUSH 2 PUSH 5 ADD
Let’s save the string “6002600501” to a file and use it for our testing.
I’ll create an interpret
function that will read the next instruction and apply it to
the VM.
fn interpret(&mut self) {
let maybe_op = self.next();
// just for debugging
match &maybe_op {
Some(x) => x.describe(),
None => {}
}
// The real execution
match &maybe_op {
Some(x) => {
match x {
Opcode::PUSH1(addr, value) => {
// Value is u8, we need to push a u256 on the stack...
self.stack.push(U256::from(*value));
},
Opcode::ADD(addr) => {
// How to recover nicely? There is no meaning in recovering nicely here.
// Just burn and crash if cannot pop.
let v1 = self.stack.pop().unwrap();
let v2 = self.stack.pop().unwrap();
self.stack.push(v1 + v2);
},
_ => {
// not implemented, just chill
}
}
},
None => {}
}
}
For pushing a u8 integer on the stack, we can just convert it to U256 and use the built-in
push method of vector. Popping is also built-in, but it returns an Option. If the option is None at this point, there is NO way to recover gracefully as something must have gone wrong
with the compilation. Unwrap looks like a correct choice in that case.
By the way, I am printing the instruction description before execution. It’s always a good idea
to add this kind of debug print during development. It is easier to find out when something goes
wrong. In the same spirit, I’ll add a function to print the stack.
fn print_stack(&self) {
self.stack
.iter()
.enumerate()
.rev()
.for_each(|(i, x)| {
let mut bytes = vec![0;32];
x.to_big_endian(&mut bytes);
println!("|{}:\t{:?}|", i, bytes)
});
}
Here again, chaining iterator methods really makes my day :).
My main.rs is growing quite a lot so I’ll separate the code in three files:
- main.rs: Entry point. Can either print the instructions or execute the code
- evm/vm.rs: VM code and execution
- evm/opcode.rs: Opcode enumeration and debug print function
Main file will look like that:
mod evm;
use evm::vm::Vm;
use evm::opcode::Opcode;
use std::error::Error;
use std::env;
fn debug(vm: &mut Vm) {
loop {
match vm.next() {
Opcode::EOF => break,
x => x.describe(),
}
}
}
fn interpret(vm: &mut Vm) {
while !vm.at_end {
vm.interpret();
}
vm.print_stack();
}
fn run() -> Result<(), Box<Error>> {
let args: Vec<String> = env::args().collect();
let function = args[1].clone();
let filename = args[2].clone();
println!("In file {}", filename);
let mut vm = Vm::new_from_file(&filename)?;
println!("Correctly loaded VM");
match &*function {
"debug" => debug(&mut vm),
"run" => interpret(&mut vm),
_ => panic!("Expect either 'debug' or 'run' for first parameter")
}
Ok(())
}
fn main() {
run().unwrap();
}
Also, I did a bit of refactoring for printing the Opcode. Instead of returning Option<Opcode>
from the next
method, I’ll return Opcode
directly. I also add a new Opcode for debugging
called UNKNOWN which contains the opcode that is not implemented yet.
A word on unit testing
Previously I was just struggling to get my program working. Now that I am a bit more confident
in Rust (or is it an illusion), it is time to add some unit testing.
The convention in Rust is to put unit tests for a module into a submodule. Basically, it comes to adding the following at the end of each file.
#[cfg(test)]
mod test {
// unit tests here
}
For my vm code, I expect the code to grow quite a lot considering the number of opcode to implement. I’ll move the tests to another file. That was actually more annoying to do than I initially though. If you have a better way to do this, please add a comment.
The Vm fields need to be public. The test file is in evm/vm_test.rs and the module is added
in evm/mod.rs.
// evm/mod.rs
#[cfg(test)]
mod vm_test;
// evm/vm_test.rs
use super::vm::Vm;
#[test]
fn sometest() {
}
To test the addition, we can create a fake binary array (Vec), create a new VM and execute
the code. After execution, we can access the state of the VM and verify that the number on the
stack corresponds to our expectations.
use super::vm::Vm;
fn create_vm(binary: Vec<u8>) -> Vm {
Vm { code: binary, pc: 0, stack: Vec::new(), at_end: false}
}
#[test]
fn addition() {
// 1 + 5
let binary = vec![0x60, 0x0f, 0x60, 0x01, 0x01, 0x00];
let mut vm = create_vm(binary); //moved
// execute three instructions.
// push 0x0f
vm.interpret();
// push 0x01
vm.interpret();
// add
vm.interpret();
// halt
vm.interpret();
// Now make sure the stack size is 1 and contains 16.
assert_eq!(1, vm.stack.len());
assert_eq!(16, vm.stack[0].as_u32()); // this is panicking in case of overflow.
}
Find the addition in our compile Ethereum contract
Let’s compile this code:
pragma solidity ^0.4.0;
contract Addition{
int public x;
function add() public {
int a = 5;
int b = 6;
x = a + b;
}
}
We should see these instructions:
- Initialize a and b
- Add a and b in x
- store x
Looking directly at the compiled output is hard. Solc can generate intermediate code by passing
the –asm flag. So let’s run solc --bin-runtime --overwrite --asm --optimize -o . contract.sol
and examine the .evm file that was created.
Solc is nice and annotate the .evm file. I can find these lines:
tag_13:
/* "contract.sol":170:175 a + b */
0xb
/* "contract.sol":119:124 int a */
0x0
/* "contract.sol":166:175 x = a + b */
sstore
/* "contract.sol":87:182 function add() public {... */
jump // out
What happens here?
- 0xb is a shorthand for “PUSH 0xb” on the stack
- 0x0 is a shorthand for “PUSH 0x0” on the stack
- sstore will pop two values from the stack. The first one will be the key, the second one will be the value. It will then store Key=>Value in persistent storage.
This code is just storing the amount 11 at address 0 of the persistent storage. This is optimized
code so it detected at compile time that a + b
will always be 11. If you remove the unoptimized
code, you’ll notice that the output is less pretty:
tag_13:
/* "contract.sol":119:124 int a */
0x0
/* "contract.sol":138:143 int b */
dup1
/* "contract.sol":127:128 5 */
0x5
/* "contract.sol":119:128 int a = 5 */
swap2
pop
/* "contract.sol":146:147 6 */
0x6
/* "contract.sol":138:147 int b = 6 */
swap1
pop
/* "contract.sol":174:175 b */
dup1
/* "contract.sol":170:171 a */
dup3
/* "contract.sol":170:175 a + b */
add
/* "contract.sol":166:167 x */
0x0
/* "contract.sol":166:175 x = a + b */
dup2
swap1
sstore
pop
/* "contract.sol":87:182 function add() public {... */
pop
pop
jump // out
There are a lot of superfluous instructions here. Swap, pop and dup are just stack operations.
If we follow the instructions and print the stack, it would look like the following:
[0] // int a
[0, 0] // int b
[0, 0, 5] // 5
[5, 0, 0] // a = 5
[5, 0] // pop
[5, 0, 6] // 6
[5, 6, 0] // b = 6
[5, 6] // pop
[5, 6, 6] // b - dup1
[5, 6, 6, 5] // a = dup3
[5, 6, 11] // a + b
[5, 6, 11, 0]
[5, 6, 11, 0, 11]
[5, 6, 11, 11, 0]
[5, 6, 11] // store. here it will store 11 at address 0
[5, 6] // pop
[5] // pop
[] // pop
We can see a bit the logic behind the compiler. For example, to initialize an integer, it will
first push 0 on the stack. When assigning the value, it will push the value then swap with the
0 that corresponds to this integer. It will then pop the extra 0.
Let’s find the corresponding opcodes for this block of intermediate code. I’ll first add the POP, DUP, SWAP and SSTORE opcodes to my Rust code. Then after compiling and running the debug
program, I can find the following:
0xd9 JUMPDES Destination
0xda PUSH1 Place 1-byte item on the stack 0x0
0xdc DUP1 Duplicate 1st stack item.
0xdd PUSH1 Place 1-byte item on the stack 0x5
0xdf SWAP2 Swap 1st and 3rd stack items.
0xe0 POP Remove item from stack
0xe1 PUSH1 Place 1-byte item on the stack 0x6
0xe3 SWAP1 Swap 1st and 2nd stack items.
0xe4 POP Remove item from stack
0xe5 DUP1 Duplicate 1st stack item.
0xe6 DUP3 Duplicate 3rd stack item.
0xe7 ADD Addition operation
0xe8 PUSH1 Place 1-byte item on the stack 0x0
0xea DUP2 Duplicate 2nd stack item.
0xeb SWAP1 Swap 1st and 2nd stack items.
0xec SSTORE Save word to storage
0xed POP Remove item from stack
0xee POP Remove item from stack
0xef POP Remove item from stack
0xf0 JUMP Jump to destination
0xf1 STOP Halts execution
This map quite well to the .ecm file ;)
Flow control - If and Loops
At this point, I am not so sure about what happen when compiling an if statement. I assume the
opcodes for this feature would be a combinaison of one of the logic operation (LT, GT, …) and
JUMP opcode to go to change pc to the correct block of instructions. To be certain, I’ll change
my previous solidity program to include a simple if-else statement and I’ll take a look at the
diff between the two compiled programs.
pragma solidity ^0.4.0;
contract Addition{
int public x;
bool public large;
function add() public {
int a = 5;
int b = 6;
x = a + b;
if (x < 5) {
x = 5;
} else {
x = 7;
}
}
}
Again, this smart contract is completely useless. Anyway, the output of solc gives us this:
0x0
/* "contract.sol":154:163 x = a + b */
dup2
swap1
sstore
pop
/* "contract.sol":182:183 5 */
0x5
/* "contract.sol":178:179 x */
sload(0x0)
/* "contract.sol":178:183 x < 5 */
slt
/* "contract.sol":174:251 if (x < 5) {... */
iszero
tag_15
jumpi
/* "contract.sol":203:204 5 */
0x5
/* "contract.sol":199:200 x */
0x0
/* "contract.sol":199:204 x = 5 */
dup2
swap1
sstore
pop
/* "contract.sol":174:251 if (x < 5) {... */
jump(tag_16)
tag_15:
/* "contract.sol":239:240 7 */
0x7
/* "contract.sol":235:236 x */
0x0
/* "contract.sol":235:240 x = 7 */
dup2
swap1
sstore
pop
/* "contract.sol":174:251 if (x < 5) {... */
tag_16:
/* "contract.sol":87:257 function add() public {... */
pop
pop
jump // out
The opcode SLOAD, SLT, ISZERO and JUMPI are used for this IF statement.
- sload(0x00) will just get the value stored at address 0x00 (it is x!)
- slt will add 1 to the stack if first value of stack is strictly less than the second value of the stack
- iszero pop a value and push 1 if the value is zero, 0 otherwise (simple not operator)
- jumpi will jump to the address stored at second place in the stack in the first value of the stack is not 0.
In our case, the stack will look like (after SLT)
[1] // SLT
[0] // ISZERO
[0, Address of tag 15] // tag_15
[] // just increment pc as usual. No jump here.
Then assign 5 to x. If SLT was false, then JUMPI would have set pc to tag_15 and we would have
assigned 7 to x instead.
In opcode, this translates to:
0xfb PUSH1 Place 1-byte item on the stack 0x5
0xfd PUSH1 Place 1-byte item on the stack 0x0
0xff SLOAD Load word from storage
0x100 SLT Signed less-than comparison
0x101 ISZERO Simple not operator.
0x102 PUSH2 Place 2-bytes item on the stack 0x1 0x12
0x105 JUMPI Conditional jump to destination
0x106 PUSH1 Place 1-byte item on the stack 0x5
0x108 PUSH1 Place 1-byte item on the stack 0x0
0x10a DUP2 Duplicate 2nd stack item.
0x10b SWAP1 Swap 1st and 2nd stack items.
0x10c SSTORE Save word to storage
0x10d POP Remove item from stack
0x10e PUSH2 Place 2-bytes item on the stack 0x1 0x1b
0x111 JUMP Jump to destination
0x112 JUMPDES Destination
0x113 PUSH1 Place 1-byte item on the stack 0x7
0x115 PUSH1 Place 1-byte item on the stack 0x0
0x117 DUP2 Duplicate 2nd stack item.
0x118 SWAP1 Swap 1st and 2nd stack items.
0x119 SSTORE Save word to storage
0x11a POP Remove item from stack
0x11b JUMPDES Destination
0x11c POP Remove item from stack
0x11d POP Remove item from stack
0x11e JUMP Jump to destination
0x11f STOP Halts execution
Note that the “tag_15” was actually replaced by pushing 2 bytes to the stack (0x11b). When we
take a look at line 0x11b, it actually corresponds to a JUMPDEST.
Great! We made a great progression in our understanding of the IF statement. First, we need to
set 0 or 1 on the stack depending on what condition we choose. Then, we need to push to address of the “then” block and use JUMPI. JUMPI will set the program counter to the address of the “then” block if the condition is 1. Otherwise, the program execution will continue as normal.
It should be noted that in the compile code, the “then” block in the binary file is located directly after the JUMPI instruction. In order to do so, an instruction ISZERO is added directly
after the comparison. Why do it like that? One of the possible reason is code locality. The
compiler assumes that the “then” block is executed more often than the “else” block so it makes
sense to keep it closer in memory.
Intepreting a IF statement
Now that I have a clear image of how the If-else statement is implemented in the EVM, I’ll continue the Rust interpreter and implement the missing opcodes. The program to interpret will
be:
0x60
0x07 // push 7
0x60
0x05 // push 5
0x12 // execute SLE (5 < 7)
0x60
0x0C // push 0x0C
0x57 // JumpI
0x60
0x01 // push 0x01
0xbb // My special instruction
0x00 // Halt.
0x5b // Jumpdest
0x60
0x02 // push 0x02
0xbb
0xbb is Benoit instruction which is my special instruction. It will pop and print an item from the stack. This code should print “1” as 5 < 7.
The opcodes’ implementation is the following:
// in vm.rs:interpret
Opcode::SLT(_addr) => {
let lhs = self.stack.pop().unwrap();
let rhs = self.stack.pop().unwrap();
if lhs < rhs {
self.stack.push(U256::from(0x01));
} else {
self.stack.push(U256::from(0x00));
}
},
Opcode::JUMPI(_addr) => {
let then_addr = self.stack.pop().unwrap();
let cond = self.stack.pop().unwrap();
if !cond.is_zero() {
self.pc = then_addr.as_u64() as usize;
} // else continue to next :)
}
Opcode::PRINT(_addr) => {
// TODO this should be removed in release mode..
let v = self.stack.pop().unwrap();
let mut bytes = vec![0;32];
v.to_big_endian(&mut bytes);
println!("PRINT\t{:?}|", bytes)
},
If you run the program, it should display “PRINT” followed by 1. You can also change the operand
to see that the else block can be reached.
At this point, it would be worth it to implement some tooling. For example, a step by step debugger would be greatly useful.
Wrapping up
That’s a bunch of words already. We saw how the stack of the EVM was used to perform basic arithmetic and flow operations. We saw how to compile the code and map the solidity code to the
evm code. We saw how to add unit tests to validate the behaviour of the VM. Phew!
There are still some missing flow operations that I’d like to explore. In particular, I want to
review the loop and the functions. This is going to be the topic of the next part of this series.
09 Nov 2018
Exploring Ethereum Virtual Machine
Welcome to the great series - if everything goes well - about the Ethereum Virtual machine.
Lately I have been growing fond of emulation and cryptocurrency so I decided to take a look
at the inside of Ethereum: the virtual machine that is executing smart contracts.
And while implementing a fake EVM (Ethereum Virtual Machine) can be a challenge, why not do
it in a language I have absolutely no experience in. After these few lines, I realize the
probabilities I finish this serie of blog post is meager at best.
Anyway let’s get started. In this first part I’ll talk about compiling a solidity contract
to binary file, and creating a small program to read each instruction.
Our Amazing smart contract
Take a look at this beauty.
pragma solidity ^0.4.0;
contract Addition{
int public x;
function add(int a, int b) public {
x = a + b;
}
}
It’s not doing much. Anybody can call add
which will just store the addition of its
argument in the ledger. Then anybody can read the value.
Then, run solc --bin-runtime --optimize -o . contract.sol
to output the compiled contract binary code to the current directory.
On my computer, I get:
608060405260043610603e5763ffffffff7c0100000000000000000000000000000000000000000000000000000000600035041663a5f3c23b81146043575b600080fd5b348015604e57600080fd5b50605b600435602435605d565b005b016000555600a165627a7a723058204ff1427599e28990ab2413948c03501a48ab89d18888ac7d0205c12f443424070029
By the way, this is an hexadecimal string. This is going to be important when we read it from Rust (https://solidity.readthedocs.io/en/v0.4.21/using-the-compiler.html)
Reading the binary file
Rust makes it quite easy to read data from a file.
use std::fs::File;
use std::io::prelude::*;
use std::error::Error;
fn run() -> Result<(), Box<Error>> {
let filename = "myfilename";
let mut f = File::open(filename)?;
let mut buffer = String::new();
f.read_to_string(&mut buffer)?;
println!("{}", buffer);
}
fn main() {
run().unwrap();
}
This is going to print the content of our file. Right now, the whole code is loaded in the
string. There might be better way to do it but ultimately the code is always loaded in the RAM
of the EVM. Also, the smart contracts tend to be small so it should be no problem to do it like
that.
Opcode and Virtual machine execution
A virtual machine, such as the Java Virtual Machine, will take a set of instructions and execute
them. in Java the set of instruction is in the Java bytecode (https://en.wikipedia.org/wiki/Java_bytecode). In Ethereum, the set of instruction is the content of the binary file we just read. The instruction set is described in the Ethereum Yellow paper, in the appendix. https://ethereum.github.io/yellowpaper/paper.pdf
For example, the value 0x30 corresponds to the ADDRESS instruction (or opcode) and tells the
ethereum VM to get the address of currently executed account. Let’s not worry about all the technical terms for now. Just keep in mind that our binary file is a set of instructions, and
each instructions are 8 bits, or 1 byte, long.
Our file content is an hexadecimal string, so we can represent the first bytes as:
0x60 0x80 0x60 0x40 0x52
If you look at the specifications, you can convert to a more readable format:
- PUSH one byte to stack: 0x80
- PUSH one byte to stack: 0x40
- STORE one word to memory
Next step in our program is to convert our string to a list of bytes (u8
in Rust).
use std::num::ParseIntError;
fn decode(s: &str) -> Result<Vec<u8>, ParseIntError> {
(0..(s.len()-1))
.step_by(2)
.map(|i| u8::from_str_radix(&s[i..i+2], 16))
.collect()
}
It will construct a vector of bytes. Function composition here makes the code very expressive:
- We get an iterator from 0 to the length of our string (excluded)
- We get another iterator that will yield the first iterator elements with a step of 2. (Yield 0, 2, 4, 6… and so on)
- Then, for each index, we apply
u8::from_str_radix
to the slice [i..i+2] of our input string. This is going to convert the string to a u8
integer, base 16. This can panic in case the string does not represent a valid integer base 16 (‘P0’ would panic)
- We consume the iterators with collect.
Iterators are lazy in Rust, so we need to call collect at the end to consume them. See here for more details: https://doc.rust-lang.org/book/second-edition/ch13-02-iterators.html
The Result trait implements FromIter, so instead of returning Vec<Result<u8, ParseIntError>>
, we can return Result<Vec<u8>, ParseIntError>
. See also here https://doc.rust-lang.org/std/result/enum.Result.html#method.from_iter
Now we can print our bytes from the files.
fn run() -> Result<(), Box<Error>> {
let filename = "myfilename";
let mut f = File::open(filename)?;
let mut buffer = String::new();
f.read_to_string(&mut buffer)?;
let bytes = decode(&buffer)?;
for b in &bytes {
println!("0x{:x}", b)
}
println!("{}", buffer);
}
Our simple EVM
Now that we can read the bytes of our compiled smart contract, let’s start the implementation
of the VM. I am just going to show how to print debug information about the instructions in this
part. We are going to iterate over the list of instructions and print what they mean. No stack
or memory involved here :)
The most basic VM will hold the code in memory, and will iterate through it. We could use range-based loop to do the iteration, but you will see later that it won’t work nicely with our use case.
For example, not all instructions are only one byte long. Some, such as PUSH3, are 4 bytes long (one byte for the instruction value, and 3 bytes for the value to push to the stack).
For that reason, I am going to keep an index of the current instruction in the code vector. This index is often called pc, for program counter.
struct Vm {
code: Vec<u8>, // smart contract code
pc: usize, // current instruction
}
impl Vm {
fn new_from_file(filename: &str) -> Result<Vm, Box<Error>> {
let mut f = File::open(filename)?;
let mut buffer = String::new();
f.read_to_string(&mut buffer)?;
let code = decode(&buffer)?;
Ok(Vm { code: code, pc: 0})
}
}
new_from_file
will initialize us a new VM. Then, we need a way to model the instruction. I found enumeration in Rust well suited for this as they are very flexible. I also really like
the pattern matching with enumerations.
Let’s get started. There are more than an hundred instructions in the EVM instruction set so
I’ll just show a few of them.
enum Opcode {
STOP, // 0x00
ADD, // 0x01
MUL, // 0x02
PUSH1(u8), // 0x60
PUSH2(u8, u8), // 0x61
PUSH32(u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8), // 0x7f
}
PUSH1(u8)
means that the instruction is made of 2 bytes. First byte is the instruction value,
second byte is the value to push to the stack. We are going to store this value in the enumeration.
Now that we have our Opcode enumeration, we need to yield opcodes from the list of bytes. This is called decoding. The next
function will return the Opcode at the current pc, and then advance pc to the next instruction.
impl Vm {
fn next(&mut self) -> Option<Opcode> {
match self.code[self.pc] {
0x00 => {
self.pc += 1;
Some(Opcode::STOP)
},
0x01 => {
self.pc += 1;
Some(Opcode::ADD)
},
0x02 => {
self.pc += 1;
Some(Opcode::MUL)
},
0x60 => {
let value = self.code[self.pc+1];
self.pc += 2;
Some(Opcode::PUSH1(value))
},
Ox61 => {
let value0 = self.code[self.pc+1];
let value1 = self.code[self.pc+2];
self.pc += 3;
Some(Opcode::PUSH2(value0, value1))
}
_ => { self.pc += 1; None}
}
}
}
Right now, I ignore the error cases (for example, buffer overflow). Rust will panic if I try to
access values with indexes that are larger that the size of the vector. Also, the code is a bit
repetitive so there might be a better way to do it.
What next
is doing is basic, but at the heart of the instruction decoding:
- Get the current byte
- Match it with an opcode
- If the opcode needs additional data from the instruction vector, extract them
- Move pc to the next instruction
- Return Opcode if found
For the PUSH
instructions, pc is incremented multiple times.
Peeking inside the code
next
will give us the next instruction to execute. For now, I am just going to print it to
the console without extra logic. This will be useful in the future to debug our code.
The easiest way is to make our enum derive from Debug
. Then we’ll be able to print the enum
name with println!
.
#[derive(Debug)]
enum Opcode {
...
}
fn run() -> {
...
loop {
match vm.next() {
Some(x) => println!("{:?}", x),
None => {}
}
}
}
Just by doing so, I have the following output.
ADD
PUSH1(0)
STOP
MUL
STOP
thread ‘main’ panicked at ‘index out of bounds: the len is 143 but the index is 143’, libcore/slice/mod.rs:2046:10
This is good, but we can do much better. First, just display the enumeration name and content does not give enough information. I’d like a description of the opcode and the address of the opcode in the binary. Second, this panic should not be there, as our program executed as expected.
To avoid the panic, I’ll add a secret Opcode (sshh) that signifies our program ends. We could break on None in the loop but I still want to intepret the whole code even if there are codes I
haven’t implemented yet. So let’s add EOF
to our Opcode enumeration. In reality, 0x00 will
take care of that.
Before the match in next
:
if self.pc >= self.code.len() {
return Some(Opcode::EOF);
}
And in the match of the main loop:
loop {
match vm.next() {
Some(Opcode::EOF) => break,
Some(x) => println!("{:?}", x),
None => {}
}
}
Happy days!
At last, I want to print more information. so let’s add the instruction number and the description for each enum.
#[derive(Debug)]
enum Opcode {
STOP(usize), // 0x00
ADD(usize), // 0x01
MUL(usize), // 0x02
PUSH1(usize, u8), // 0x60
PUSH2(usize, u8, u8), // 0x61
PUSH32(usize, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8), // 0x7f
EOF,
}
And next will become:
fn next(&mut self) -> Option<Opcode> {
if self.pc >= self.code.len() {
return Some(Opcode::EOF);
}
let addr = self.pc;
match self.code[addr] {
0x00 => {
self.pc += 1;
Some(Opcode::STOP(addr))
},
0x01 => {
self.pc += 1;
Some(Opcode::ADD(addr))
},
0x02 => {
self.pc += 1;
Some(Opcode::MUL(addr))
},
0x60 => {
let value = self.code[self.pc+1];
self.pc += 2;
Some(Opcode::PUSH1(addr, value))
},
0x61 => {
let value0 = self.code[self.pc+1];
let value1 = self.code[self.pc+2];
self.pc += 3;
Some(Opcode::PUSH2(addr, value0, value1))
},
_ => { self.pc += 1; None}
}
}
Now we can create a function that will describe the opcode.
impl Opcode {
fn describe(&self) {
match self {
Opcode::STOP(line) => println!("0x{:x}\tSTOP\tHalts execution", line),
Opcode::ADD(line) => println!("0x{:x}\tADD\tAddition operation", line),
Opcode::MUL(line) => println!("0x{:x}\tMUL\tMultiplication operation", line),
Opcode::PUSH1(line, x) => println!("0x{:x}\tPUSH1\tPlace 1-byte item on the stack 0x{:x}", line, x),
Opcode::PUSH2(line, x0, x1) => println!("0x{:x}\tPUSH2\tPlace 2-bytes item on the stack 0x{:x} 0x{:x}", line, x0, x1),
_ => println!("Unknown opcode")
}
}
}
// update run function
fn run() -> Result<(), Box<Error>> {
let filename = "Addition.bin-runtime";
println!("In file {}", filename);
let mut vm = Vm::new_from_file(&filename)?;
loop {
match vm.next() {
Some(Opcode::EOF) => break,
Some(x) => x.describe(),
None => {}
}
}
Ok(())
}
A final word
I showed in this article how to create a tool to display the instructions from an ethereum smart contract binary. It is not complete: the describe
and next
method have to be populated with all the opcodes in order to be complete.
In the next article I will introduce the memory layout of the EVM. We’ll talk about concepts such as stack, memory and persistent storage.