This is the fourth post in a series on implementing the Ethereum Virtual Machine in Clojure. Relevant portions of the code are reproduced inline, with the full source available at nervous-systems/sputter on Github.

- Part I: Executing EVM Bytecode.
- Part II: Words, Memory, Gas.
- Ethereum’s Compact Merkle Trie (Part I: Compaction).
- Ethereum’s Compact Merkle Trie (Part II: Merkling).

# Introduction

By the end of the previous post,
we’d implemented a referentially transparent in-memory trie which correctly compacted
nodes. The fundamental abstractions were the `Trie`

and `Node`

protocols.

Our implementation requires a couple of additional features, if we’re to call it a Merkle trie:

- A means of reconstructing a tree with reference to the cryptographic hash of some
root node (i.e. from a
*root hash*). - Generation of Merkle proofs.

It’d also be a good time to add support for persisting tries in a key-value store. While neither of the above requirements specifically entails disk persistence, it may help us reason about how the structure is used in practice.

# What’s a Merkle Tree?

A *merkle tree* is any tree in which nodes are addressed by cryptographic hash,
with each node’s hash computed across the hashes of its descendents.

(The coloured labels are for visual clarity. The hash expressions beneath them —
in terms of some hash function `H`

— are better thought of as the node labels).

Imagine our access to the binary merkle tree above is being intermediated by some
untrusted agent, of whom we request the value corresponding to path `b.f`

(“val4”).

Assuming we know nothing trustworthy about the contents of the tree — only its
root hash and the key or path we requested — the agent can convince us of the
provenance of the value with a *Merkle proof*: a representation of the nodes on
the path to the returned value. In the context of `b.f`

from the above binary tree,
that might look like:

`[[Ha Hb] [He Hf] "val4"]`

Where each inner vector is a node represented as `[left-child-hash, right-child-hash]`

.
To verify, we check that:

`H(Ha + Hb)`

matches our trusted root hash`H(He + Hf)`

matches the hash at index`1`

in the previous node serialization^{1}(`Hb`

)`H("val4")`

matches`Hf`

.

We’re now confident we either chose a terrible hash function, or that the value exists the claimed position in the tree.

^{1}Assume that the client understands how to determine the branch selected at each level, for a given key.

## Uses

An intuitive use for Merkle trees can be found in file distribution
networks. In the peer-to-peer case, a *tracker* — a trusted central authority
— stores the root hashes of Merkle trees constructed from each file’s contents.
Supplied with the trusted hash for some file, a client solicits blocks of content
from untrusted peers, each accompanied by a Merkle proof of their position beneath
the root^{1}.

Ethereum relies on Merkle trees to ensure data consistency: block headers incorporate
the root hashes of various tries used to store bookkeeping information. Further,
full nodes can Merkle-prove any values retrieved at the behest of light clients, who
possess only partial knowledge of the state. A client may request, e.g. the balance of
account `0x180b92ec...`

and verify the accompanying proof against the relevant root hash from the block
headers.

^{1}I believe Gnutella does something like this. While very different projects, IPFS & Docker also distribute files with Merkle trees.

# Ethereum’s Merkle Trie

In order to hash nodes (per the Merkle tree explanation), we require a canonical
node representation — Ethereum’s is vectors of byte strings, encoded in RLP
(*Recursive Length Prefix*) — a simple type-length-value format. When encoding
nodes containing children, the child’s identity is:

`len(rlp(child)) < 32 ? rlp(child) : Keccak256(rlp(child))`

— we use whichever is more compact, the child’s RLP or its hash.

For the purposes of this article, we don’t really care about the details of RLP, just that we’ve a reversible encoding:

```
(-> (->Leaf (->nibbles "k") "v")
trie.rlp/->rlp ;; => <-60, -126, 32, 107, 118>
util.rlp/decode ;; => [<-126, 32, 107> <118>]
trie.rlp/vector->node) ;; => #Leaf{:key ...}
```

Our strategy is similar to several other implementations: we’ll persist nodes
in a key value store (pluggable, with support for LevelDB), using node hashes as keys,
and RLP-encoded nodes as values^{1, 2}. We may store several
independent tries within a single key store — a trie is reconstituted with
reference to both the disk location of the store and a particular root hash within it.

^{1}I think this is an awkward scheme. The persisted values are opaque to the K/V store, which — aside from ensuring inefficient/cumbersome descent — means it’s impossible to safely delete anything without implementing persistent reference counts on top of the K/V store. I think Ethereum’s use of inter-trie references (e.g. some hash stored as value in trie A is the root of trie B) further complicates things.

^{2}Why store the

*nodes*as RLP? No particular reason — given we’re using an unstructured key store, we need

*some*serialization for them. We need to generate RLP in order to calculate hashes, so we may as well use it as the disk format.

# Implementation

How do we determine which nodes are to be retained in memory
(i.e. as realized Node implementations, hanging off the trie’s `root`

), and when
do we write changes to disk?

One straightforward approach’d be to only retain dirty nodes (in addition to the root),
and offer an explicit `Trie/commit`

operation responsible for atomically flushing to
disk any retained nodes — a downside being that we’ll hit the store whenever a
`Trie/-search`

involves clean nodes^{1}. On the upside, we can implement
it with very few changes to our in-memory trie.

Before writing any code, let’s explore what the tree’ll to look like at different stages, with occasional reference to this storage protocol:

```
(ns sputter.state.kv)
(defprotocol KeyValueStore
(retrieve [store k]
"Return the value associated with the given key.")
(insert [store k v]
"Return a new store, with an additional mapping.")
(insert-batch [store k->v]
"Return a new store, with additional mappings."))
```

^{1}We could easily slightly alter this in the future for better read performance — by using a separate read cache, or an explicity

`:dirty?`

flag
for in-memory nodes.
## Example

Keys are represented as sequences of hex nibbles. Branch nodes hold at most one child per nibble (0-15), with an optional value. Leaf and extension nodes are compactions of redundant layers of branch nodes. Leaves point to values, extensions point to child nodes.

Starting from a trie with a `nil`

root, we insert two mappings, *hello* → *hello*
and *help* → *help*. The entirety of the tree now resides in memory, with a
root node like so:

```
#Extension{:key [6 8 6 5 6 12] ;; "hel"
:child #Branch{6 #Leaf{:key [12 6 15]
:value "hello"}
7 #Leaf{:key [0]
:value "help"}}}
;; (->nibbles "hello") = [6 8 6 5 6 12 6 12 6 15]
;; (->nibbles "help") = [6 8 6 5 6 12 7 0]
```

On calling `Trie/commit`

, we expect all of the in-memory nodes to be persisted
through `KeyValueStore`

, resulting in a shallow, memory-efficient representation of the
tree that we can expand via the store as needed. The root node after the commit:

```
#Extension{:key [6 8 6 5 6 12]
:child <bytes -96, 13, 114...>}
```

Where `<bytes -96, 13, 114...>`

is an RLP representation of the branch node from the
previous snippet. In this case, the branch exceeds our threshold for inline representation —
decoding the bytes yields the string `d726c196...`

: a hash we can
retrieve from the store.

```
(-> (kv/retrieve store "d726c196...")
rlp/decode
trie.rlp/vector->node)
;; =>
#Branch{:branches {6 <bytes -55, -126, 60...>
7 <bytes -58, 48, -124, 104, 101, 108, 112>}}
```

The branch’s children are similarly encoded. Decoding the child at position 7 — an inline node, rather than a hash — yields a vector, from which we can easily construct a node:

```
(-> (rlp/decode <bytes -58, 48, -124, 104, 101, 108, 112>)
trie.rlp/vector->node)
;; =>
#Leaf{:key [0] :value "help"}
```

## Trie/commit

At a high level, `Trie/commit`

wants to:

- Visit all realized Nodes depth-first, flattening them into RLP
- Insert into the key store a
`hash`

→`rlp`

mapping for each visited node. - Return a “shallow” representation of the root node in which all children have been flattened.

We’re going to need a means of visiting all of a node’s children, without
breaking the `Node`

abstraction. In the previous instalment, we implemented the
utility function `walk-path`

in terms of `Node/descend`

& `Node/reattach`

— let’s go with a
generic `Node/walk`

function:

```
(ns sputter.state.trie.node ...)
(defprotocol Node
...
(walk [node outer inner]
"Apply `inner` to all children & `outer` to this.
Return the resulting Node."))
(extend-protocol Node
Branch
(walk [this outer inner]
(outer (update this :branches (partial util/map-vals inner))))
Leaf
(walk [this outer _]
(outer this))
Extension
(walk [this outer inner]
(outer (update this :child inner)))
nil
(walk [_ _ _] nil))
```

Using the above to recursively transform nodes beneath `root`

(via
some function `f`

) would look like:

```
(defn transform [f root]
(node/walk root f (partial transform f)))
```

We can use this pattern to implement `sputter.state.trie.util/flatten`

, which’ll
do most of the lifting for `Trie/commit`

.

To avoid clutter, the RLP codec functionality isn’t reproduced:

```
(def ^:private hash-bytes 32)
(defn inline-node? [rlp]
(< (count rlp) hash-bytes))
(defn- walk-realized [f node]
(cond-> node
(node/node? node) (node/walk f (partial walk-realized f))))
(defn- flatten-one [writes node]
(let [rlp (trie.rlp/->rlp node)]
(if (inline-node? rlp)
rlp
(let [hash-k (util/sha3-bytes rlp)]
(swap! writes assoc hash-k rlp)
(rlp/encode hash-k)))))
(defn flatten [root]
(let [writes (atom {})]
(node/walk
root
(fn finish [root]
(let [rlp (trie.rlp/->rlp root)
hash-k (util/sha3-bytes rlp)]
[(swap! writes assoc hash-k rlp) hash-k root]))
(partial walk-realized (partial flatten-one writes)))))
```

`flatten-one`

is called depth-first on all realized descendents of the root
(i.e. every descendent which passes the `node/node?`

check in `walk-realized`

).
Each node is replaced in its parent by `flatten-one`

‘s return value — either an
RLP-encoded hash (for which a corresponding entry exists in `writes`

), or an RLP-encoded
inline node.

`finish`

, inside `flatten`

, returns a vector containing the final
value of `writes`

, the hash of the root node, and new root node with flattened/unrealized
children.

Given the above, `Trie/commit`

itself is trivial:

```
(defrecord KVTrie [root store]
Trie
...
(commit [this]
(let [[writes hash root] (util/flatten root)]
(assoc this :root root
:store (kv/insert-batch store writes)
:hash (util/bytes->hex hash))
```

## Trie/-search

You’ll recall `Trie/-search`

is responsible for returning a lazy sequence of
`[node remaining-key]`

vectors, one for each level descended while searching
for the given key.

Our previous, in-memory implementation of `Trie/-search`

was something like:

```
(->> (iterate #(some->> % (apply node/descend)) [root k])
(take-while identity))
```

Now that it has to do a little more, let’s lift it into a utility which
uses `lazy-seq`

directly:

```
(defn- search-below [store node k]
(lazy-seq
(cons [node k]
(when-let [[child rest-k] (node/descend node k)]
(search-below store (realize child store) rest-k)))))
```

At each step, `child`

is one of three things:

- A realized
`Node`

. - An inline node, represented as an RLP-encoded byte array
- The hash of a node in
`store`

, represented as an RLP-encoded byte array.

`realize`

is a trivial function responsible for taking any of those inputs
and returning a `Node`

. Given `search-below`

, `Trie/-search`

just provides the
root as the starting point:

```
(defrecord KVTrie [root store]
Trie
...
(-search [this k]
(search-below store root k)))
```

And `realize`

, if you’re curious:

```
(defn- hash->vector [hash store]
(-> (kv/retrieve store hash)
rlp/decode))
(defn realize [child store]
(if (node/node? child)
child
(some-> (rlp/decode child)
(cond-> (not (rlp/vector? child)) (hash->vector store))
trie.rlp/vector->node)))
```

`rlp/decode`

will return `nil`

(and short-circuit the `some->`

) if passed
the RLP encoding for a null node.

## Trie/-insert

Previously, insertion was defined in terms of a utility function `walk-path`

,
which visits nodes along some key:

```
(defn- walk-path* [node k outer inner]
(if-let [[child rest-k] (node/descend node k)]
(outer (node/reattach node k (inner child rest-k)) k)
(outer node k #{:terminal?})))
(defn walk-path [f node k]
(walk-path* node k f (partial walk-path f)))
```

We add a `store`

parameter (`KeyValueStore`

) and insert a `realize`

call:

```
(defn walk-path [f store node k]
(walk-path* (realize node store) k f (partial walk-path f store)))
```

And, finally, we modify the definition of `Trie/-insert`

to pass the extra parameter:

```
(defn- attach* [v node k & [opts]]
(cond-> node (:terminal? opts) (node/attach k v)))
(defrecord KVTrie [root store]
Trie
...
(-insert [this k v]
(let [root (util/walk-path (partial attach* v) store root k)]
(assoc this :root root))))
```

# Bonus Round: Merkle Proofs

Ethereum’s representation of a proof is a vector of the RLP-encoded nodes visited when retrieving a given key — inline nodes are not included at the top-level, as they’ll be nested within their parent’s RLP.

As Merkle proofs may be used to demonstrate the *absence* of a key, we
don’t care whether we terminated at an exact match, or the closest
node.

`prove`

(`trie`

, `key`

→ `[rlp ...]`

) retrieves from a trie the
realized nodes along the key’s path, and encodes each in RLP,
returning the result as a vector:

```
(defn prove [trie k]
(let [rlp (map (comp trie.rlp/->rlp first) (-search trie k))]
(into [(first rlp)]
(filter (complement util/inline-node?) (rest rlp)))))
```

Note that `prove`

would fail in `trie.rlp/->rlp`

if called on a trie
with uncommitted writes, as it doesn’t recursively flatten the nodes
received from its search. I think this is basically reasonable behaviour.

```
(defn verify [hash k [head & rest-proof]]
(if-not (and head (Arrays/equals hash (sha3-bytes head)))
::invalid
(let [node (util/bytes->node head)
[next-hash v] (descend-inline node k)]
(if next-hash
(recur next-hash v rest-proof)
(if rest-proof
::invalid
v)))))
```

`verify`

takes a trusted root hash, the key for which we requested a proof, and
a proof vector — returning either `::invalid`

or the value associated with the
key, if any.

`descend-inline`

takes a top-level node from the proof vector and
passes through any inline children before returning the next node’s hash (& remaining key),
or the final value:

```
(defn- descend-inline [node k]
(if-let [[child rest-k] (node/descend node k)]
(if (util/inline-node? child)
(recur (util/bytes->node child) rest-k)
[(rlp/decode child) rest-k])
[nil (node/value node k)]))
```

You made it! The full source is available at nervous-systems/sputter on Github.