CR-002 - Blockchain Analytics with Cayley DB

*Published: 2019-10-21*

Bitcoin (and consequently the Blockchain) has been making waves in the media over the past few years. In this post we will cover the process of building relationships between blocks, transactions and addresses using Google’s Cayley DB. With this information we may be able to pinpoint important transfers of value, and also build an ownership database on top of it to track high value individuals.

I’m going to call this project Bayley, an amalgamation of “Bitcoin” and “Cayley”. I never was that creative. I actually renamed it to bit-cerebro, but that might be a little too ominous.

# The Blockchain

So what are the advantages of the Blockchain? What’s so special about it? For starters, for the first time in history we have a database that:

- Can’t easily be rewritten
- Eliminates the need for trust
- Resists censorship
- Is widely distributed

A lot of people like to call it a distributed ledger. That is true if you use Bitcoin as a currency, but it can be used for a lot more.

Since it is a new and possibly disruptive technology, I figured it would be a good idea to learn more about it. In the process I might also learn enough about how it works to build unique services on top of the Blockchain.

# The Database

I originally tried to build this project on MongoDB. I ended up shelving the idea as Mongo is not suitable for this task: the schema is consistent across blocks, and I need to be able to easily find relationships between datapoints.

I had a look at [levelgraph](https://github.com/mcollina/levelgraph) and [neo4j](http://neo4j.com/) but in the end decided to go with [Cayley](https://github.com/google/cayley). Cayley has been explored [previously by The Frontier Group](http://blog.thefrontiergroup.com.au/2014/10/look-cayley/), is a very new technology, and I wanted to learn how to use it.

# Setup Considerations

The first step will be to synchronise a copy of the blockchain locally for your use.
I used the testnet instead of mainnet for testing purposes.

Originally I used [BTCD](https://bitcoin.org/en/download) as I wanted a server-side, headless daemon. Bitcoin Core can do this, but not on OS X. With BTCD I constantly ran into bugs and inconsistencies such as:

- RPC setup using different variable names, making existing libraries that hook into Bitcoin Core useless
- JSON batch requests not being supported

In the end I just opted to run an instance (with GUI and all) of Bitcoin Core on my machine. [Get it here!](https://bitcoin.org/en/download)

Before starting to synchronise the Blockchain, note that by default Bitcoin Core does not keep a full transaction index, to conserve disk space. Transaction indexing can be turned on with the command-line switch `-txindex` or by adding the line `txindex=1` to your bitcoin.conf. RPC also needs to be enabled, as RPC calls to the Bitcoin daemon are what allow you to pull out the block data.

# Spinup instructions

- Install [Bitcoin Core](https://bitcoin.org/en/download)
- Install [Cayley DB](https://github.com/google/cayley)
- Install [Pycharm Community](https://www.jetbrains.com/pycharm/)
- Install python-bitcoinlib
- Install requests

# Overview of Process

From a high level, the process will look like this:

- Get block hashes from height
- Get blocks from block hashes
- Send the above data to Cayley DB in an HTTP POST request

This does not take transaction data into account. That will be a topic for a future blog post. So let’s get started!

# Setting up Bitcoin Core

The standard Bitcoin Core conf file has a lot of stuff in it, but in general you’ll need to make sure the following lines are set as follows:

```
txindex=1
testnet=1
server=1
rpcuser=bitcoinrpc
rpcpassword=BHaVEDoMkVr1xKudcLpVbGi2ctNJsseYrsuDufZxwEXb
rpcport=8332
```

The rpcpassword is autogenerated by Bitcoin Core. You can use an environment variable if you’re concerned about security and such.
Since this project is just for testing purposes and the password is randomised, I’m not too bothered that it’s sitting there in plaintext.

# Block Extraction

We’ll be using Peter Todd’s [python-bitcoinlib](https://github.com/petertodd/python-bitcoinlib) library. Install this using Pycharm, then add to the top of your `bayley.py` file:

```
import bitcoin
import bitcoin.rpc
```

The next step will be to write some simple code to extract some blocks.

```
def main():
    # Create a batch of commands that requests the hashes of the first 100 blocks of the blockchain
    commands = [{"method": "getblockhash", "params": [height]} for height in range(0, 100)]

    # Connect to the RPC server, send the commands and assign to the results variable
    conn = bitcoin.rpc.RawProxy()
    results = conn._batch(commands)

    # Extract the hashes out of the result
    block_hashes = [res['result'] for res in results]

    # Prepare to extract specific block data
    blocks = []
    for block_hash in block_hashes:
        blocks.append(conn.getblock(block_hash))

    # Call the function to make the triples to prepare for importing to CayleyDB
    block_triples = make_triples_for_block(blocks)
```

# Block Structure

Here is an example of a single block’s data:

```
{'bits': '1d00ffff',
 'chainwork': '0000000000000000000000000000000000000000000000041720ccb238ec2d24',
 'confirmations': 1,
 'difficulty': Decimal('1.00000000'),
 'hash': '0000000084ee00066214772c973896dcb65946d390f64e5d14a1d38dfa2e4d90',
 'height': 445610,
 'merkleroot': 'eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819',
 'nonce': 1413010373,
 'previousblockhash': '0000000087a272f48c3785de422e232c0771e2120c8fdd741a19ea98d122132b',
 'size': 315,
 'time': 1432705094,
 'tx': ['eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819'],
 'version': 3}
```

With this in mind we can begin working on pulling the data from the blockchain and parsing the specific blocks.
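To get a feel for these fields before building triples, here is a minimal sketch that condenses a block dict into a few readable values. The `summarise_block` helper and the `sample_block` variable are my own illustrative names, not part of python-bitcoinlib, and the sample is just a trimmed copy of the block above.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Trimmed copy of the example block shown above (illustrative input only)
sample_block = {
    "difficulty": Decimal("1.00000000"),
    "hash": "0000000084ee00066214772c973896dcb65946d390f64e5d14a1d38dfa2e4d90",
    "height": 445610,
    "previousblockhash": "0000000087a272f48c3785de422e232c0771e2120c8fdd741a19ea98d122132b",
    "size": 315,
    "time": 1432705094,
    "tx": ["eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819"],
}


def summarise_block(block):
    """Hypothetical helper: reduce a getblock-style dict to a short summary."""
    return {
        "hash": block["hash"],
        "height": block["height"],
        # 'time' is a Unix timestamp in seconds; render it as an ISO 8601 UTC string
        "mined_at": datetime.fromtimestamp(block["time"], tz=timezone.utc).isoformat(),
        # 'tx' is a list of transaction IDs
        "tx_count": len(block["tx"]),
        # The genesis block has no parent, so this key may be missing
        "previous": block.get("previousblockhash"),
    }


summary = summarise_block(sample_block)
print(summary)
```

Using `.get()` for `previousblockhash` is a defensive choice, since the very first block in the chain has no parent to point at.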
# Making Triples

Cayley uses the subject, predicate, object system, known as a [triplestore](https://en.wikipedia.org/wiki/Triplestore). We need to parse the block data from the previous section into this triplestore format. One of the limitations of the triplestore is that you cannot attach much metadata to each node. Array indexing and similar are a problem in this regard.

In this case we will use the block hash as the subject for all block data, the key as the predicate, and the corresponding value (excluding the block hash itself) as the object. Let’s create a function that does this.

At the top of my `bayley.py` file I will create a global variable which specifies the keys for which I want to create triples.

```
DESIRED_BLOCK_KEYS = ("height", "nextblockhash", "previousblockhash", "size", "time", "difficulty")
```

Next I wish to declare the function:

```
def make_triples_for_block(blocks):
    triples = []
```

We will next need to iterate through the blocks and their respective keys to start pulling the relevant data. The first thing to do is to ignore the `hash` key:

```
    for block in blocks:
        for key in block:
            # Ignore self reference
            if key == "hash":
                continue
```

The transactions value is an array, so it’s best to iterate through these separately.

```
            # Iterate through transactions
            if key == "tx":
                for t in block[key]:
                    triples.append({
                        "subject": block['hash'],
                        "predicate": key,
                        "object": t
                    })
```

And finally we can append our block data to the triples array we declared in the beginning. Note how I cast the values to strings; this was to prevent an issue later on when you want to import into CayleyDB. Cayley is happiest when you give her JSON files that are all strings.

```
            if key in DESIRED_BLOCK_KEYS:
                triples.append({
                    "subject": str(block['hash']),
                    "predicate": key,
                    "object": str(block[key])
                })
    return triples
```

So now we have a `triples` variable returned which contains all of our triples ready for importing!
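Assembled into one piece, the fragments above might look like the following sketch. It behaves the same way, except that it also casts the transaction IDs to strings and uses `elif` to make the two branches explicit. The `sample_block` input is a trimmed, illustrative copy of the block shown earlier, so it can run without a Bitcoin node.

```python
from decimal import Decimal

DESIRED_BLOCK_KEYS = ("height", "nextblockhash", "previousblockhash",
                      "size", "time", "difficulty")


def make_triples_for_block(blocks):
    """Turn getblock-style dicts into subject/predicate/object triples."""
    triples = []
    for block in blocks:
        subject = str(block["hash"])
        for key, value in block.items():
            if key == "hash":
                # Ignore self reference
                continue
            if key == "tx":
                # One triple per transaction ID in the block
                for txid in value:
                    triples.append(
                        {"subject": subject, "predicate": key, "object": str(txid)}
                    )
            elif key in DESIRED_BLOCK_KEYS:
                # Everything is cast to a string before it goes to Cayley
                triples.append(
                    {"subject": subject, "predicate": key, "object": str(value)}
                )
    return triples


# Illustrative input only: a trimmed copy of the example block
sample_block = {
    "bits": "1d00ffff",  # not in DESIRED_BLOCK_KEYS, so it is skipped
    "difficulty": Decimal("1.00000000"),
    "hash": "0000000084ee00066214772c973896dcb65946d390f64e5d14a1d38dfa2e4d90",
    "height": 445610,
    "previousblockhash": "0000000087a272f48c3785de422e232c0771e2120c8fdd741a19ea98d122132b",
    "size": 315,
    "time": 1432705094,
    "tx": ["eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819"],
}

triples = make_triples_for_block([sample_block])
for triple in triples:
    print(triple["predicate"], "->", triple["object"])
```

Keys that are neither `tx` nor listed in `DESIRED_BLOCK_KEYS` (such as `bits` here) simply produce no triple.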
Here is an example of the triples for a single block for your reference:

```
[{'object': '1',
  'predicate': 'height',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '190',
  'predicate': 'size',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': 'f0315ffc38709d70ad5647e22048358dd3745f3ce3874223c80a7c92fab0c8ba',
  'predicate': 'tx',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '000000006c02c8ea6e4ff69651f7fcde348fb9d557a06e6957b65552002a7820',
  'predicate': 'nextblockhash',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '1.00000000',
  'predicate': 'difficulty',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943',
  'predicate': 'previousblockhash',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '1296688928',
  'predicate': 'time',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'}]
```

# Setting up Cayley

After downloading Cayley and all its dependencies (and setting up golang in the process), boot up an instance of Cayley. This is my Cayley config file:

```
{
    "database": "bolt",
    "db_path": "./blockchain",
    "read_only": false,
    "replication_options": {
        "ignore_duplicate": true,
        "ignore_missing": true
    }
}
```

I’m using bolt db over leveldb because bolt is slightly better for read-heavy workloads. You can read more [here](https://www.progville.com/go/bolt-embedded-db-golang/).

After making the `cayley.cfg` file, initialise the database by running the init command like so (from the Cayley folder):

```
./cayley init -config cayley.cfg
```

This will create a `blockchain` file and prep the backend database for Cayley goodness.
The next step will be to run the HTTP server:

```
./cayley http -config cayley.cfg
```

Now we’re ready to send all the data in!

# Sending to Cayley

Cayley’s HTTP documentation [will help](https://github.com/google/cayley/blob/master/docs/HTTP.md) with this section. Cayley accepts JSON triples in the following form:

```
[{
    "subject": "Subject Node",
    "predicate": "Predicate Node",
    "object": "Object node",
    "label": "Label node"  // Optional
}]  // More than one quad allowed.
```

You’ll need to send this to the `/api/v1/write` URL (appended to localhost:64210, assuming you’re using Cayley’s standard settings).

Now we need to make use of the excellent requests python library. Install it in Pycharm, then add the following to the top of the `bayley.py` file. Cayley is expecting JSON, so we’ll also need to import the `json` module, which ships with Python’s standard library and needs no separate install. The `pp` helper used below comes from the standard library’s `pprint` module. You’ll also want to put in a global variable for Cayley’s URL, and a header that tells Cayley we’re sending JSON.

```
import json
from pprint import pprint as pp

import requests

DB_WRITE_URL = "http://127.0.0.1:64210/api/v1/write"
DB_WRITE_HEADERS = {'Content-type': 'application/json'}
```

We’re going to create a function to send the data over to Cayley. Note how the data is converted to JSON in the `data=` argument.

```
def send_data(data):
    r = requests.post(DB_WRITE_URL, data=json.dumps(data), headers=DB_WRITE_HEADERS)
    pp(r)
    pp(r.text)
```

If `pp(r)` prints out `<Response [200]>` then we’re good! If not, we’ll need to look at what went wrong, which is usually explained well in `r.text`. This is the result I got:

```
<Response [200]>
'{"result": "Successfully wrote 693 quads."}'
```

Go back to your main function and call the `send_data` function:

```
def main():
    ...
    send_data(block_triples)
    ...
```

And that should do it.

# Graphing the result

By now we should have 100 blocks in Cayley! Head over to `http://localhost:64210` and let’s start graphing! In the query page we can test out our queries.
I wrote a simple one that loops through the first 5 blocks, follows every outgoing edge (`Out()`) to the objects one hop away, and returns the result:

```
for (var i = 0; i < 5; i++) {
    g.V().Has("height", String(i)).Tag("source").Out().Tag("target").All();
}
```

Here is the result of the first block:

```
{
    "result": [
        {
            "id": "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
        },
        {
            "id": "1296688602",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "1296688602"
        },
        {
            "id": "0",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "0"
        },
        {
            "id": "285",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "285"
        },
        {
            "id": "1.00000000",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "1.00000000"
        },
        {
            "id": "00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206",
            "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
            "target": "00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206"
        }
    ]
}
```

The query shape looks like this:

![[Pasted image 20230510203657.png]]

The visualisation itself looks like the following images.

Single block:

![[Pasted image 20230510203701.png]]

Five blocks:

![[Pasted image 20230510203706.png]]

As you can see, there are shared nodes. This is because several blocks have the same object for the same predicate (for example, a difficulty of `1.00000000`), so different subjects (block hashes) point at one common node. This is a good example of how Cayley helps in visualising relationships.

# Conclusion

This is just the early stage. The next step will be to parse the transactions for Bitcoin addresses and start drawing all the relationships between them. Web scraping usernames for addresses, and estimating relationships based on round-number transfers of value, are also in the pipeline.