Breaking Down Bitcoin Blockchain using HPCC Systems
Rohan Maheshwari is a student at the RV College of Engineering in Bengaluru, India. He is a bachelor's student studying Computer Science and Engineering. He has a keen interest in Deep Learning, Natural Language Processing, Graph modelling and Machine Learning. His goal is to apply this knowledge to revolutionize the world of finance and sentiment analysis. He has worked under the Samsung PRISM program to create a code-mixed multi-intent classification system. Rohan has also worked with SCII to create an invoice extraction system. While studying at the RV College of Engineering, he is also actively working with the LexisNexis® Risk Solutions HPCC Systems® team and the RV College of Engineering Centre of Excellence on Cognitive Intelligent Systems for Sustainable Solutions to investigate block data stored on the blockchain, with the aim of gaining insight and building relationships between transactions that may shed light on potential criminal transactions.
In the finance world, the percentage of digital transactions is increasing exponentially. A good portion of these digital transactions are made using cryptocurrency, and the practice is rapidly becoming commonplace for everyday transactions. Although other cryptocurrencies have come into existence, Bitcoin has reigned supreme since its inception.
Bitcoin was created amid the turmoil of the Great Recession of 2008, as distrust of banks and their role in the financial system grew. It is a decentralized network of independent nodes in which transactions are validated by consensus among the participants of the network. It does not have the backing of any government, nor does it have bankers to enforce or propagate its use. Perhaps the biggest reason it remains the most popular cryptocurrency is the robust blockchain technology it is built on. The blockchain creates an immutable record of transactions, secured by cryptographic hashing and digital signatures, which prevents fraud and unauthorized activities. While the technology prevents fraud on the network, there are no checks to track how the bitcoins are being used and for what purpose. This becomes a concern when bitcoin is used for racketeering, trafficking, money laundering or any other illegal activity.
Our project aims to investigate the block data stored on the Bitcoin blockchain to gain insight and find correlations in the data that can shed light on the transactions. We do this by using the HPCC Systems big data analytics platform to ingest the data. By building rich relationships between the transactions and detecting anomalies in the blockchain network, we hope the results can be used by investigators to detect criminal activity.
A look at our Blockchain data, dataset, and parser
The Bitcoin blockchain is essentially a distributed database that consists of a constantly growing list of all Bitcoin transactions and records since its initial release in January 2009. At the time of writing, the Bitcoin blockchain is around 410 gigabytes in size. Our project aims to use this entire blockchain data, right from January 2009 to now.
Block data on the blockchain consists of information about the location of the block and other data related to the transactions contained within that block. It also contains timestamp information, the nonce (the number miners must find to produce a valid block) and the difficulty level. However, since our aim is to build rich relationships between the transactions, we need to extract a certain set of features from the blockchain data to act as the base for the models and techniques that follow. These features are:
- The transaction hash
- The input address
- The output address
- The transaction amount (in satoshi)
- The timestamp of the transaction
Extracting these features requires two to three times the space taken by the blockchain data itself, as well as a substantial amount of time. To address this, we use a special parser, written in Python and ECL, to obtain the required features from the block data. ECL (Enterprise Control Language) is designed specifically for huge data projects using the HPCC Systems platform. Incorporating ECL into the parser helps accelerate the parsing process and spray the data onto the HPCC Systems cluster, while the Python code converts the raw hexadecimal block data into human-readable data. One complication is that the blockchain does not store the input address. Instead, each input stores a reference to an earlier transaction together with the output of that transaction that acts as input to the current one.
This pair is not, however, the data we need. To find the input address, we have to locate the referenced earlier transaction and read the address of the matching output. This portion of the processing was done using ECL, as indexing in Python would have been quite inefficient.
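To make the lookup concrete, here is a minimal Python sketch of resolving input addresses from previous outputs. It is only an illustration on toy data: the project performs this step in ECL, and the names `TxOutput`, `TxInput` and `resolve_input_addresses` are hypothetical, not taken from the project's parser.

```python
from typing import NamedTuple


class TxOutput(NamedTuple):
    address: str
    amount_satoshi: int


class TxInput(NamedTuple):
    prev_tx_hash: str  # hash of the earlier transaction being spent
    prev_index: int    # position of the spent output in that transaction


def resolve_input_addresses(outputs_by_tx, inputs_by_tx):
    """Map each transaction hash to the addresses that funded it."""
    # Index every output by (transaction hash, output position) so that
    # each input can be resolved with a single dictionary lookup.
    output_index = {
        (tx_hash, position): output
        for tx_hash, outputs in outputs_by_tx.items()
        for position, output in enumerate(outputs)
    }
    resolved = {}
    for tx_hash, inputs in inputs_by_tx.items():
        resolved[tx_hash] = [
            output_index[(inp.prev_tx_hash, inp.prev_index)].address
            for inp in inputs
            if (inp.prev_tx_hash, inp.prev_index) in output_index
        ]
    return resolved


# Toy data (hypothetical): tx_b spends the second output of tx_a.
outputs_by_tx = {
    "tx_a": [TxOutput("addr_1", 50_000), TxOutput("addr_2", 25_000)],
    "tx_b": [TxOutput("addr_3", 24_000)],
}
inputs_by_tx = {
    "tx_a": [],                   # no resolvable inputs in this toy example
    "tx_b": [TxInput("tx_a", 1)],
}
resolved = resolve_input_addresses(outputs_by_tx, inputs_by_tx)
print(resolved)
```

In this toy example the input address of `tx_b` resolves to `addr_2`. On the full blockchain the index would not fit comfortably in a single Python dictionary, which is consistent with doing this join on the cluster instead.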
Anomaly detection, trimmed K-Means clustering, feature extraction and implementation
Even though blockchain technology prevents fraudulent behavior on the network, it cannot detect fraud on its own. Therefore, it is important to use anomaly detection to identify potential scams. Clustering is a well-known technique for anomaly detection, and K-means is one of the most popular clustering algorithms. In our project, instead of detecting anomalies in individual addresses and wallets, we examined anomalies in users. Since users carrying out illicit activities mainly use multiple wallet addresses, it is more effective to choose a method that examines the user's behavior rather than a single wallet address. We achieved this by using the trimmed K-means algorithm for clustering. Trimmed K-means sets aside a fixed fraction of the points that lie farthest from their cluster centers before updating those centers, which makes it more robust to outliers than classical K-means clustering.
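The idea behind trimmed K-means can be sketched in a few lines of plain Python. This is a simplified illustration, not the project's implementation: the function names, the toy points and the explicit `initial_centroids` argument are assumptions made for the example. Passing the starting centroids explicitly keeps the sketch deterministic; in practice one would typically try several random starts, since a poor start can leave an outlier sitting in a cluster of its own.

```python
import math


def assign(points, centroids):
    """Return (distance, point_index, cluster_index) tuples, nearest first."""
    assigned = []
    for idx, point in enumerate(points):
        dists = [math.dist(point, c) for c in centroids]
        nearest = min(range(len(centroids)), key=dists.__getitem__)
        assigned.append((dists[nearest], idx, nearest))
    return sorted(assigned)


def trimmed_kmeans(points, initial_centroids, trim_fraction, iterations=20):
    """K-means that ignores the farthest `trim_fraction` of points each round."""
    centroids = list(initial_centroids)
    n_keep = len(points) - int(len(points) * trim_fraction)
    for _ in range(iterations):
        # Keep only the points closest to their centroid; the rest are trimmed.
        kept = assign(points, centroids)[:n_keep]
        for j in range(len(centroids)):
            members = [points[idx] for _, idx, cluster in kept if cluster == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    # Points still trimmed under the final centroids are the anomaly candidates.
    trimmed = sorted(idx for _, idx, _ in assign(points, centroids)[n_keep:])
    return centroids, trimmed


# Toy data (hypothetical): two tight groups and one far-away point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
centroids, trimmed = trimmed_kmeans(
    points, [points[0], points[4]], trim_fraction=0.12
)
print(centroids, trimmed)
```

With these toy points, the far-away point at index 8 is the one trimmed, and the two centroids settle on the centers of the two tight groups without being pulled towards it.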
To carry out this approach we decided on using ten specialized features based on our parsed dataset. These features were extracted by executing ECL programs and then using their outputs as the input to our trimmed K-Means algorithm. These ten specialized features were:
# | FEATURE | DEFINITION |
1 | Average amount incoming | The average amount of bitcoins received by the user's wallet addresses |
2 | Average amount outgoing | The average amount of bitcoins sent from the user's wallet addresses |
3 | Total amount sent | The total amount of bitcoins sent from the user's wallet addresses |
4 | Total amount received | The total amount of bitcoins received by the user's wallet addresses |
5 | Standard deviation received | The standard deviation of the amount of bitcoins received by the user's wallet addresses |
6 | Standard deviation sent | The standard deviation of the amount of bitcoins sent from the user's wallet addresses |
7 | Average neighborhood (In-in) | The average neighborhood of inputs to inputs of all outputs |
8 | Average neighborhood (In-out) | The average neighborhood of inputs to outputs of all outputs |
9 | Average neighborhood (Out-in) | The average neighborhood of outputs to inputs of all outputs |
10 | Average neighborhood (Out-out) | The average neighborhood of outputs to outputs of all outputs |
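As an illustration of the first six features, the following Python sketch computes them from a list of (sender, receiver, amount) transfers. It is a simplification under stated assumptions: the project extracts these features with ECL programs and groups addresses by user, whereas this sketch works per address on toy data. The function name `amount_features` and the dictionary keys are hypothetical, and the four neighborhood features are not covered.

```python
from collections import defaultdict
from statistics import mean, pstdev


def amount_features(transfers):
    """Compute amount-based features per address from (sender, receiver, amount)."""
    received = defaultdict(list)
    sent = defaultdict(list)
    for sender, receiver, amount in transfers:
        sent[sender].append(amount)
        received[receiver].append(amount)

    def summarise(amounts):
        # Average, total and population standard deviation; zeros if no data.
        if not amounts:
            return 0.0, 0, 0.0
        return mean(amounts), sum(amounts), pstdev(amounts)

    features = {}
    for address in sorted(set(sent) | set(received)):
        avg_in, total_in, std_in = summarise(received.get(address, []))
        avg_out, total_out, std_out = summarise(sent.get(address, []))
        features[address] = {
            "avg_incoming": avg_in,
            "avg_outgoing": avg_out,
            "total_sent": total_out,
            "total_received": total_in,
            "std_received": std_in,
            "std_sent": std_out,
        }
    return features


# Toy transfers in satoshi (hypothetical data).
transfers = [("A", "B", 100), ("A", "B", 300), ("B", "C", 150)]
features = amount_features(transfers)
print(features["B"])
```

Each resulting feature dictionary can then be turned into a numeric vector and passed to the clustering step.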
Our trimmed K-Means clustering approach succeeds in detecting anomalies of potentially suspicious users having multiple wallet addresses.
Time series approach
The clustering approach tries to find anomalies across all the addresses. But what if we already knew the pattern of behavior of a fraudulent address? This is where our second approach to the problem comes in. Just as a fingerprint acts as a unique biometric identifier of a person, the time series data generated by a person's footsteps can also be used to identify them with a high degree of certainty.
A similar time series is generated by a person's transactions using a given address on a blockchain. The overall strategy is to take the time series of a known fraudulent address, compare the time series of another address against it, and ask the following question: do the two addresses draw their data from the same underlying distribution? To help with this line of enquiry we make use of chaotic systems. The same systems that model ocean turbulence!
The assumption is that every time series draws its data from some underlying chaotic attractor, which models a person's unique behavior. If we can show that two time series come from the same distribution, and one of the addresses has been established as fraudulent, we can say that there is something suspicious about the transactions made by the other address.
The two concepts used here are Takens' theorem and the multivariate Wald-Wolfowitz test. The first is used to embed the time series data, and the second is used to test whether two embedded series share the same underlying distribution.
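A rough Python sketch of these two ingredients is given below. It is an illustration under assumptions, not the project's code: the embedding dimension, the delay and the toy series are arbitrary, the function names are hypothetical, and only the run count of the multivariate Wald-Wolfowitz (Friedman-Rafsky) test is computed, not its normalised statistic or p-value. The run count is obtained by pooling both embedded series, building a minimum spanning tree over the pooled points, and counting the tree edges that join points from different series.

```python
import math


def delay_embed(series, dimension, delay):
    """Time-delay embedding: each vector is (x[i], x[i+delay], ...)."""
    n_vectors = len(series) - (dimension - 1) * delay
    return [
        tuple(series[i + j * delay] for j in range(dimension))
        for i in range(n_vectors)
    ]


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet part of the tree.
        u = min(
            (i for i in range(n) if not in_tree[i]),
            key=best_dist.__getitem__,
        )
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


def friedman_rafsky_runs(sample_a, sample_b):
    """Run count: MST edges joining the two samples, plus one."""
    pooled = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    cross = sum(1 for i, j in mst_edges(pooled) if labels[i] != labels[j])
    return cross + 1


# Toy series (hypothetical): two clearly separated behaviours.
embedded_a = delay_embed([1, 2, 3, 4, 5, 6], dimension=2, delay=1)
embedded_b = delay_embed([101, 102, 103, 104, 105, 106], dimension=2, delay=1)
runs = friedman_rafsky_runs(embedded_a, embedded_b)
print(embedded_a, runs)
```

A small run count, as in this toy example where the two embedded series are far apart, suggests the series come from different distributions, while well-mixed samples produce many cross edges and therefore a large run count.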
Additions for the future
One limitation of this solution is that timing is only available at the block level and not at the transaction level. There also seem to be ways to associate external data, such as the IP address of the user, with an address. The use of this approach is not limited to Bitcoin, and it could be applied in other similar areas such as credit fraud detection. Hopefully solutions can be found as we move forward, so that we are able to fully comprehend this data.
Learn more about our collaboration with RV College of Engineering in India and watch a video of the 2022 launch of the RV College of Engineering Centre of Excellence on Cognitive Intelligent Systems for Sustainable Solutions.