Continuing our series of blogs about the successes of our student interns, this blog focuses on the work of Vivek Nair. Vivek first joined the HPCC Systems team as part of the LexisNexis Corporate Intern Program in 2015, helping to optimise the Random Forest Algorithm in the HPCC Systems Machine Learning Library (ECL-ML). In 2016, he rejoined the team, this time as part of the HPCC Systems intern program, with Arjuna Chala as his mentor. He was tasked with working on a proof of concept to produce a regression suite for the HPCC Systems Machine Learning Library and with developing a number of machine learning plugins for the Data Science Portal (DSP).
One objective of Vivek’s project was to benchmark the ECL-ML library against other implementations. A happy side effect of this work would be the ability to extract a slimmed-down version of these benchmarks to provide a regression test suite, tailored specifically for the Machine Learning Library, that takes under 1 hour to run.
For the purposes of this proof of concept, Vivek focused on a subset of relatively small algorithms to make the verification process easier. The algorithms used were Random Forest, Decision Trees, Logistic Regression, Linear Regression, K-Means and ARIMA. The aim of the regression suite is to be able to compare performance across different software versions, cluster types and sizes. Where the performance is shown to be similar to the benchmark, the test passes.
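The pass criterion described above can be sketched roughly as follows. This is an illustration only, not the suite's actual code: the function names, the 5% relative tolerance and the sample scores are all assumptions made for the example, and the sketch is written in Python rather than ECL.

```python
def passes(observed: float, benchmark: float, rel_tol: float = 0.05) -> bool:
    """Return True when the observed score is within rel_tol of the benchmark.

    The 5% default tolerance is a hypothetical value chosen for this sketch.
    """
    if benchmark == 0:
        # Avoid dividing by zero; fall back to an absolute comparison.
        return abs(observed) <= rel_tol
    return abs(observed - benchmark) / abs(benchmark) <= rel_tol


def run_suite(observed: dict, benchmarks: dict, rel_tol: float = 0.05) -> dict:
    """Compare each algorithm's observed score with its stored benchmark.

    An algorithm with no observed score is reported as a failure.
    """
    results = {}
    for algorithm, benchmark in benchmarks.items():
        score = observed.get(algorithm)
        results[algorithm] = score is not None and passes(score, benchmark, rel_tol)
    return results


# Hypothetical scores, invented purely to show the comparison.
benchmarks = {"Linear Regression": 0.95, "K-Means": 0.90, "Random Forest": 0.88}
observed = {"Linear Regression": 0.93, "K-Means": 0.70}

report = run_suite(observed, benchmarks)
for algorithm, ok in report.items():
    print(f"{algorithm}: {'PASS' if ok else 'FAIL'}")
```

A relative tolerance is used here so that the same threshold can be applied to scores of different magnitudes; a real suite might well choose a different rule for each algorithm.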
Another aim was to provide a regression suite that runs fast enough to be complementary to the development process rather than a hindrance. The comprehensive set of tests takes around 4 hours to run, so Vivek has prepared a slimmed-down version which runs in 11 minutes.
The HPCC Systems Machine Learning Library is the focus of a major upgrade at the moment. Our ML team has been adding the BLAS functions used by PBBlas to the ECL Standard Library and we are moving towards implementing each family of related ML algorithms as a separate 'bundle'. Each regression test created by Vivek will be fully integrated into the relevant bundle. Currently, we are integrating the regression tests for the Linear Regression Bundle, soon to be followed by the Logistic Regression Bundle.
Vivek's second objective was to add Machine Learning Plugins to the Data Science Portal (DSP), to provide easy access to distributed ML algorithms while eliminating the need for new or beginner ECL users to learn the more advanced ECL techniques required for this level of data analysis. Vivek needed to adapt the code to work with the DSP and carefully port the existing Machine Learning Library, adding four algorithms: Linear Regression, Decision Tree, Random Forest and K-Means.
In starting this work, Vivek has paved the way for us to be able to add more machine learning algorithms to the Machine Learning Plugin for DSP and make it available for prime time usage in production environments.
Vivek is currently pursuing a PhD in Computer Science at North Carolina State University and his thesis explores fast ways to configure software systems. Vivek’s contribution to the HPCC Systems Open Source Project is extremely valuable and not limited to the projects mentioned above. Since completing his summer internship with HPCC Systems in 2016, he has also worked on the following additional projects:
- Building a library to integrate HPCC Systems with Spark. His contribution to this project involved developing a Python package which can be used to import (sampled) data stored in an HPCC Systems cluster to a local machine. He achieved this by using the web services exposed by the HPCC Systems platform.
- Working with a team that is currently building a FUSE plugin for HPCC Systems. The aim is to create a FUSE client that exposes HPCC Systems files to a remote system as a file system composed of JSON files.
Vivek also took part in the HPCC Systems poster competition at our Engineering Summit in October 2016. He showcased the work he completed on the Machine Learning Regression Suite and Machine Learning Plugins for the DSP, which earned him 2nd place. Congratulations to Vivek on this well-deserved achievement.
The HPCC Systems team thanks Vivek for his hard work on behalf of our community, who will benefit from his many valuable contributions to our open source project.
- The Data Science Portal (DSP) is a graphical user interface used to design and implement Big Data workflows and visual dashboards on the HPCC Systems platform. DSP generates ECL code automatically, eliminating the need for users to learn the ECL language.
More information about our HPCC Systems interns of 2016…
- Read about Suk Hwan Hong and Column Level Security on HPCC Systems
- Read about Syed Rahman and the CSCS Machine Learning Algorithm
- Read about Sarthak Jain and the Latent Semantic Analysis Machine Learning Algorithm
- Read about Lily Xu and the YinYang K-Means Clustering Machine Learning Algorithm
- Read about Shweta Oak and Non-negative Matrix Factorization on HPCC Systems
More about internship opportunities...