This is the second in a series of blogs featuring a student who worked on an HPCC Systems project in the summer of 2016. Syed Rahman is working towards a PhD in Statistics at the University of Florida and is a returning student intern. In 2015, he implemented a machine learning algorithm in ECL for the HPCC Systems open source project. Having been impressed with this work, we were delighted to welcome Syed back on to the team this year to work on another project.
In 2015, Syed implemented the CONCORD algorithm. In big data, there are many cases where the number of fields exceeds or is close to the number of observations, which makes the sample covariance matrix a poor estimate of the true covariance matrix. The CONCORD algorithm provides a way to estimate the true population covariance matrix more accurately. Watch Dr Khare and Syed talk about this algorithm in depth on Community day at the 2015 HPCC Systems Engineering Summit. You can also read the blog Syed created to track his progress while working on the CONCORD project.
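To see why the sample covariance matrix struggles when there are more fields than observations, here is a minimal Python sketch. It is not part of the CONCORD implementation (which was written in ECL); the data and the helper names `sample_covariance` and `rank` are made up for illustration. With n observations, the sample covariance matrix has rank at most n - 1, so with fewer observations than fields it is singular and cannot be inverted.

```python
from fractions import Fraction


def sample_covariance(rows):
    """Sample covariance (divisor n - 1) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(Fraction(r[j]) for r in rows) / n for j in range(p)]
    centred = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def rank(matrix):
    """Exact matrix rank by Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    found = 0  # number of pivots found so far
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found


# Hypothetical data: 3 observations (rows) of 4 fields (columns).
scores = [
    [72, 65, 80, 78],
    [85, 90, 60, 64],
    [58, 61, 75, 70],
]

cov = sample_covariance(scores)   # a 4 x 4 matrix
cov_rank = rank(cov)              # at most n - 1 = 2, so cov is singular
print(f"{len(cov)} x {len(cov)} covariance matrix has rank {cov_rank}")
```

Exact fractions are used instead of floating point so that the rank calculation is not affected by rounding.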
This year, Syed has been working as an HPCC Systems intern on the Convex Sparse Cholesky Selection (CSCS) machine learning algorithm, which is related to and builds on the CONCORD algorithm.
One purpose of this algorithm is to be able to show causal inferences between variables using a Directed Acyclic Graph (DAG). Take the example of exam scores for a class of 30 students, looking at four subjects: maths, physics, English and history. A DAG can be created to map maths to physics, showing how those who perform well in maths are also likely to do well in physics. A similar map between English and history might also imply that if a student performs well in English, they are also likely to do well in history. This method can also be used ‘in the real world’ for mapping cell signalling pathways, showing how cells perceive and respond to their microenvironment in terms of cell development, tissue repair and immunity. By understanding cell signalling, in terms of how cells behave when working correctly and what happens when they malfunction, we may be able to treat diseases such as cancer more effectively.
Another use is to provide an efficient way to estimate a positive definite inverse covariance matrix, especially when the number of observations is smaller than the number of variables.
The main advantage of the CSCS algorithm over the CONCORD algorithm is that it is inherently parallelised. Each row can be calculated independently of the other rows, which makes it possible to fully utilise the power of distributed computing to calculate the rows in parallel.
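The row-by-row independence can be illustrated with a small Python sketch. This is not the CSCS algorithm itself and not the ECL implementation: it swaps in the classical, unpenalised modified Cholesky decomposition, where each row of a unit lower-triangular factor T comes from regressing one variable on the variables before it, and the inverse covariance matrix is T^T D^{-1} T. CSCS adds a sparsity penalty to each row's problem, which is left out here. The matrix `cov` and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from fractions import Fraction


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (a square, nonsingular)."""
    n = len(b)
    aug = [[Fraction(v) for v in a[i]] + [Fraction(b[i])] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def cholesky_row(cov, i):
    """Row i of the factor T and the residual variance d_i.

    Only cov is read, never another row's result, so the rows can be
    computed in any order or at the same time.
    """
    sub = [row[:i] for row in cov[:i]]          # covariance of the predecessors
    rhs = [cov[k][i] for k in range(i)]         # their covariance with variable i
    coeffs = solve(sub, rhs) if i > 0 else []   # regression coefficients
    resid_var = Fraction(cov[i][i]) - sum(c * cov[i][k] for k, c in enumerate(coeffs))
    row = [-c for c in coeffs] + [Fraction(1)] + [Fraction(0)] * (len(cov) - i - 1)
    return row, resid_var


def inverse_covariance(cov):
    """Assemble T^T D^{-1} T from rows that were computed independently."""
    p = len(cov)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda i: cholesky_row(cov, i), range(p)))
    t = [row for row, _ in results]
    d = [var for _, var in results]
    return [[sum(t[k][i] * t[k][j] / d[k] for k in range(p)) for j in range(p)]
            for i in range(p)]


# Hypothetical positive definite covariance matrix for three variables.
cov = [
    [4, 2, 1],
    [2, 3, 1],
    [1, 1, 2],
]
omega = inverse_covariance(cov)
for row in omega:
    print([str(v) for v in row])
```

The thread pool is only there to make the independence visible; threads in Python will not speed up this kind of arithmetic. On a cluster, each row's calculation could be handed to a different node.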
If you want to know more about the principles and implementation of the CSCS algorithm, listen to the presentation that Dr Khare and Syed gave at the 2016 HPCC Systems Engineering Summit. For those of you who want to take some time to peruse their method and calculations in more depth, view their presentation slides.
Syed also submitted an entry into our poster competition held at the 2016 HPCC Systems Engineering Summit which showcased his work on the CSCS project. Syed was the third place winner in what was a close competition with a high standard of entries. It’s always a real delight to meet our student interns in person, especially those who work remotely, like Syed.
In case you don’t know this, students are free to suggest projects to us that they would like to complete as part of an internship with HPCC Systems. In fact, we positively encourage this approach. The only requirement is that the project must be related to and/or complementary to the HPCC Systems open source project and our community. As with the projects on our own list, students must supply a proposal providing details about the suggested project and a timeline for the work to be done during the internship. We consider these proposals on an equal footing alongside those submitted for projects on our list.
Both the CONCORD and CSCS algorithm projects were suggested to us by Syed and his supervising professor at the University of Florida, Dr Kshitij Khare. Both have been very successful and we are pleased to accept these contributions into the HPCC Systems machine learning library for the benefit of all our users.
Our thanks go to Syed for completing another successful internship on the HPCC Systems open source project, and our congratulations on achieving a well-deserved 3rd place in our poster competition. Thanks also to Dr Khare of the University of Florida and to Syed’s HPCC Systems mentor, John Holt, for supporting him during both internship experiences.
More information about our HPCC Systems interns of 2016…
- Read about Suk Hwan Hong and Column Level Security on HPCC Systems
- Read about Sarthak Jain and the Latent Semantic Analysis Machine Learning Algorithm
- Read about Lily Xu and the YinYang K-Means Clustering Machine Learning Algorithm
- Read about Vivek Nair and his machine learning regression suite and ML plugins for the Data Science Portal
- Read about Shweta Oak and Non-negative Matrix Factorization on HPCC Systems
More about internship opportunities…