
Here we are well into August already. Our two HPCC Systems® interns are putting the finishing touches to their projects, completing documentation and submitting evaluations about their time working with us.

Implement the CONCORD algorithm into the HPCC Systems® Machine Learning Library

Syed Rahman’s project is now complete. The CONCORD algorithm is a method for estimating the true population co-variance matrix. The co-variance matrix is a summary of the relationship between every pair of fields in the data. Co-variance values close to zero indicate that the fields don’t have a linear relationship. Once the values are scaled by the standard deviation of each field, values close to 1 indicate a strong positive relationship and values close to –1 indicate a strong inverse relationship.
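To make those values concrete, here is a minimal sketch in plain Python. It is not ECL and not part of the Machine Learning Library; the function names and the four made-up fields are assumptions chosen purely for illustration. It builds a sample co-variance matrix and then rescales it by each field's standard deviation, which is the step that puts every value on the –1 to 1 scale.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample co-variance matrix; each row is one observation."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def correlation_matrix(cov):
    """Rescale a co-variance matrix so every entry lies between -1 and 1."""
    p = len(cov)
    return [
        [cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
        for i in range(p)
    ]


# Hypothetical data with four fields (a, b, c, d):
# b rises with a, c falls as a rises, d has no linear relationship with a.
rows = [
    (1.0, 2.0, 9.0, 5.0),
    (2.0, 4.0, 7.0, 3.0),
    (3.0, 6.0, 5.0, 3.0),
    (4.0, 8.0, 3.0, 5.0),
]

cov = covariance_matrix(rows)
corr = correlation_matrix(cov)

for name, value in zip("bcd", corr[0][1:]):
    print(f"a vs {name}: {value:+.2f}")
```

The first row of `corr` compares field a with each of the others: the entry for b is at the positive end of the scale, the entry for c is at the negative end, and the entry for d sits at zero.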

In classic statistics there are many more observations than fields. In this case, the co-variance matrix of the sample is a good estimate for the true co-variance matrix.

Unfortunately, in big data, there are many cases where the number of fields exceeds, or is close to, the number of observations. In these cases, the sample co-variance matrix is a very poor estimate for the true co-variance matrix.
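One way to see the problem is with a small simulation. The sketch below is plain Python rather than ECL, and it does not implement CONCORD; the helper names, the seed, and the sample sizes are all assumptions made for illustration. It draws independent fields, so the true co-variance matrix is the identity and has full rank, then compares the sample co-variance matrix from 5 observations with the one from 2000 observations.

```python
import random


def sample_covariance(rows):
    """Sample co-variance matrix; each row is one observation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def max_error_vs_identity(cov):
    """Largest absolute difference between cov and the identity matrix."""
    p = len(cov)
    return max(
        abs(cov[i][j] - (1.0 if i == j else 0.0))
        for i in range(p)
        for j in range(p)
    )


def simulate(n_obs, n_fields, rng):
    """Independent standard normal fields, so the true matrix is the identity."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_fields)] for _ in range(n_obs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
N_FIELDS = 10

few = sample_covariance(simulate(5, N_FIELDS, rng))      # fewer observations than fields
many = sample_covariance(simulate(2000, N_FIELDS, rng))  # many more observations than fields

rank_few = matrix_rank(few)
rank_many = matrix_rank(many)

print("rank with 5 observations:   ", rank_few, "of", N_FIELDS)
print("rank with 2000 observations:", rank_many, "of", N_FIELDS)
print("max error with 5 observations:   ", max_error_vs_identity(few))
print("max error with 2000 observations:", max_error_vs_identity(many))
```

With fewer observations than fields, the sample matrix can never have a rank above the number of observations minus one, so it cannot be inverted, whatever the data look like. That is one concrete sense in which it is a poor estimate.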

It’s clear that Syed’s addition to our Machine Learning Library is an important improvement, providing a way to get more reliable results in this area.

Syed is speaking at the HPCC Systems® Engineering Summit at the end of September this year. His presentation demonstrates how this algorithm works and why it is a better method of estimating the true population co-variance matrix. I’ll post a link to the recording of his presentation as soon as it is available.

Improving Child Query Processing Project

Anshu Ranjan has also just completed his time as an intern with us. He’s been working on this project with his mentors Gavin Halliday and Jamie Noss.

While this project has had limited success, there are good reasons why this is the case. The code generator is pretty much the engine room of the system and, as a result, not only do you need an eye for detail, but you also need a good overview of the entire system and how each component interacts with the others. Gavin Halliday is our resident expert in this area and, through years of experience, knows it inside out. So we know that it is a complex and challenging area.

One of the things we hoped to get out of this project was some feedback on how to improve the internal documentation for developers, so that others can contribute to the codebase in the future. Having a student working in this area has certainly helped us to highlight some specific improvements we can make to our internal documentation, and we have already made some as a result of Anshu joining the team.

And finally...

Our thanks go to Syed and Anshu for contributing to HPCC Systems®. It's been great having them work with us and we appreciate the work they have done.

We also wish them well while they complete their studies and decide on a future career. Perhaps our paths will cross again sometime in the future!

Notes:

1. Read Syed Rahman's blog to find out more about the CONCORD algorithm.

2. To find out more about the HPCC Systems® Machine Learning Library see the Machine Learning Library Reference.

3. Read Anshu Ranjan's blog to find out more about the Improving Child Query Processing project.