We had a number of great proposals again this year from students eager to join the HPCC Systems Summer Intern Program. This year, five students have joined our platform team and are already working on their chosen projects, which include machine learning algorithms and enhancements to the HPCC Systems platform itself. Once the projects are completed, you will be able to take advantage of the enhancements the students have contributed to our open source project. Here is what is coming your way soon, courtesy of our new summer 2017 HPCC Systems platform team members.
There are three machine learning related projects that students are working on this summer.
Lily Xu – PhD student of Computer Science at Clemson University, USA
Lily is a returning student intern who joined the team in the summer of 2016 to implement the Yinyang K-Means algorithm in ECL for the HPCC Systems Machine Learning Library. This year, she is building on that work and, in particular, is looking to optimize this clustering algorithm for large clusters. She is already making good progress implementing a fully functional Yinyang algorithm and running performance tests to establish a baseline.
Lily is keeping a blog journal documenting her progress on this project, which also includes her notes about the work she completed last year. You can also read more about the Yinyang K-Means algorithm in a blog featuring the work Lily completed in 2016.
George Mathew – PhD student of Computer Science at NCSU, USA
George’s task is to implement a Gradient Trees machine learning algorithm in ECL. This algorithm is used for building predictive models, for example, to rank the results returned by a search engine such as Google.
George has already implemented a lightweight version of Gradient Boosting using Linear Regression, which can be extended to a robust implementation by replacing the ordinary least squares linear regression model with a regression tree model once that has also been implemented. He is currently implementing Gradient Boosting for classification, building on the Gradient Boosting for regression. Next up, he will implement the regression tree based regression and plug it into the Gradient Boosting framework.
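As an illustration of the idea described above, here is a minimal sketch of gradient boosting for regression with an ordinary least squares base learner. It is written in Python rather than ECL and is not George’s implementation; the function names (`fit_ols`, `gradient_boost`, `predict`) and the single-feature setup are assumptions made for brevity. Each round fits a small linear model to the current residuals and adds a shrunken copy of it to the ensemble.

```python
from typing import List, Tuple

Learner = Tuple[float, float]  # (slope, intercept), already scaled by the learning rate


def fit_ols(xs: List[float], ys: List[float]) -> Learner:
    """Fit y ~ slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def gradient_boost(xs: List[float], ys: List[float],
                   rounds: int = 100, learning_rate: float = 0.1) -> List[Learner]:
    """Build an additive model by repeatedly fitting OLS to the residuals.

    For squared-error loss, the negative gradient is simply the residual
    (target minus current prediction), so each round fits the base learner
    to what the ensemble still gets wrong.
    """
    preds = [0.0] * len(xs)
    model: List[Learner] = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        slope, intercept = fit_ols(xs, residuals)
        # Shrink the new learner before adding it to the ensemble.
        scaled = (learning_rate * slope, learning_rate * intercept)
        model.append(scaled)
        preds = [p + scaled[0] * x + scaled[1] for p, x in zip(preds, xs)]
    return model


def predict(model: List[Learner], x: float) -> float:
    """Sum the contributions of every learner in the ensemble."""
    return sum(slope * x + intercept for slope, intercept in model)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2 * x + 1 for x in xs]  # toy data on the line y = 2x + 1
    model = gradient_boost(xs, ys)
    print(round(predict(model, 10.0), 2))
```

Swapping `fit_ols` for a regression tree fitter, while leaving the boosting loop unchanged, corresponds to the extension described above.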
George is keeping a blog journal about his project and intern experience with HPCC Systems.
Sarthak Jain – Recently completed his BTech in Computer Engineering at Delhi Technological University, India
In the fall, he will start his PhD in Computer Science at Northeastern University, MA in the USA. If Sarthak’s name is familiar to you, it is because he has worked with us twice before: first as part of our participation in Google Summer of Code in 2015, and again last year when he joined the HPCC Systems Summer Intern Program. He has worked on two different machine learning projects, adding statistics for Linear and Logistic Regression and implementing a Latent Semantic Analysis algorithm.
This year, he is working on implementing a Documentation Generator for ECL code. The HPCC Systems Machine Learning Library is undergoing a major rework at the moment. Families of related machine learning algorithms are being implemented as individual bundles. Sarthak’s project is directly related to this restructuring effort. We want to put a better system in place for generating documentation from developer comments made in the source code and make it searchable. Sarthak’s documentation generator will be a huge asset to our open source project, making it a lot easier for contributors to find all the information they need to evaluate and use the algorithms in our machine learning library. Moreover, the impact of this work is not limited to our machine learning library. We fully intend to extend the use of this documentation generator to the rest of the HPCC Systems source code comments.
If you’d like to track Sarthak’s progress on this project, read his blog journal. More information is also available on his previous Latent Semantic Analysis and Logistic and Linear Regression projects.
Two students are working on HPCC Systems platform related projects.
Vivek Nair – PhD student of Computer Science at NCSU, USA
Vivek is a returning student who implemented a machine learning regression suite for the HPCC Systems Machine Learning Library in 2016. This year, Vivek is working on a project which will allow Spark to use HPCC Systems as a datastore and provide ECL programmers with the ability to access Spark algorithms.
Vivek has been focused on processing native (flat) HPCC files from within Apache Spark. He has successfully executed KMeans streaming using Spark’s scalable machine learning library (MLlib) on HPCC files that can fit in memory. His next task is to explore the boundaries of what is possible in that scenario, such as source data size and complexity, additional MLlib functions, etc. At this point, the focus is on allowing a Spark user to access data stored on HPCC Systems. At a later date, he will address the scenario of an ECL user using Spark to process HPCC data.
If you are interested in following Vivek’s progress on this project, he has created a blog journal as a record of his internship experience.
David Skaff – 11th grade student at NSU University School in Florida, USA
David is the youngest student to have joined our internship program. He is also a member of the NSU University School Robotics Team, which designed and built an autonomous robot that won the Amaze Award at the 2017 Vex Robotics World’s Competition earlier this year. We are delighted to welcome him on to the team and we hope his success will inspire other high school students interested in a career in coding to apply for an internship with us in the future.
David’s project is to provide Unicode implementations for HPCC Systems standard library functions. Having got himself up and running using HPCC Systems, he has been familiarising himself with GitHub (our sources are here) and ECL, and reading up about Unicode. He has now moved on to working on the test cases required for the first function, which will allow him to start coding.
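To show why Unicode-aware versions of string functions are needed at all, here is a small Python sketch. The post does not say which standard library functions are being ported, and the project itself targets ECL, so the helper names below (`code_point_length`, `byte_length`, `equal_normalized`) are purely illustrative assumptions. The sketch shows two common pitfalls: a string’s byte length differs from its character count, and two strings that look identical can be stored as different code point sequences.

```python
import unicodedata


def code_point_length(text: str) -> int:
    """Number of Unicode code points in the string."""
    return len(text)


def byte_length(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def equal_normalized(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    NFC composes a base letter and its combining marks into a single
    code point where one exists, so visually identical strings compare equal.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    composed = "caf\u00e9"      # "é" as one code point
    decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
    print(code_point_length(composed), byte_length(composed))
    print(code_point_length(decomposed), byte_length(decomposed))
    print(composed == decomposed, equal_normalized(composed, decomposed))
```

Test cases like these, written before the function itself, are one way to pin down the expected behaviour of a Unicode-aware function.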
HPCC Systems intern program – 2017 and beyond
This is the third year we have run the HPCC Systems intern program and we are already looking to add new challenges to the list of available projects for 2018 and beyond.
All the projects on our list contribute significant new features and enhancements to our open source project which benefit our community. As such, they are quite challenging, so it is probably more accurate to say that our interns are considered to be ‘one of the team’ for the 12 weeks they are with us. Each student is appointed a mentor who is an expert in their chosen area and provides guidance and support. Interns are also invited to team meetings to discuss progress and any challenges they have encountered, and to get help and support from the wider team.
We value the contributions they make to HPCC Systems and hope each student enhances their learning and experience as a result of working with us. But we also want every student who works with us to enjoy their intern experience. It is a great compliment to the team, and especially our mentors, that every year we receive applications from students who want to return as well as from those who are new to HPCC Systems.
The proposal period for 2018 will open towards the end of September. It’s never too early to start preparing! Subscribe to our student forum to get notifications. Remember, students can suggest their own project idea, but it must leverage HPCC Systems and be of benefit to our open source project and community.