2020 Intern Contributions to the HPCC Systems Open Source Project

In recent years, the HPCC Systems Intern program has attracted university level students, although the program is open to high school students too. We have had a few high schoolers join the program in recent years, but we’d like more!

We have been extending our academic outreach to encourage more high schoolers to look at and join our program and we have been visiting schools and getting involved in other programs aimed at young coders, such as CodeDay.

So in 2020, we were particularly pleased to reap the benefits of our efforts by welcoming more high schoolers onto the program than in previous years. It was great to see the achievements of our young coding superstars: Jack Fields (American Heritage School of Boca/Delray, FL), Jefferson Mao (Lambert High School, Suwanee, GA) and Nathan Halliday.

It’s always really pleasing to see that students enjoy working on our intern program so much that they return to complete additional projects. This year we had two returning students: Robert Kennedy (PhD Computer Science, Florida Atlantic University) returned for the third year running and Vannel Zeufack (Masters, Computer Science, Kennesaw State University) returned to complete his second internship. Yash Mishra (Masters, Computer Science, Clemson University), who has been working alongside the team for the last 18 months via our Academic Program, joined the intern program to work on a specific cloud related project as an extension to the work already completed.

Academic institutions and their semesters around the world can run to different schedules. Our global intern program is designed to be flexible enough to accommodate these differences. In 2020, we were delighted to welcome Matthias Murray from New College of Florida, where the Masters of Data Science program requires students to complete a 16 week practicum to graduate, and a high school student from the UK, where the school terms run into July.

What you really want to know is what our Class of 2020 achieved during their internships. So here is a synopsis of their work with links to resources you can use to find out more and comments from the mentors who supported them.

Jack Fields
High School Student
American Heritage School, Florida

Image showing Jack Fields from American Heritage School

Jack was mentored by our LexisNexis Risk Solutions Group colleagues David DeHilster and Xiaoming Wang. Jack joined the team to work on a project that leverages HPCC Systems and machine learning to contribute to the progress being made on his school’s Autonomous Security Robot. Jack’s work involved updating the Robotics API, which was created by a previous student from his school in 2018 (Aramis Tanelus). In addition to this, he was tasked with using the HPCC Systems GNN bundle and TensorFlow to train a model to recognise known faces. This significant task is intended to enable the robot to identify potential security risks and threats from unknown campus visitors.

Jack entered our 2020 Technical Poster Contest and won our first ever Community Choice Award. His poster was chosen as the winning entry for this award by our open source community members during our Virtual Community Day Summit in October 2020.

Using the HPCC Systems Generalized Neural Network (GNN) Bundle with TensorFlow to Train a Model to Find Known Faces Leveraging the Robotics API

Use these resources to find out more about Jack’s work:

Mentor Comments from David DeHilster

Jack Fields is the second high school intern who has opened up new doors for HPCC Systems into the open source world of robotics. Jack built upon the previous work of Aramis Tanelus (also from American Heritage School) who wrote the code that makes it easy to take data from robotic sensors and ingest it into HPCC Systems.

Jack took this work a step further by processing the incoming data using machine learning, specifically by using the Generalized Neural Network package with TensorFlow, now provided as part of an open source machine learning bundle. Jack’s work took images of people’s faces captured from a camera on a security robot and used HPCC Systems machine learning and neural networks to try to recognize the face as a friend or foe.

With Jack’s help, HPCC Systems continues to push open source robotics software into new areas including big data and machine learning.

Jefferson Mao
High School Student
Lambert High School, Georgia

Photo of Jefferson Mao

Jeff was mentored by our LexisNexis Risk Solutions Group colleagues Xiaoming Wang and Godson Fortil. His project was to set up an HPCC Systems cluster on the Google Cloud platform. He took some time to evaluate the new Google Anthos GKE platform and used our own regression tests, adding new ones where necessary, to run cloud specific tests in areas such as scaling. He also tested our Helm charts, which were still under development at the time, contributing some useful feedback that has helped us to improve how they work. Jeff heard about HPCC Systems because of our involvement in CodeDay in 2019 and 2020, during which students completed challenges set by our LexisNexis Risk Solutions Group colleagues.

Jeff entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Use Case Category.

HPCC Systems on the Google Cloud Platform

Use these resources to find out more about Jeff’s work:

Mentor Comments from Xiaoming Wang

Jeff’s Google Anthos project was very challenging, particularly for someone who is just starting out on their software development journey. It required lots of reading and had a steep learning curve. This project involved the type of research that would not be out of place at Masters degree level.

Jeff took on the challenge and was very self-motivated. He spent lots of time exploring and setting up an environment. With some help, he created the HPCC Systems Anthos setup code which you will find in the GitHub projects shown below. The work was completed across three major public cloud platforms: Google Cloud Platform (GCP), AWS and Azure.

Jeff’s work was a very valuable contribution to the HPCC Systems Multiple Cloud development and testing. Find out more about our HPCC Systems Cloud Native Platform using these resources.

Nathan Halliday
High School Student

Photo of Nathan Halliday

Nathan was mentored by our LexisNexis Risk Solutions Group colleague Gavin Halliday. Most ECL users will be familiar with the idea of workunits, containing graphs and subgraphs that are used to execute the ECL queries.  They are probably less familiar with workunits containing multiple workflow items.  These workflow items are used to implement PERSIST, INDEPENDENT and the other workflow services.  The job of determining the order to execute the workflow items, taking care of all the dependencies, belongs to the workflow engine. There is often potential for different workflow items to be executed in parallel (for example evaluating independent persists), but for simplicity the current engine executes one at a time. Nathan took on the challenge of improving this, allowing multiple workflow items to execute at the same time, potentially allowing workunits to run more quickly.
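
To make the idea concrete, here is a minimal sketch in Python of how items with dependencies can be scheduled so that independent ones run at the same time. It is not the HPCC Systems workflow engine, which is part of the platform code. The function name `run_workflow`, the item names and the dictionary layout are all assumptions made up for this illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_workflow(items, dependencies, max_workers=4):
    """Run workflow items in dependency order, in parallel where possible.

    items: dict mapping an item id to a zero-argument callable (hypothetical).
    dependencies: dict mapping an item id to the set of ids it depends on.
    Returns the list of item ids in the order they finished.
    """
    # Copy the dependency sets so the caller's data is not modified.
    remaining = {item: set(dependencies.get(item, ())) for item in items}
    finished = []
    running = {}  # future -> item id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Start every item whose dependencies have all completed.
            ready = [item for item, deps in remaining.items() if not deps]
            for item in ready:
                del remaining[item]
                running[pool.submit(items[item])] = item

            if not running:
                # Nothing is running and nothing can start.
                raise ValueError("cyclic or unsatisfiable dependencies")

            # Block until at least one running item completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                item = running.pop(future)
                future.result()  # re-raise any exception from the item
                finished.append(item)
                for deps in remaining.values():
                    deps.discard(item)

    return finished


if __name__ == "__main__":
    # Two independent "persist" items can run together; "report" needs both.
    order = run_workflow(
        {
            "persist_a": lambda: None,
            "persist_b": lambda: None,
            "report": lambda: None,
        },
        {"report": {"persist_a", "persist_b"}},
    )
    print(order)
```

In this sketch the two independent items may finish in either order, but the dependent item always finishes last. A sequential engine corresponds to setting `max_workers=1`.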

Nathan has now completed his high school education and has moved on to studying an undergraduate degree in Mathematics.

Nathan entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Platform Enhancement Category.

The Parallel Workflow Engine

Use these resources to find out more about Nathan’s work:

Mentor Comments from Gavin Halliday

Nathan’s contributions included a large number of test cases (for more details, look at this pull request and also this pull request) and an implementation of a new workflow engine. The ability to execute multiple workflow items at the same time created situations that could not previously occur (for example, multiple graphs running at once and potential deadlock on resources), so many of Nathan’s other contributions fixed some of these issues (for more information, see this pull request about lock acquisition and this one about threading issues).

The code is merged into the 7.12.x branch of the HPCC Systems repository, but has not yet been enabled by default.  There are still a couple of issues with running multiple graphs in ROXIE and Thor that need to be resolved first.  Once those are addressed, there should be a significant number of Thor queries that will potentially run more quickly.

Matthias Murray
Masters, Data Science
New College of Florida

Photo of Matthias Murray - 2020 HPCC Systems Intern

Matthias was mentored by our LexisNexis Risk Solutions Group colleagues Lili Xu, Arjuna Chala and Roger Dev. Matthias’s project involved reporting on the current status of NLP and applications of embeddings trained on SEC filings, while compiling and analyzing SEC filings and their intersection. His project also required the sorting and transformation of SEC data, creating a function to convert the data into a format required by the HPCC Systems Word Vectors ML bundle. Since he completed his project, Matthias has written and submitted a paper about his findings which has been accepted into the 4th IEEE International Workshop on Big Data for Financial News and Data, held in December 2020.

Applying HPCC Systems Word Vectors to SEC Filings

Use these resources to find out more about Matthias’s work:

Mentor Comments from Lili Xu

Matthias’s work is the first intern project using our natural language processing bundle TextVectors to analyze public finance data. It opens the door for modeling finance problems with sentiment analysis in HPCC Systems. It’s also a great resource for students or researchers from different fields who want to apply NLP in their own projects.

Robert Kennedy
PhD, Computer Science
Florida Atlantic University

Photo of Robert Kennedy, PhD student at Florida Atlantic University

Robert joined the HPCC Systems intern program for the third time in 2020. During all three of his internships he has been mentored by our LexisNexis Risk Solutions Group colleagues Tim Humphrey and Roger Dev, with additional support from his university supervisor, Dr Taghi Khoshgoftaar, Motorola Professor at Florida Atlantic University.

Robert’s area of interest is deep learning and all the intern projects he has completed have focused on this area. He was among the first users to use HPCC Systems with TensorFlow, completing a project in 2018 looking at Distributed Deep Learning on HPCC Systems with TensorFlow (Watch 2018 Community Day Presentation / View Slides / View Poster). In his second internship, Robert looked at Expanding Deep Neural Network Capabilities on HPCC Systems, which involved creating a new bundle providing GPU Accelerated Neural Network training features and tools (Watch 2019 Community Day Presentation / View Slides / View Poster / Project Blog Journal).

In 2020, Robert built on his previous work by implementing a Multi-node, Multi-GPU Accelerated Deep Learning Algorithm using the HPCC Systems GNN bundle. During his testing, Robert showed that GPU accelerated GNN training times are significantly faster than when using CPUs. His recommendations also showed that the degree of speedup is dependent on the neural network size and overall size of the training set, confirming his assumption that the larger the neural network model, the greater the effect of the GPU acceleration. While using multiple GPUs has a communication cost, using fewer GPUs is still faster than using a greater number of CPUs.

Robert entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Data Analytics Category.

Implement a Multi-node, Multi-GPU Accelerated Deep Learning Algorithm using GNN

Use these resources to find out more about Robert’s work:

Mentor Comments from Tim Humphrey and Roger Dev

The work that Robert completed during the 2020 summer intern program will benefit anyone wanting to use GNN with a very large dataset/training set. His contribution means that training neural networks is 10 times faster when GPUs are available on the THOR nodes being used. This implementation makes it possible to use available GPUs when using GNN and is a valuable addition to HPCC Systems machine learning capabilities.

Vannel Zeufack
Masters, Computer Science
Kennesaw State University

Photo of Vannel Zeufack, Masters student at Kennesaw State University

Vannel joined the HPCC Systems Intern Program for the second year running in 2020. During his internships, Vannel has completed two machine learning projects that have very different applications. In 2019, he focused on a machine learning project that involved looking at Unsupervised Log Based Anomaly Detection. Vannel implemented two anomaly detection processes using the HPCC Systems K-Means bundle (Watch Recording / View Slides / View Poster / Project Blog Journal).

In 2020, Vannel switched the emphasis of his machine learning interest towards contributing a Preprocessing Bundle. The preprocessing of the data is by far the most time consuming part of the whole process for machine learning related projects and the engineers working on them. Vannel’s preprocessing bundle provides modules for processing categorical features and scaling data, as well as functions for normalising data and easily splitting datasets into training and test data. The aim of this bundle is to help our machine learning users to significantly reduce the amount of time needed in the preprocessing phase, so they can move on to the analysis part quickly and efficiently.
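
For readers new to preprocessing, the following Python sketch shows the kinds of operations described above: encoding a categorical feature, scaling, normalising and splitting data. It is an illustration only and does not show the API of the HPCC Systems Preprocessing Bundle, which is written in ECL. All function names here are hypothetical.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread to rescale.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


def standardize(values):
    """Normalise numbers to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((value - mean) ** 2 for value in values) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    colours, mapping = label_encode(["red", "blue", "red"])
    print(colours, mapping)
    print(min_max_scale([2, 4, 6]))
    train, test = train_test_split(range(8), test_fraction=0.25)
    print(len(train), len(test))
```

Fixing the shuffle seed keeps the split reproducible, which matters when comparing models trained on the same data.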

Implement a Preprocessing Bundle for the HPCC Systems Machine Learning Library

Use these resources to find out more about Vannel’s work:

Mentor Comments from Lili Xu, Arjuna Chala and Roger Dev

Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time on preprocessing. Data preprocessing is therefore the first, critical step along the modeling journey. Vannel’s project enhanced the current preprocessing capability of the HPCC Systems Machine Learning library. It makes data science projects in HPCC Systems smoother and more complete.

Yash Mishra
Masters, Computer Science
Clemson University

Photo of Yash Mishra, Masters student at Clemson University

Yash is a member of a research team run by Dr Amy Apon that collaborates with us on machine learning related projects. As part of our 2019 collaboration with Clemson University, Yash had been completing some valuable work looking at how our bare metal platform worked on the cloud. This put him in the perfect position to suggest an intern project to contribute to the ongoing development work on our cloud native platform.

Yash’s project involved leveraging the Kubernetes support for HPCC Systems, focusing on performance measurements, cost analysis and looking at the various configuration options, including storage and scaling. During his internship, Yash made a number of contributions to our development team, providing valuable insights and feedback on setting up and using cloud native HPCC Systems clusters with AWS and Microsoft Azure. Yash was mentored by our LexisNexis Risk Solutions Group colleague Dan Camper and his university supervisor, Dr Amy Apon, Clemson University.

Leveraging and Evaluating Kubernetes support for HPCC Systems on Microsoft Azure

Use these resources to find out more about Yash’s work:

Mentor Comments from Dan Camper

Yash’s project compared early releases of the HPCC Platform’s new cloud configuration with legacy configurations.  This was quite challenging, because the cloud configuration was under active development at the time.  His work outlined the many architectural changes between the two configurations and he briefly addressed how those changes could impact a runtime environment in terms of cost and performance.  Yash’s project serves as the beginning of a comparison framework that could easily be expanded in different directions, such as how different types of workloads react in different environments.

Congratulations to our Class of 2020 Interns

Having read all about their work, I’m sure you get the very clear message about the value added to the HPCC Systems Open Source Project by our interns. Every single one of these projects makes a positive impact on our platform and community users. Our intern program runs for 12 weeks and I think it is all the more impressive that our interns achieve so much in what is really quite a short snapshot of time.

I’ve managed to get this far into the blog without even mentioning the COVID-19 Pandemic, which just goes to show that some things do manage to go on regardless! Our program was impacted slightly. We started the program a little later than usual and some students needed to finish earlier due to time constraints. All students worked remotely (find out more about that here), and we made adjustments as necessary to build in the need for flexibility. Even with all this going on, 7 students still completed internships with us this year and the quality of the work and the positive attitude of everyone involved was as strong as ever.

The fact that this is true is testimony to the strong work ethic and dedication of the students, as well as the support of their mentors.

Photos of all the interns together and the badge they all received

We thank our interns for their contribution to HPCC Systems in 2020 and everyone who played a part in making 2020 another extremely successful year for the HPCC Systems Intern Program.

The proposal application period for the 2021 HPCC Systems Intern Program is now open.

The deadline date for proposal applications is Friday 19th March 2021. Don’t leave it until the last minute! We do award places in advance of that deadline, so get started now!

Find out more about the program, how it works and how to apply by reading this blog or visiting our Student Wiki. To get ideas for a possible project, visit our Available Projects list, watch Tech Talk Webcasts featuring interns or look at the posters illustrating the project work completed.

Have a question about the program? Contact us.