2020 Intern Contributions to the HPCC Systems Open Source Project
In recent years, the HPCC Systems Intern Program has mostly attracted university-level students, although the program is open to high school students too. We have had a few high schoolers join the program in recent years, but we’d like more!
We have been extending our academic outreach to encourage more high schoolers to join our program, visiting schools and getting involved in other programs aimed at young coders, such as CodeDay.
So in 2020, we were particularly pleased to reap the benefits of our efforts by welcoming more high schoolers onto the program than in previous years. It was great to see the achievements of our young coding superstars: Jack Fields (American Heritage School of Boca/Delray, FL), Jefferson Mao (Lambert High School, Suwanee, GA) and Nathan Halliday.
It’s always really pleasing to see that students enjoy working on our intern program so much that they return to complete additional projects. This year we had two returning students: Robert Kennedy (PhD, Computer Science, Florida Atlantic University) returned for the third year running and Vannel Zeufack (Masters, Computer Science, Kennesaw State University) returned to complete his second internship. Yash Mishra (Masters, Computer Science, Clemson University), who has been working alongside the team for the last 18 months via our Academic Program, joined the intern program to work on a specific cloud-related project as an extension to the work already completed.
Academic institutions around the world run their semesters to different schedules, and our global intern program is designed to be flexible enough to accommodate these differences. In 2020, we were delighted to welcome Matthias Murray from New College of Florida, where the Masters of Data Science program requires students to complete a 16-week practicum to graduate, and a high school student from the UK, where school terms run into July.
What you really want to know is what our Class of 2020 achieved during their internships. So here is a synopsis of their work with links to resources you can use to find out more and comments from the mentors who supported them.
Jack Fields
High School Student
American Heritage School, Florida
Jack was mentored by our LexisNexis Risk Solutions Group colleagues David DeHilster and Xiaoming Wang. Jack joined the team to work on a project that leverages HPCC Systems and machine learning to contribute to the progress being made on his school’s Autonomous Security Robot. Jack’s work involved updating the Robotics API, which was created in 2018 by a previous student from his school (Aramis Tanelus). In addition to this, he was tasked with using the HPCC Systems GNN bundle and TensorFlow to train a model to recognise known faces. This significant task is intended to enable the robot to identify potential security risks and threats from unknown campus visitors.
Jack entered our 2020 Technical Poster Contest and won our first ever Community Choice Award. His poster was chosen as the winning entry for this award by our open source community members during our Virtual Community Day Summit in October 2020.
Using the HPCC Systems Generalized Neural Network (GNN) Bundle with TensorFlow to Train a Model to Find Known Faces Leveraging the Robotics API
Use these resources to find out more about Jack’s work:
- Tech Talk Presentation, August 2020
- View his prize winning poster
- Community Day Presentation 2020
- Jack’s Project Blog Journal – HPCC ROS GNN
Mentor Comments from David DeHilster
Jack Fields is the second high school intern who has opened up new doors for HPCC Systems into the open source world of robotics. Jack built upon the previous work of Aramis Tanelus (also from American Heritage School) who wrote the code that makes it easy to take data from robotic sensors and ingest it into HPCC Systems.
Jack took this work a step further by processing the incoming data using machine learning, specifically by using the Generalized Neural Network (GNN) package with TensorFlow, now provided as part of an open source machine learning bundle. Jack’s work took images of people’s faces captured by a camera on a security robot and used HPCC Systems machine learning and neural networks to try to recognize each face as a friend or foe.
With Jack’s help, HPCC Systems continues to push open source robotics software into new areas including big data and machine learning.
Jefferson Mao
High School Student
Lambert High School, Georgia
Jeff was mentored by our LexisNexis Risk Solutions Group colleagues Xiaoming Wang and Godson Fortil to set up an HPCC Systems cluster on the Google Cloud Platform. He took some time to evaluate the new Google Anthos GKE platform and used our own regression tests, adding new ones where necessary, to run cloud-specific tests in areas such as scaling. He also tested our Helm charts, which were still under development at the time, contributing useful feedback that has helped us to improve how they work. Jeff heard about HPCC Systems because of our involvement in CodeDay in 2019 and 2020, during which students completed challenges set by our LexisNexis Risk Solutions Group colleagues.
Jeff entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Use Case Category.
HPCC Systems on the Google Cloud Platform
Use these resources to find out more about Jeff’s work:
- Tech Talk Presentation, August 2020
- View his prize winning poster
- Jeff’s Project Blog Journal
- HPCC Systems Blog Post
- HPCC Systems Anthos – GitHub Repository
- HPCC Systems Anthos Setup – GitHub Repository
Mentor Comments from Xiaoming Wang
Jeff’s Google Anthos project was very challenging, particularly for someone who is just starting out on their software development journey. It required lots of reading and had a steep learning curve. This project involved the type of research that would not be out of place at Masters degree level.
Jeff took on the challenge and was very self-motivated. He spent lots of time exploring and setting up an environment. With some help he created the HPCC Systems Anthos setup code, which you will find in the GitHub repositories listed in the resources above. The work completed was achieved across three major public cloud platforms: Google Cloud Platform (GCP), AWS and Azure.
Jeff’s work was a very valuable contribution to the HPCC Systems Multiple Cloud development and testing. Find out more about our HPCC Systems Cloud Native Platform using these resources.
Nathan Halliday
High School Student
Nathan was mentored by our LexisNexis Risk Solutions Group colleague Gavin Halliday. Most ECL users will be familiar with the idea of workunits, containing graphs and subgraphs that are used to execute the ECL queries. They are probably less familiar with workunits containing multiple workflow items. These workflow items are used to implement PERSIST, INDEPENDENT and the other workflow services. The job of determining the order to execute the workflow items, taking care of all the dependencies, belongs to the workflow engine. There is often potential for different workflow items to be executed in parallel (for example evaluating independent persists), but for simplicity the current engine executes one at a time. Nathan took on the challenge of improving this, allowing multiple workflow items to execute at the same time, potentially allowing workunits to run more quickly.
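To make the idea concrete, here is a minimal Python sketch of dependency-aware parallel scheduling. It is not the HPCC Systems workflow engine and does not reflect Nathan's implementation; the function name `run_workflow`, the item ids and the thread pool approach are all assumptions chosen for illustration. It only shows the general technique: an item is started as soon as everything it depends on has finished, so independent items (such as two independent persists) can run at the same time.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_workflow(items, deps, max_workers=4):
    """Run workflow items in parallel while respecting dependencies.

    items: dict mapping a (hypothetical) item id to a zero-argument callable.
    deps:  dict mapping an item id to the set of item ids it depends on.
    Returns the item ids in the order in which they were seen to finish.
    """
    # Outstanding dependencies for every item that has not been started yet.
    remaining = {item: set(deps.get(item, ())) for item in items}
    finished = []
    running = {}  # future -> item id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Start every item whose dependencies have all completed.
            ready = [item for item, pending in remaining.items() if not pending]
            for item in ready:
                del remaining[item]
                running[pool.submit(items[item])] = item

            # Nothing running and nothing ready: a cycle or an unknown dependency.
            if not running:
                raise ValueError("unresolvable dependencies: %s" % sorted(remaining))

            # Block until at least one running item completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                item = running.pop(future)
                future.result()  # re-raise any exception raised by the item
                finished.append(item)
                for pending in remaining.values():
                    pending.discard(item)
    return finished


# Two independent items may run concurrently; "final" waits for both.
log = []
items = {
    "persist_a": lambda: log.append("persist_a"),
    "persist_b": lambda: log.append("persist_b"),
    "final": lambda: log.append("final"),
}
deps = {"final": {"persist_a", "persist_b"}}
order = run_workflow(items, deps)
```

In this sketch the relative order of `persist_a` and `persist_b` is not fixed, but `final` always runs last. A real engine also has to deal with contention for shared resources between items running at the same time, which this sketch ignores.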
Nathan has now completed his high school education and has moved on to studying an undergraduate degree in Mathematics.
Nathan entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Platform Enhancement Category.
The Parallel Workflow Engine
Use these resources to find out more about Nathan’s work:
- Tech Talk Presentation, August 2020
- View his prize winning poster
- Nathan’s Project Blog Journal
- HPCC Systems Blog Post
Mentor Comments from Gavin Halliday
Nathan’s contributions included a large number of test cases (for more details, look at this pull request and also this pull request) and an implementation of a new workflow engine. The ability to execute multiple workflow items at the same time created situations that could not previously occur (for example, multiple graphs running at once and potential deadlock on resources), so many of Nathan’s other contributions fixed some of these issues (for more information, see this pull request about lock acquisition and this one about threading issues).
The code is merged into the 7.12.x branch of the HPCC Systems repository, but has not yet been enabled by default. There are still a couple of issues with running multiple graphs in ROXIE and Thor that need to be resolved first. Once those are addressed, there should be a significant number of Thor queries that will potentially run more quickly.
Matthias Murray
Masters, Data Science
New College of Florida
Matthias was mentored by our LexisNexis Risk Solutions Group colleagues Lili Xu, Arjuna Chala and Roger Dev. Matthias’s project involved reporting on the current status of NLP and applications of embeddings trained on SEC filings, while compiling and analyzing SEC filings and their intersection. His project also required the sorting and transformation of SEC data, creating a function to convert the data into a format required by the HPCC Systems Word Vectors ML bundle. Since he completed his project, Matthias has written and submitted a paper about his findings which has been accepted into the 4th IEEE International Workshop on Big Data for Financial News and Data, held in December 2020.
Applying HPCC Systems Word Vectors to SEC Filings
Use these resources to find out more about Matthias’s work:
- Tech Talk Presentation, September 2020
- View the poster submitted into our 2020 Poster Contest
- HPCC Systems Blog Post
- View the sources for this project in EDGAR Sec Filings stored in the HPCC Systems Platform GitHub Repository.
Mentor Comments from Lili Xu
Matthias’s work is the first intern project using our natural language processing bundle TextVectors to analyze public finance data. It opens the door for modeling finance problems with sentiment analysis in HPCC Systems. It’s also a great resource for students or researchers from different fields who want to apply NLP in their own projects.
Robert Kennedy
PhD, Computer Science
Florida Atlantic University
Robert joined the HPCC Systems intern program for the third time in 2020. During all three of his internships he has been mentored by our LexisNexis Risk Solutions Group colleagues Tim Humphrey and Roger Dev, with additional support from his university supervisor, Dr Taghi Khoshgoftaar, Motorola Professor at Florida Atlantic University.
Robert’s area of interest is deep learning and all the intern projects he has completed have focused on this area. He was among the first to use HPCC Systems with TensorFlow, completing a project in 2018 looking at Distributed Deep Learning on HPCC Systems with TensorFlow (Watch 2018 Community Day Presentation / View Slides / View Poster). In his second internship, Robert looked at Expanding Deep Neural Network Capabilities on HPCC Systems, which involved creating a new bundle providing GPU Accelerated Neural Network training features and tools (Watch 2019 Community Day Presentation / View Slides / View Poster / Project Blog Journal).
In 2020, Robert built on his previous work by implementing a Multi-node, Multi-GPU Accelerated Deep Learning Algorithm using the HPCC Systems GNN bundle. During his testing, Robert showed that GPU accelerated GNN training times are significantly faster than when using CPUs. His findings also showed that the degree of speedup depends on the size of the neural network and the overall size of the training set, confirming his assumption that the larger the neural network model, the greater the effect of the GPU acceleration. While using multiple GPUs has a communication cost, using fewer GPUs is still faster than using a greater number of CPUs.
Robert entered our 2020 Technical Poster Contest and won the prize in the Best Poster – Data Analytics Category.
Implement a Multi-node, Multi-GPU Accelerated Deep Learning Algorithm using GNN
Use these resources to find out more about Robert’s work:
- Tech Talk Presentation, September 2020
- View his prize winning poster
- Robert’s Project Blog Journal
- Community Day Presentation 2020
- View the sources of his work in Robert’s GNN-GPU GitHub Repository.
Mentor Comments from Tim Humphrey and Roger Dev
The work that Robert completed during the 2020 summer intern program will benefit anyone wanting to use GNN with a very large dataset/training set. His contribution means that training neural networks is 10 times faster when GPUs are available on the THOR nodes being used. This implementation makes it possible to use available GPUs when using GNN and is a valuable addition to HPCC Systems machine learning capabilities.
Vannel Zeufack
Masters, Computer Science
Kennesaw State University
Implement a Preprocessing Bundle for the HPCC Systems Machine Learning Library
Use these resources to find out more about Vannel’s work:
- Tech Talk Presentation, September 2020
- View the poster entered into our 2020 Technical Poster Contest
- Vannel’s Project Blog Journal
- View the sources of his work in Vannel’s HPCC Systems ML Preprocessor GitHub Repository
- View Vannel’s pull request to the HPCC Systems ML_Core bundle
Mentor Comments from Lili Xu, Arjuna Chala and Roger Dev
Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time on preprocessing. Data preprocessing is therefore the first, critical step along the modeling journey. Vannel’s project enhanced the current preprocessing capability of the HPCC Systems Machine Learning library. It makes data science projects smoother and more complete in HPCC Systems.
Yash Mishra
Masters, Computer Science
Clemson University
Leveraging and Evaluating Kubernetes support for HPCC Systems on Microsoft Azure
Use these resources to find out more about Yash’s work:
- View the poster entered into our 2020 Technical Poster Contest
- Community Day Presentation 2020
- Yash’s Project Blog Journal
- HPCC Systems Blog Post
Mentor Comments from Dan Camper
Yash’s project compared early releases of the HPCC Platform’s new cloud configuration with legacy configurations. This was quite challenging, because the cloud configuration was under active development at the time. His work outlined the many architectural changes between the two configurations and he briefly addressed how those changes could impact a runtime environment in terms of cost and performance. Yash’s project serves as the beginning of a comparison framework that could easily be expanded in different directions, such as how different types of workloads react in different environments.
Congratulations to our Class of 2020 Interns
Having read all about their work, I’m sure you get the very clear message about the value added to the HPCC Systems Open Source Project by our interns. Every single one of these projects makes a positive impact on our platform and community users. Our intern program runs for 12 weeks and I think it is all the more impressive that our interns achieve so much in what is really quite a short snapshot of time.
I’ve managed to get this far into the blog without even mentioning the COVID-19 Pandemic, which just goes to show that some things do manage to go on regardless! Our program was impacted slightly. We started the program a little later than usual and some students needed to finish earlier due to time constraints. All students worked remotely (find out more about that here), and we made adjustments as necessary to build in the need for flexibility. Even with all this going on, 7 students still completed internships with us this year and the quality of the work and the positive attitude of everyone involved was as strong as ever.
The fact that this is true is testimony to the strong work ethic and dedication of the students as well as the support of their mentors.
We thank our interns for their contribution to HPCC Systems in 2020 and everyone who played a part in making 2020 another extremely successful year for the HPCC Systems Intern Program.
The proposal application period for the 2021 HPCC Systems Intern Program is now open.
The deadline date for proposal applications is Friday 19th March 2021. Don’t leave it until the last minute! We do award places in advance of that deadline, so get started now!
Find out more about the program, how it works and how to apply by reading this blog or visiting our Student Wiki. To get ideas for a possible project, visit our Available Projects list, watch Tech Talk Webcasts featuring interns or look at the posters illustrating the project work completed.
Have a question about the program? Contact us.