Interns Contributing to the HPCC Systems Cloud Native Platform
Students are accepted onto our intern program to work on a specific project. While we provide a list of projects, students often suggest a project of their own that leverages HPCC Systems in some way. Every year, several interns join the program to complete projects they have designed themselves. Each applicant scopes out their project, producing a proposal showing the required tasks and deliverables to be completed during the 12 weeks.
The HPCC Systems development team is currently focused on providing a cloud native version of our platform. Our interns have been contributing to and supporting this effort by working on a specific cloud-related project, testing the setup of an HPCC Systems cluster on a specific cloud platform, or planning to run their big data analytics in a cloud native environment and compare the results with our bare metal version.
Testing on different cloud platforms
Since our new platform will be cloud native, users will be able to choose which cloud platform they want to use from the many options available. The three platforms our interns have been using are the Google Cloud Platform, Microsoft Azure and the AWS Kubernetes Service.
Google Cloud Platform
Jefferson Mao (High School student, Lambert High School, Georgia) has been working with Xiaoming Wang to set up an HPCC Systems cluster on this cloud platform. While our platform development team is still working on significant aspects of this project, such as providing an easy way to get data onto a cluster and persist it, Jeff has been able to work through the steps involved in bringing up a cluster, including looking at the autoscaling of clusters.
He has also been evaluating the new Google Anthos GKE platform, which provides a single platform for the management of all Kubernetes workloads. Anthos GKE allows for the development, deployment, and operation of applications across various public cloud platforms, such as AWS (Amazon Web Services) and Microsoft Azure.
During his research, Jeff has been running some of our own test programs to see how they perform in comparison with our bare metal version. He has been running our regression tests and adding new ones where necessary for cloud-specific areas, such as scaling. He has also been testing our Helm charts, contributing useful feedback that has helped us to improve how they work.
I should add, at this point, that Jeff’s project was one he suggested himself, having learned about HPCC Systems through his involvement in CodeDay during the last year. Given our current development aims, his suggestion was a perfect fit, providing the team with an extra pair of hands and a fresh perspective. More information is available in Jeff’s blog journal.
Microsoft Azure Platform
Yash Mishra (Masters in Computer Science, Clemson University) has previously worked on a project exploring the cloud capabilities of the bare metal version of the HPCC Systems Platform. More information about this work can be found in this blog. During his internship, he has moved on to using our cloud native version with the Microsoft Azure Kubernetes Service.
So far, Yash has been able to do the following:
- Customise the Azure infrastructure components based on cost
- Look at the types and numbers of virtual machines that might be needed
- Evaluate the storage options available
- Assess how the HPCC Systems architecture sits alongside the Kubernetes deployment model
- Deploy different types of instances in different regions, discovering that cloud costs also vary by region
As you can see, he has spent time looking at the cost of running a cloud cluster, which is a significant area to investigate. As well as the cost implications shown above, he has verified, as expected, that storage costs and data transfer costs increase as the size of the data increases.
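To make the cost relationship above concrete, here is a minimal sketch of how such an estimate might be put together. It is not Yash's method or any cloud provider's pricing API. The region names, the prices and the `monthly_cost` helper are all hypothetical placeholders, chosen only to show that storage and transfer costs scale with data size and that the same cluster can cost more in one region than another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPrices:
    """Illustrative per-region prices. These are placeholders, not real quotes."""
    storage_per_gb_month: float  # cost of storing 1 GB for a month
    egress_per_gb: float         # cost of transferring 1 GB out
    vm_per_hour: float           # cost of one virtual machine for an hour


# Hypothetical regions and prices, for illustration only.
PRICES = {
    "region-a": RegionPrices(0.020, 0.08, 0.40),
    "region-b": RegionPrices(0.025, 0.10, 0.50),
}


def monthly_cost(region, data_gb, egress_gb, vm_count, hours=730):
    """Estimate a monthly bill as storage + data transfer + compute."""
    p = PRICES[region]
    storage = data_gb * p.storage_per_gb_month
    transfer = egress_gb * p.egress_per_gb
    compute = vm_count * hours * p.vm_per_hour
    return storage + transfer + compute


if __name__ == "__main__":
    for region in PRICES:
        for data_gb in (1_000, 10_000):
            cost = monthly_cost(region, data_gb, egress_gb=data_gb * 0.1, vm_count=4)
            print(f"{region}: {data_gb:>6} GB -> {cost:,.2f} per month")
```

A real estimate would replace the placeholder prices with figures from the provider's own pricing pages and would need to account for anything else that is billed, which this sketch leaves out.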
Yash intends to continue his research after he completes his internship, to include an assessment of doing some real work on a cloud cluster, once the storage and persist features have been added in a coming release. He is also working on a user guide which will provide a valuable resource for our users. More information is available in Yash’s blog journal.
AWS Kubernetes Service
Robert Kennedy (PhD Computer Science, Florida Atlantic University) has interned with us for three consecutive years, focusing on machine learning projects relating to neural networks. He has been using HPCC Systems with TensorFlow for some time now and has been spending some of his time looking at how the new cloud native version of our platform will provide additional benefits by being able to access GPUs.
These are the areas Robert has been looking at:
- The costings for this type of work because the latest generation of GPUs and the computers that host them are expensive. Being able to allocate them on demand in a cloud environment could provide significant cost savings.
- Using GPUs aligns better with scientific libraries like TensorFlow, and Robert wants to specifically test whether the Thor processes and their child Python processes on the CPU and GPU are terminated at the end of each workunit.
- Whether the underlying container properly sets up the additional components required to work with GPUs.
This is not the mainstay of Robert’s project, which you can read a little more about below. But since learning about our ongoing cloud native development, Robert is looking to provide a use case, which will be of great interest to our development team and anyone wanting to carry out big data analytics on our new cloud native platform.
As well as students joining the HPCC Systems Intern Program, we are also hearing about interns joining other LexisNexis Risk Solutions programs who are working on HPCC Systems related projects.
Lucas Varella (Bachelor of Information Systems, Federal University of Santa Catarina (UFSC)) is working with our LexisNexis Risk Solutions colleagues in Brazil.
He is also looking at using the cloud native version of our platform with a persistent volume. Lucas is setting up an NFS-type storage class so that HPCC Systems can be deployed and scaled on AWS.
Lucas’s internship continues well into 2021. During this time, he intends to extend the work he is doing using the AWS Kubernetes Service to include looking at other cloud platforms. The plan is to compare results between different cloud providers, as well as contributing to the work other students are doing, by looking at comparisons with similar jobs that have been executed on our bare metal platform.
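For readers less familiar with Kubernetes storage, the sketch below shows roughly what a request for shared, NFS-style storage looks like. It is an illustration only, not Lucas's configuration or the way our Helm charts define storage. The claim name `hpcc-data`, the storage class name `nfs-shared` and the size are hypothetical. The point of interest is the `ReadWriteMany` access mode, which lets pods on several nodes mount the same volume, something NFS-type storage supports.

```python
import json


def shared_volume_claim(name, storage_class, size):
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets many nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical storage class name; it must exist in the cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    claim = shared_volume_claim("hpcc-data", "nfs-shared", "100Gi")
    # Kubernetes accepts manifests written as JSON as well as YAML.
    print(json.dumps(claim, indent=2))
```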
Comparisons with HPCC Systems Bare Metal
For a number of our students, it has been important to get results from their project using our bare metal version to enable them to make good progress on their work. However, this does not mean they aren’t contributing to our cloud native development project. A number of interns are planning to run their research on our cloud native version, which will provide a useful comparison focusing on performance and may also provide an interesting analysis on cost implications.
Vannel Zeufack (Masters in Computer Science, Kennesaw State University) joins our intern program for the second year. This year, he is implementing a bundle that will make the data preprocessing phase of machine learning on HPCC Systems easier and faster. He plans to produce a tutorial to demonstrate how the different modules in the preprocessing bundle can be used together to easily prepare data for a machine learning project.
He plans to use our cloud native version to test the performance of each of the modules he is implementing.
Vannel plans to look at the following test cases:
- How the runtime of each module varies with an increasing number of records
- How the runtime of each module varies with the number of nodes for a fixed record set
- What users can expect in terms of thresholds and running times when using the bundle with specific configurations
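The first of these test cases can be sketched as a simple timing harness. This is a single-machine Python illustration of the idea, not Vannel's ECL bundle or his actual test plan. The `standardize` function is a hypothetical stand-in for a preprocessing module, and `time_module` is a helper invented for this example. Measuring the effect of node count would need a real cluster and is not covered here.

```python
import random
import time


def standardize(values):
    """Stand-in preprocessing step: rescale to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant input
    return [(v - mean) / std for v in values]


def time_module(func, record_counts, repeats=3, seed=0):
    """Return the best-of-`repeats` wall-clock time of `func` per record count."""
    rng = random.Random(seed)
    results = {}
    for n in record_counts:
        data = [rng.random() for _ in range(n)]
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results


if __name__ == "__main__":
    timings = time_module(standardize, [1_000, 10_000, 100_000])
    for n, seconds in timings.items():
        print(f"{n:>7} records: {seconds:.4f} s")
```

Taking the best of several repeats reduces noise from other activity on the machine. Recording the timings against the configuration used is what would eventually let users see the thresholds and running times mentioned in the last bullet.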
I have already mentioned the specific contribution Robert Kennedy (PhD in Computer Science, Florida Atlantic University) is making to our cloud native testing efforts. In addition to this, he also plans to test the results from his main project on the AWS Kubernetes Service. Robert’s project involves expanding the HPCC Systems GNN bundle to improve our GPU accelerated neural network training capabilities. By testing his results on both platforms, Robert will be able to provide us with a useful and interesting comparative study, as well as feeding into the usability side of getting up and running on our cloud native platform.
Robert has also been taking an interest in the work of one of our high school interns who has also been using the GNN bundle.
Jack Fields (High School Student, American Heritage School, Florida) has been using the GNN bundle with TensorFlow to train a model to recognise known faces. This is part of an ongoing robotics project at his school that is also supported by HPCC Systems. Jack and Robert have been talking about whether there may be a collaboration opportunity here, in terms of running some tests using Jack’s results. More information is available in Jack’s blog journal.
Thanks to our HPCC Systems Cloud Native Platform Early Adopters
All the work and research that these students are doing is incredibly useful to our development team.
As early adopters, they have been sharing their experiences and providing valuable feedback that is actively being used by our developers to improve the usability of our cloud native platform.
For their part, our interns have been learning a lot about the challenges and excitement of working on a new and evolving software development project.
‘It’s great to see our cloud-native system being put through its paces by these projects. The contributions from our 2020 interns will be a great help to us as we work towards a fully production-ready system in HPCC Systems Release 8.0.0.’
Richard Chapman, leader of the HPCC Systems Platform development team, VP and Head of Research and Development, LexisNexis Risk Solutions
Find out more about the HPCC Systems Cloud Native Platform
All the information you need to get started using our cloud native platform is available here. This resource contains the blogs the developers have written to help you set up a cluster. You will also find details about Helm charts and some interviews with developers talking about the ongoing development and what you can expect to see in versions coming soon.
Since development work is ongoing, we are constantly updating and adding to these resources. So keep checking back for more details. You can also see the list of changes included in our cloud native development project and bring any issues you find to the attention of the development team using our Community Issue Tracker.
The HPCC Systems Roadmap is available on the HPCC Systems GitHub Repo for those who would like to see the development themes for both versions of the HPCC Systems Platform.
Read this blog to learn more about all interns working on HPCC Systems related projects in 2020.