This year’s HPCC Systems Intern Program cohort is the biggest ever, totalling 12 students from across the academic spectrum. Here are some details about the overall profile of our 2021 intern cohort:
- One post graduate researcher
- One student completing a Masters degree in Computer Science
- Six undergraduate interns
- Four high schoolers
- Two returning students
- Nine entered our 2020 Poster Contest
- Nine students located across the USA
- Three students located in India
We are running a completely remote program again this year due to Covid-19, but in case you don’t know this already, our program always has a remote working element to it. Students located in the USA have the option to be office based if there is an office close by, but every year there are students who prefer to work remotely. The remote working option also allows us to accept students on to the program from other global locations. This year we welcome three students located in India. Find out more about working remotely on the HPCC Systems Intern Program.
Each student hand-picked their project and provided a proposal, using our guidelines, scoping out the tasks involved. We ask for a 12-week timeline in the proposal, which provides a clear path forward from the day they start their internship. Some of the projects were chosen from our available projects list and others were suggested by the students themselves. Our available projects list is currently under review, so if you want to find out about projects available for next year, the list will be available again in the Fall, when the proposal period opens for 2022 internships.
This blog and the Previously Completed Projects pages on our Student Wiki will give you a great idea of the sorts of projects on offer every year. Don’t forget that you can also suggest your own idea, but it must leverage HPCC Systems in some way. For details about the application process, read my blog Join the HPCC Systems Team as an Intern.
Our intern program has a rolling start to accommodate differences in school semester dates. Each student has at least one mentor with other team members providing guidance as needed. Let’s meet each student, hear a little about their project and find out what they have achieved so far.
Masters of Computer Science, Clemson University, USA
Using Azure Spot Instances
Roshan Bhandari is a member of Dr Amy Apon’s Clemson University team which carries out research on Big Data systems, focusing on optimizations and improvements at both the data and network layers. HPCC Systems is a sponsor of their Data Intensive Computing Lab and we have been collaborating with Dr Apon and her team since our Academic Program began. Roshan took a Cloud Computing course earlier in 2021, which supported his application to complete a cloud related project during his internship.
Roshan’s project involves creating utilities to estimate the money saved by running HPCC Systems on spot instances in different regions of the Azure cloud, as well as developing scripts to automate cluster formation in Azure Kubernetes Service (AKS). Another important task involves developing strategies for job recovery and cleanup after an eviction notice is received.
Roshan has developed scripts to collect prices using both the CLI and the API. He has also developed an API to return the cheapest region and price for a given instance size. His project also involved developing scripts to automate HPCC Systems installation on an AKS cluster/Azure Spot instance, and more scripts to stream an eviction notice out of the spot instance/AKS cluster. Since minimizing the cost of cloud infrastructure is important for all companies, this project provides us with valuable insights to share with our users as they adopt HPCC Systems Cloud Native.
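To make the idea concrete, here is a minimal sketch of the kind of utility described above: given spot-price records (as might be returned by a pricing API such as Azure’s Retail Prices API), find the cheapest region for a given instance size. The record layout below is a simplified assumption for illustration, not the actual API schema or Roshan’s implementation.

```python
# Hypothetical helper: pick the cheapest region for an instance size from a
# list of price records. Field names ("sku", "region", "price") are assumed.

def cheapest_region(records, instance_size):
    """Return (region, price) of the lowest-priced spot offer for a size."""
    matches = [r for r in records if r["sku"] == instance_size]
    if not matches:
        raise ValueError(f"no prices found for {instance_size}")
    best = min(matches, key=lambda r: r["price"])
    return best["region"], best["price"]

# Hypothetical sample data, standing in for a live API response.
sample = [
    {"sku": "Standard_D4s_v3", "region": "eastus",  "price": 0.046},
    {"sku": "Standard_D4s_v3", "region": "westus2", "price": 0.039},
    {"sku": "Standard_E8s_v3", "region": "eastus",  "price": 0.091},
]

print(cheapest_region(sample, "Standard_D4s_v3"))  # ('westus2', 0.039)
```

In a real utility, the sample list would be replaced by records fetched from the pricing API, refreshed regularly since spot prices fluctuate.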
Roshan has completed his internship and will be providing a blog to share his work with our open source community, but until then, his blog journal provides insights into his internship project and experience.
Data Scientist, North Carolina State University, USA
Ingestion and Analysis of Collegiate Women’s Basketball GPS Data in HPCC Systems and RealBI
Chris Connelly is a returning student, having completed an internship with us in 2019. His previous intern project involved collecting GPS data from the NCSU men’s and women’s soccer teams, processing the data and analysing it on HPCC Systems to provide insights into how the athletes can maintain and improve their fitness and technique. Find out more about this project by reading Chris’s blog journal and listening to him speak about his work at our 2019 Community Day Summit (Watch Recording / View Slides). His role as a Sport Scientist involves working with athlete performance and wellbeing data, analysing it to help student athletes improve their performance.
His project this year focuses on bringing the women’s basketball team data into HPCC Systems, cleaning and analyzing it, and then using Real BI to visualize the data in a dashboard with tabs showing a variety of results.
The process of bringing the data in and running the analysis is all prepared, and the next step is to make his scripts work on a Kubernetes cluster. In the coming weeks, Chris will be working on the code to complete the different types of analytics, which will allow him to move on to working on the dashboard report tabs in Real BI. For more details on how this project is going, read his blog journal.
Chris’s project is a great use case for the new HPCC Systems Cloud Native platform, putting him in the perfect position to become an early adopter and provide valuable feedback to the development team. Find out more about the HPCC Systems Cloud Native Platform in a series of blogs and videos located on our Cloud Native Wiki Page as well as our Helm Chart GitHub Repository.
Bachelor of Computer Science and Engineering, RV College of Engineering, India
Improvements of the HPCC Systems Structured Query Language (HSQL)
Atreya Bain joins our Intern Program for the first time in 2021; however, he has been working on the HSQL project for some time as part of our academic collaboration with Dr Shobha’s team at Rashtreeya Vidyalaya College of Engineering, Bengaluru, India. The purpose of the project is to implement a new SQL-like language that simplifies usage of the HPCC Systems Platform. It is designed to work in conjunction with ECL and for use with general purpose analytics and basic machine learning workflows. View the poster he entered into our 2020 Poster Contest.
During his internship, Atreya will be improving the HSQL project for general usage. The overall goal is to:
- Define an initial syntax set for HSQL
- Provide a working compiler that can convert HSQL to ECL
- Provide a VSCode extension for use with HSQL
Atreya has been working on the HSQL compiler and syntax, adding an additional stage in the compiler for semantic analysis. He has also reconfigured the compiler to be more than just a CLI tool. It can now be used as part of an IDE plugin, or compilation server. Recently, he added support for assignments, imports and outputs, and is currently implementing the SELECT query. A basic version of the VSCode extension which provides syntax highlighting, syntax checking and compiling using the mentioned compiler has also been implemented.
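To illustrate the shape of the HSQL-to-ECL idea, here is a toy translator for one hypothetical query form. Both the input syntax and the generated ECL here are simplified assumptions for illustration; the real compiler uses proper parsing and a semantic-analysis stage rather than a regex.

```python
# Toy sketch: translate a simplified, hypothetical HSQL SELECT into ECL.
# Not actual HSQL grammar or the project's generated code.
import re

def hsql_to_ecl(query):
    m = re.fullmatch(r"SELECT\s+(.+)\s+FROM\s+(\w+);?", query.strip(), re.I)
    if not m:
        raise SyntaxError("unsupported query")
    cols, table = m.group(1), m.group(2)
    fields = ", ".join(c.strip() for c in cols.split(","))
    # Project the requested fields from the dataset and output the result.
    return f"OUTPUT(TABLE({table}, {{{fields}}}));"

print(hsql_to_ecl("SELECT name, age FROM people"))
# OUTPUT(TABLE(people, {name, age}));
```

The value of the extra semantic-analysis stage Atreya added is exactly what this sketch lacks: checking that `people`, `name` and `age` actually exist and have sensible types before emitting ECL.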
In the coming weeks, he will move on to add support for visualizations, procedures/functions, creating and using Machine Learning Models (which in this version can be extended) and once this is ready for broader testing, work will be needed to provide features that are useful to data analysts. Keep up with progress on this project by reading his blog journal.
Bachelor of Software Engineering, Green River College, USA
COVID-19 Tracker and Global Map Improvement using ECL Watch
Eleanor heard about us via CodeDay in 2020 and worked on some improvements to our COVID-19 Tracker with Arjuna Chala, who is mentoring her during her 2021 internship.
Eleanor’s project involves adding more features to the global map capability in our Covid-19 Tracker. These features are based around travel data; specifically, data that can be used to help a user or group plan their travel from home to the airport and then on to their destination.
Given a source and destination of travel (land or air), users will be able to:
- Calculate risk of the travel
- Report the social distancing guidelines in every location
- Create a brand-new map to show the entire journey interactively
Eleanor has been keeping a blog journal about her progress which is available here.
The COVID-19 tracker provides important new information that can potentially have a positive impact not only on the decisions taken by individuals, but also by those at local, state and country level as the changing situation presented by the pandemic is assessed. More information about our COVID-19 tracker, including a link to the tracker itself, can be found in this blog by Roger Dev.
Find out more about CodeDay by reading this blog about our involvement in the Winter 2021 event.
Bachelor of Arts in Computer Science and Chicano Studies, University of California, Berkeley
Implement a PMML Processor
Alexander also found out about HPCC Systems via CodeDay, which he has taken part in as a participant, volunteer helper and organiser. He also submitted a poster to the SIGCSE Technical Symposium 2021 on the subject of Closing the Gap Between Classrooms and Industry with Open Source Internships. See the poster and find out more about our CodeDay interns in this blog.
Alexander’s intern project is in the field of machine learning. He will be implementing a Predictive Model Markup Language (PMML) Processor using ECL and providing a user friendly interface.
The PMML to ECL (and back) project is developing rapidly and Alexander has already made a lot of progress. So far, the converter works both ways for simple (and multiple) Linear Regression machine learning models. The converter takes in a .pmml/.xml file and returns a .ecl file containing the code needed to make predictions. Conversely, the converter also takes in a .ecl file and compiles it, turning it into a PMML model in the process.
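For readers unfamiliar with PMML, here is a minimal sketch of the reading half of such a converter: parse a linear-regression PMML document and evaluate a prediction. The tiny document below is hand-written for illustration and omits the PMML namespace and many required elements; real PMML files (and Alexander’s converter) are richer than this.

```python
# Minimal PMML linear-regression reader (illustrative only).
import xml.etree.ElementTree as ET

PMML = """
<PMML version="4.4">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def predict(pmml_text, inputs):
    """Evaluate y = intercept + sum(coefficient * input) from the model."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//RegressionTable")
    y = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        y += float(pred.get("coefficient")) * inputs[pred.get("name")]
    return y

print(predict(PMML, {"x1": 3.0, "x2": 2.0}))  # 1.5 + 6.0 - 1.0 = 6.5
```

A converter like Alexander’s would emit ECL that performs this same intercept-plus-weighted-sum calculation, rather than evaluating it in Python.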
Alexander is working on making it easier for users to convert files and providing support for other algorithms, such as Logistic Regression, Random Forests, Neural Networks, etc. Find out more about Alex’s work in his blog journal.
High School Interns
Our 2021 intern program welcomes the largest group of high school interns since the program started in 2015, including our youngest ever intern, Amy Ma, who is 15 years old.
We hope the experience of our high school interns encourages others to have the confidence to put themselves forward to complete projects with us in future years and perhaps, at some point, we will welcome back our high school achievers as Undergraduates, Masters or PhD students. In fact, we do have a returning high school intern this year. Jefferson Mao, joins the program for the second time having completed a project with us in 2020.
We provide project ideas to suit students from high school through to PhD, who are considering a future career in a technology related profession such as software development or data science. As I mentioned earlier, our list is currently under review and will be available later in the year when we open the proposal period for 2022.
Our 2021 high school interns are doing some great work on a variety of projects involving our new Cloud Native platform and also our Machine Learning Library.
Marjory Stoneman Douglas High School, Florida, USA
Amy’s initial aim was to get up and running by completing some HPCC Systems training. This involved learning about the new HPCC Systems cloud infrastructure, Kubernetes and Ingress. After installing Kubernetes on her local Docker Desktop, deploying Ingress on Kubernetes, deploying HPCC Systems locally and running ECL Watch successfully, she began to use the Azure cloud.
Having completed the initial training and preliminary tasks, Amy is carrying out more Ingress exercises on Azure, which involves looking at Ingress features such as basic Ingress functions, configuring TLS encryption on Azure and authentication with a username and password, and traffic splitting. One of her tasks is to deploy an nginx controller on Azure and look at more nginx controller features through annotations and configMap.
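As a rough illustration of what these exercises involve, here is a sketch of an Ingress resource combining TLS and basic authentication for the nginx controller. All names, hosts and secrets below are hypothetical placeholders, not taken from Amy’s actual configuration (though 8010 is the usual ECL Watch port).

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: eclwatch-ingress
  annotations:
    # Username/password auth via a pre-created secret (nginx controller feature)
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: basic-auth
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - hpcc.example.com
      secretName: hpcc-tls    # TLS certificate stored as a Kubernetes secret
  rules:
    - host: hpcc.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: eclwatch
                port:
                  number: 8010
```

Traffic splitting, the other feature mentioned, is typically handled through a separate set of canary annotations on a second Ingress resource.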
Next steps involve preparing a spreadsheet of results for all Ingress controllers tested and routing patterns, and adding some example files to the HPCC Systems GitHub Repository, including bash scripts and ECL test code. After testing HPCC Systems functionality, she will identify any necessary changes and add-ons needed for the HPCC Systems platform to support Ingress usage. She will then write some guidelines for using Ingress in an HPCC Systems service. For more information on this project, read Amy’s blog journal.
This project is one of many Cloud specific intern projects in 2021 that support our most recent development focus to provide a Cloud Native HPCC Systems Platform. There are a number of blogs you can read about our new Cloud Native Platform, but to get started, I suggest reading this introduction to our Cloud journey by the leader of the platform development team, Richard Chapman. For more information, visit our Cloud Native Wiki Page which includes links to a number of blogs, videos and GitHub Repository resources.
American Heritage School of Boca Delray, Florida, USA
Processing Robotics Data with an HPCC Systems Cluster on Kubernetes
Carina Wang is a member of the Stallion Robotics Team 5472, run by Tai Donovan (Robotics Program Director and Instructor) at her school. The team are working on an autonomous security robot that can recognise potential risks on a school campus that might otherwise be missed by the human eye. The goal is to allow the robot to use facial recognition to identify known and unknown campus visitors. Carina’s project involves extending the work completed by her team mate Jack Fields, who joined our intern program in his final school year in 2020.
The first step of her internship was to complete some training, which involved reading through relevant chapters of Hands-On Machine Learning by Aurélien Géron, the TensorFlow 2 Quickstart guide and the TensorFlow Convolutional Neural Network Tutorial, while simultaneously running the corresponding code in Jupyter.
Carina followed the Containerized HPCC Systems Platform documentation, GNN tutorial and the VS Code ECL Plug-In installation guidelines to prepare the local environment to be used during her internship.
Carina’s internship requires her to become familiar with relevant platforms and programs, running sample code, preparing a database of photos and training the facial recognition model using the HPCC Systems Cloud Native Platform and Machine Learning Library. Each picture will need to be tagged, and the aim is that the model will accurately identify students attending the school. The facial recognition model needs to be compatible and work in conjunction with the devices mounted on the security robot. Touchscreens, Alexa Voice Commands, Virtual Reality environments, etc. will be among the possible mechanisms used to display information to the robot user.
Carina would like to implement a solution that allows a student to walk up to the robot and retrieve information as part of a larger, interactive security feature. This involves using data from an augmented image set to train a GNN model on Azure using the HPCC Systems Cloud Native Platform to classify the images. For more information on this project, read Carina’s blog journal.
Northview High School, Georgia, USA
Apply Docker Image Build and Kubernetes Security Principles
During the first few weeks of her internship, Nikita spent a lot of her time on the Docker security aspects of her project. Specifically, she was focusing on providing an option for enabling or disabling the cache, depending on the images and files that are being built (since some types of builds benefit from a stored cache while others do not). She has also been running the HPCC Systems Docker images through different vulnerability scanners (Trivy, Docker Scan, and Anchore) to evaluate any security threats and possible solutions.
She has also been investigating user privileges in the HPCC Systems images to ensure the ‘least privilege’ best practice principle is enforced. This is important in reducing the risk of attackers gaining access to sensitive information and critical systems due to a compromised account or device.
In recent weeks, Nikita has been thoroughly researching the detrimental effects of using the ‘latest’ tag, which can have unwanted effects such as the overwriting of Docker image layers and the inability to reliably identify and retrieve a specific image. She has prepared a report for the team showing the results of her research.
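The two practices Nikita is investigating can be illustrated in a short Dockerfile fragment. This is a generic example, not taken from the actual HPCC Systems images; the image name and digest are placeholders.

```dockerfile
# Risky: 'latest' is mutable, so builds are not reproducible and a newly
# pushed image can silently replace the one you tested against.
# FROM ubuntu:latest

# Better: pin an explicit tag, or an immutable digest.
FROM ubuntu:20.04
# FROM ubuntu@sha256:<digest>

# Least privilege: create and switch to a non-root user, so a compromised
# container process is not running as root.
RUN useradd --create-home appuser
USER appuser
```

Pinning by digest gives the strongest guarantee, since even a re-pushed tag cannot change which image layers the build uses.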
In the coming weeks, Nikita plans to work on the Kubernetes side of the project, focusing first on developing a thorough understanding of the Kubernetes environment and all of the security best practices for the technology. Then she plans to move on to work on securing the HPCC Kubernetes environment on Azure based on those best practices. To keep up to date with the latest progress on this project, read Nikita’s blog journal.
Lambert High School, Georgia, USA
Toxicity Detection Platform Integrated with HPCC Systems Cloud and GitOps
Jefferson Mao is a returning student to the HPCC Systems Intern Program. Last year he worked on a Cloud related project which involved evaluating the HPCC Systems Cloud Native Platform on the Google Cloud. This year, Jefferson suggested two projects which have merged into one combined project. His Toxicity Detection Platform idea stems from his gaming hobby. Interactions via chat rooms, social media and other channels can sometimes become bullying, abusive or threatening. He aims to provide a way to score interactions to help identify the worst abuses, so these channels can do something about them and provide a safer place for people to gather online. The GitOps side of his project provides the team with a real test case for the creation and deployment of HPCC Systems clusters through Arc and GitOps, using a management layer that provides a number of beneficial functions such as central versioning and continuous integration and delivery.
Jefferson’s starting point has been to create a proof of concept via Recurrent Neural Network (RNN) using Keras and Tensorflow. He is using GloVe word embeddings provided by the Natural Language Processing Group at Stanford, which serves as a major benefit to the Toxicity Detection Model, because word embeddings are learned representations of words that are displayed as mathematical vectors. As the toxicity detection model will only be dealing with text/natural language, having GloVe word embeddings simplifies and improves the accuracy of toxicity detection.
Word embeddings combined with the datasets provided by Kaggle will ultimately result in a highly accurate RNN model. However, Jefferson has discovered that RNN models have a significant flaw: they are plagued by the Vanishing Gradient Problem and the Exploding Gradient Problem. These arise during backpropagation through time, where the error gradient is carried back across every timestep of the sequence, and the culprit is the recurrent weight matrix “Wrec” (Weight recurring). Without getting into the math, Jefferson has provided the following explanations:
If the value of Wrec is greater than 1, the gradient becomes exponentially large. Whereas if it is smaller than 1, the gradient gradually disappears. Hence the vanishing gradient problem.
Exploding Gradient Problem Solutions
- Stop backpropagation at some point
- Artificially reduce the gradient
- Put a limit on a gradient
Vanishing Gradient Problem Solutions
- Initialize weights so the vanishing gradient effect is minimized
- Echo State Networks
- Long Short Term Memory Networks (LSTM)
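The effect Jefferson describes can be shown with a few lines of code: in backpropagation through time the gradient is (roughly) multiplied by the recurrent weight once per timestep, so it shrinks toward zero when |Wrec| < 1 and blows up when |Wrec| > 1. Real RNNs use matrices rather than scalars, but a scalar weight shows the same behaviour; this is a generic illustration, not Jefferson’s code.

```python
# Simulate how a gradient scales across timesteps when repeatedly multiplied
# by a (scalar) recurrent weight w_rec.
def gradient_after(steps, w_rec, grad=1.0):
    for _ in range(steps):
        grad *= w_rec
    return grad

print(gradient_after(50, 0.9))   # about 0.005 -> vanishing
print(gradient_after(50, 1.1))   # about 117   -> exploding
```

LSTMs avoid this by routing the gradient through an additive cell state with gating, rather than through repeated multiplication by Wrec.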
In Jefferson’s case, he is encountering the Vanishing Gradient Problem and has decided to solve the issue by implementing a Bidirectional LSTM. As well as solving the issue, this also increases accuracy, since the bidirectional pass lets the model use context from both earlier and later in the sequence.
Jefferson has already made great progress to the point where he almost has a POC ready to show, and you can find out more by reading his blog journal.
Causality Machine Learning Project Team
There is a new project currently underway spearheaded by Roger Dev, Senior Architect, LexisNexis Risk Solutions Group. Roger is the leader of our Machine Learning Library Project. The focus for 2021 is Causality and the aim is to develop a full-bodied Causality Toolkit for HPCC Systems. It is a large project with many moving parts, so three students accepted on to the HPCC Systems Intern Program have joined the project and each student is working on a specific area.
If you are interested in learning more about this ongoing development project, you can read more about it in the Causality 2021 blog by Roger Dev, who has also written a technical piece about Reproducing Kernel Hilbert Spaces (RKHS), which form the basis of the most powerful and performant algorithms in this area.
Bachelor of Computer Science and Engineering, RV College of Engineering, India
Causality – Probabilities and Conditional Probabilities
Achinthya’s main goal is to improve the conditional probability calculation for the Causality Project. So far he has implemented RKHS embedding of conditional probabilities. The results are promising as they provide higher accuracy than the existing methods.
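For context, the standard kernel (conditional mean) embedding that this kind of estimator builds on can be sketched as follows. This is the textbook formulation, not necessarily the project’s exact notation or method:

```latex
\hat{\mu}_{Y \mid X = x} \;=\; \sum_{i=1}^{n} \beta_i(x)\, \phi(y_i),
\qquad
\beta(x) \;=\; \left( K_X + n\lambda I \right)^{-1} k_X(x)
```

Here $K_X$ is the kernel Gram matrix over the training inputs $x_1,\dots,x_n$, $k_X(x)$ is the vector of kernel evaluations $k(x_i, x)$, $\phi$ is the feature map for the output kernel, and $\lambda$ is a regularisation parameter. The embedding represents the conditional distribution $P(Y \mid X = x)$ as a point in the RKHS, from which conditional expectations can be computed.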
Going forward, Achinthya will be working on speeding up the RKHS calculations using the dual-tree approximation method. Then he will need to figure out how to condition on multiple variables (since so far it has all been P(Y|X)). The next stage is to integrate this method into the existing probability calculation bundle and collaborate with Mayank Agarwal and Mara Hubelbank to see how they can benefit from each other’s work. The final stage involves testing the methods on a live dataset.
Achinthya writes about his progress in more detail in his blog journal.
Bachelor of Computer Science and Engineering, RV College of Engineering, India
Causality – Independence, Conditional Independence and Directionality
Mayank’s first task was to get a good understanding of RKHS and its implementation in Python, which involved reading relevant papers, interpreting their experiments and completing a short project to work on to reinforce learning points.
Having completed some working code that implemented RKHS, Mayank experimented with various kernels, while trying to understand them by reading research papers. The next step was to implement Conditional Independence using RKHS kernels and the Hilbert-Schmidt Norm in Python, and to improve it to fit the needs of the project. Unfortunately, the results from using this technique to compute Conditional Independence were unsatisfactory, and implementing conditional independence code that conditioned on two variables simultaneously was extremely difficult. So a decision was taken to shift focus to RCoT, which is another method of calculating Conditional Independence. The reference paper on RCoT provided an implementation in R, so Mayank had to convert the code into Python while also understanding the Random Fourier Features concepts.
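The Random Fourier Features idea mentioned above can be sketched in a few lines: random cosine features z(x) whose inner product approximates an RBF kernel, here k(x, y) = exp(-(x - y)²/2). This is the generic textbook construction, not the project’s actual RCoT implementation.

```python
# Random Fourier Features approximation of the RBF kernel (1-D inputs).
import math
import random

random.seed(0)
D = 4000  # number of random features; more features -> better approximation
W = [random.gauss(0.0, 1.0) for _ in range(D)]            # frequencies ~ N(0, 1)
B = [random.uniform(0.0, 2 * math.pi) for _ in range(D)]  # phases ~ U[0, 2*pi]

def features(x):
    """Map a scalar x to its D-dimensional random cosine feature vector."""
    scale = math.sqrt(2.0 / D)
    return [scale * math.cos(w * x + b) for w, b in zip(W, B)]

def approx_kernel(x, y):
    """Inner product of feature vectors approximates exp(-(x - y)**2 / 2)."""
    return sum(a * b for a, b in zip(features(x), features(y)))

x, y = 0.3, 1.1
exact = math.exp(-((x - y) ** 2) / 2.0)
print(exact, approx_kernel(x, y))  # the two values should be close
```

The payoff is speed: tests that would otherwise require large kernel matrices can work with the much smaller explicit feature vectors, which is the efficiency gain RCoT exploits.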
Currently, Mayank is working on the RCoT code and fixing any bugs as he goes along. As soon as the code is completed, he will be ready to start testing the RCoT code and comparing the outputs with the results it should match. To ensure that the code is tested properly, Mayank is also working on the R version of the RCoT code, which will allow him to compare the outputs and be sure the results are up to the required standard. The next phase involves working on the Lindsay-Pilla-Basak approximation, which is applied to the Kernel Conditional Independence Test to achieve RCoT and make the code more efficient.
Next comes the implementation of the RIT code, which is used for unconditional independence testing. The RIT code also uses Random Fourier Features and a kernel trick to minimize evaluation time and increase efficiency. This will be tested on synthetic data and checked against the expected results. The code will then be integrated with the independence testing and probability work, and will be ready for testing on real world data, taking efficiency and accuracy into account. Mayank shares more information about this project in his blog journal.
Bachelor of Computer Science, Northeastern University
Counterfactual and Interventional Layers
Mara has been working with Roger Dev to design and implement enhancements to the HPCC Systems Causality Framework, focusing particularly on developing a module for counterfactual queries. The research phase involved reading up on the theory and implementation of causal inference in statistics and machine learning. Currently, she is working on readability and structural improvements, including adding documentation to existing features. Mara has also been experimenting with Roger’s algorithms for probabilistic, causal, and interventional queries and has begun to implement counterfactual algorithms and create tests for the module using synthetic data.
There is plenty more to come in this area. In the coming weeks, Mara will be finishing the implementation of the counterfactual module and testing it using synthetic data. She will also be researching and processing real-world observational datasets to be used in testing the module, and enhancing the capabilities of the intervention tool by designing a feature for additive intervention. She will also update the user documentation to include these structural and functional improvements before her internship ends. Mara provides more detail about her work in her blog journal, which also includes a bibliography of her research.
Intern Posters and Presentations
Students who join the HPCC Systems Intern Program are required to present their work to the team. As these recordings become available, we will share more about the projects and provide links to their presentations, giving you the opportunity to hear them talk about their contribution to our open source project in their own words. Some of the students may also produce blog posts to share final results and achievements.
Students are also required to prepare a poster illustrating their work, which they can then enter into our 2021 Poster Contest. They also supply an abstract and a 5-minute video for the judges. All these resources are made available to our open source community during our Community Day event, after the judging of the following awards has completed:
- Best platform enhancement
- Best use case
- Best data analytics
- Best research project
A voting feature at our Community Day event allows attendees to vote for their favourite poster, resulting in the presentation of our Community Choice Award to the winner based on votes cast on the day. Winners of all awards will be announced during our 2021 Virtual Community Day event.
Take a look at our 2020 Poster Contest to get an idea of the range of projects and the high standard of work completed by our interns.
You don’t have to be an HPCC Systems intern to enter our poster contest; students working on a project that leverages HPCC Systems in some way may be eligible to enter. To find out more, read the contest rules or contact us if you have questions.