2022 HPCC Systems Interns and Projects
The HPCC Systems Intern Program runs during the summer months with students joining throughout the months of May and early June. It’s a global program and in 2022, students from Europe, Asia and the USA joined the team, hence the rolling start dates which allow us to accommodate semesters in different geographical locations.
If you have been following our intern program over the years, you will know that students across the academic spectrum are eligible to take part and this year’s cohort spans the entire range from high school through to PhD. Here are some details about the group:
- Two high school students, five studying for a Bachelor’s degree, three studying for a Master’s degree and one completing a PhD
- Two students located in Europe (Finland and Italy), one located in India and eight located across the USA
- Two non-coding projects focusing on documentation and marketing (new category in 2022)
- Two NLP projects (another new category in 2022), three ML projects, three platform features/enhancements, one use case (using HPCC Systems ML)
This year’s program follows a well-established pattern, covering a wide range of abilities and project types, which is great to see. There is something for everyone here, so prepare to get excited about the contributions being produced by this hardworking group of young technologists.
At our weekly Chat and Share meetings, speakers are invited to come and present to students about a variety of topics to widen their exposure to office life. So far this year, interns have heard about our cloud native platform, the importance of regression testing, the ADAM program, employment opportunities, how to produce resources for our poster contest, and presentation skills (every student presents to the team at some point).
We had the leader of the HPCC Systems Platform Team, Richard Chapman (SVP and Head of Platform Engineering) kick off these Chat and Share sessions, so it felt fitting that our final meeting should feature the leader of the HPCC Systems Open Source initiative, Flavio Villanustre (SVP and Chief Information Security Officer, LexisNexis Risk Solutions Group).
This blog provides a snapshot of each intern and their project. Interns are encouraged to keep a blog journal, which means you can continue to follow the progress on projects that are of particular interest to you.
Vote for the Community Choice Winner at our 2022 Poster Contest
Look out later in the year for the posters interns will submit into our 2022 Poster Contest and remember, you get to vote for one of the prizes as an attendee of our 2022 Community Day Event in October. The Community Choice Award goes to the poster presenter who wins the vote of attendees on the day, so make sure you plan to be there and have your say!
Last year, our open source community voted 2021 HPCC Systems Intern Atreya Bain from RVCE in India as the Community Choice Award winner. If you want to learn more about his poster on Improvements to the HPCC Systems Structured Query Language, view his poster here.
Take a look at all of last year’s poster contest participants, which include entries from students working with our academic partners on HPCC Systems related projects.
HPCC Systems Intern Program 2023
The proposal period for the 2023 HPCC Systems Intern Program will open in the Fall. In the meantime, if you are a student thinking of applying or know someone you’d like to encourage to join, visit our list of available projects (new projects will be added soon) and read our blog about the program for more information.
Natural Language Processing Intern Projects
A category for NLP projects was newly added in 2022. David Dehilster (Consulting Software Engineer, LexisNexis Risk Solutions Group) is leading this initiative and it is certainly a passion of his. He wants to create Digital Human Readers for different languages that can understand text just as well as a human who speaks that language (read his blog on the subject here).
Students interested in contributing to this project far outnumbered the two places we had available in 2022. We hope that this interest will continue with more students contributing dictionaries, phrase parsers and sentiment analysis in the future.
Two students, Ananya Gupta and Lucas Wang, joined the 2022 HPCC Systems Intern Program to complete NLP projects mentored by David Dehilster.
Ananya Gupta
PhD Human-Centered Computing, Clemson University, USA
Nepali Wiktionary Initiative and Translation
Ananya Gupta joined the HPCC Systems Intern Program to contribute a dictionary for her native Nepali language. In her proposal, Ananya outlined how she would like to be a pioneer in bringing enthusiastic people together to create a Wiktionary that will mean Nepali is no longer an under-resourced language in comparison with, for example, English or Chinese. While there may be around 110,000 recognised words in Nepali, Wiktionary includes only around 16,705 words.
Ananya’s work involves developing an analyzer to look at Nepali text and incorporating this analyzer into Wiktionary for lookup and additional NLP analysis. This means adding additional words from Nepali dictionaries and developing a parser to extract, preprocess and clean the data before analysing it. Ananya is using HPCC Systems for the preprocessing. She needs to allow for multiple meanings of words and run tests for accuracy, which may include running the dictionary lookup analyzer using the HPCC Systems NLP++ Plugin on a set of Nepali texts.
Ananya has been generating additional interest in her project by issuing a press release calling for help in building the Nepali Wiktionary and has created the Nepali NLP Forum Facebook page, which is increasing its readership daily. She may well achieve her aim to be a pioneer for her native language in the world of NLP!
Find out more about this project by reading Ananya’s Blog Journal.
Lucas Wang
Bachelor of Electrical Engineering and Computer Science, University of California, Berkeley, USA
NLP++ Dictionary for the Chinese Language
Lucas Wang chose this project because of his curiosity about how the Chinese language can be represented and understood by a computer given that it is a language that uses characters. Chinese also has two-character phrases and idioms that may be represented by four or more characters, which have different meanings to the individual characters. As such, it is a unique challenge from the perspective of computer processing.
Lucas needs to filter the relevant knowledge base file for Chinese (downloaded from Wiktionary), settling on a single standard for reading one Chinese dialect. Using the chosen standard, he needs to create analyzers that understand each Chinese character and then parse them into the knowledge base. He also needs to write a ‘parts of speech’ tagger and create a knowledge base for these as well.
Find out more about this project by reading Lucas’s blog journal.
Non-Coding Intern Projects
This category was also new in 2022. We are always looking for ways to improve and extend our program, which, in past years, has focused on coding projects only. In 2022, we decided to provide a small selection of non-coding projects to attract students interested in technology who do not study a computer or data science related subject. Both projects were snapped up fast. We hope to provide more projects in this category in future years.
Two students, Amy Ma and Elizabeth Lorti, joined the 2022 HPCC Systems Intern Program to complete non-coding projects.
Amy Ma
Marjory Stoneman Douglas High School, Florida, USA
Document the HPCC Systems Data Patterns Functionality
Amy is a returning student to the HPCC Systems Intern Program, having completed a project in 2021 relating to Ingress Configuration for our Cloud Native Platform (find out more here). Amy returns as an 11th grader to work with our Documentation Team, and is being mentored by Jim DeFabia (Consulting Software Engineer, LexisNexis Risk Solutions Group). Data Patterns allows ECL developers to perform data profiling by inspecting datasets as part of the data discovery process. A report is produced that can be viewed in ECL Watch.
While we do already have some documentation available, there are three different ways to use Data Patterns. Amy’s work brings all available documentation together in one place. It will provide details on how to use Data Patterns within ECL Watch, using the ECL Standard Library and via the Data Patterns Bundle. Once completed, this new addition to our documentation suite will be available on our website.
Find out more about Amy’s project by reading her blog journal.
Elizabeth Lorti
Bachelor of International Development, King’s College, London, UK
Technology Branding and Marketing
In 2022, we are updating the HPCC Systems website and introducing a new logo. Having released our Cloud Native Platform, we will be making some changes on our website to direct users to all the new resources available and evaluating our social media presence. Elizabeth joins the team to assist with evaluating our brand and updating website content, including brochures, white papers, case studies and more.
She started her internship by carrying out a competitive analysis of other open source technology platforms to provide some fresh ideas for how we might update the look, feel and usability of our website. She has since moved on to analysing our social media strategy and YouTube presence, including evaluating how we may improve our use of tagging and hashtags. All this will provide valuable information to our Director of Marketing, Jessica Lorti, who is mentoring this project.
Find out more about Elizabeth’s project by reading her blog journal.
Causality 2022 Intern Projects
Work on the Causality Machine Learning project started in 2021 when three students joined the HPCC Systems Intern Program to work alongside Roger Dev (Senior Architect, LexisNexis Risk Solutions Group). To find out more about the Causality project and learn about the projects completed last year, read Roger’s Causality 2021 blog. During last year’s research, the team found that the most powerful and performant algorithms for Causality statistical methods tend to be based around Reproducing Kernel Hilbert Spaces, which Roger talks about in this blog.
The HPCC Systems Causality Toolkit is now available. You can find out more about how to use it here and the bundle is available in our Causality GitHub Repository.
This year, two students, Zheyu Shen and Arun Gaonkar joined the 2022 HPCC Systems Intern Program to contribute to the Causality Project, mentored by Roger Dev.
Zheyu Shen
Master of Data Science, Columbia University, USA
Causality Algorithm Development
Zheyu’s list of deliverables for this project involves designing and developing test cases for comparing causality tasks between different implementations and algorithms, to determine which ones are the most widely used and/or produce the best results.
He has been looking at a number of different causal analysis packages to determine which causality algorithms they use and how they have been implemented.
He also needs to assess the implementations currently available in our own Causality Toolkit against other public packages. This will allow Zheyu to use appropriate metrics to perform comparisons and explain the performance, with a view to implementing additional causality algorithms in the HPCC Systems ML Library.
He will be looking at how our toolkit performs against the following packages when carrying out the same causality tasks:
- DoWhy
- CausalML
- Causal Discovery Toolbox
- Causal Inference 360
Find out more about this project by reading Zheyu’s blog journal.
Arun Gaonkar
Master of Computer Science, North Carolina State University, USA
Applying the Causality Toolkit to Real World Datasets
Arun’s work involves testing out the HPCC Systems Causality Toolkit to evaluate how it performs on real world datasets with a view to verifying that the results achieved match expectations.
Completing this project requires him to create a causal model for the dataset, use the toolkit to verify the model and extract causal inferences from the model. He has spent some time evaluating and processing the datasets before analysing them and moving on to building the causal model.
Graphs have been generated, allowing Arun to assess the success of the model before applying the causal toolkit and evaluating the results.
Find out more about this project by reading Arun’s blog journal.
Other Machine Learning Projects
Arya Adesh
Bachelor of Computer Science and Engineering, RVCE, India
Local Outlier Factor Algorithm for Anomaly Detection in ECL
As well as providing project ideas for students to work on, we accept students on to the program every year who suggest a project of their own that leverages HPCC Systems in some way. Arya’s project falls into this category. His proposal outlined a piece of research he wanted to complete, focusing on implementing an unsupervised machine learning algorithm for anomaly detection.
His main task was to implement in ECL the Distributed Local Outlier Factor method for anomaly detection for use on HPCC Systems. Next steps involved testing the method on real world datasets and comparing the results with those from other anomaly detection algorithms, such as Isolation Forest and DBSCAN. His comparative analysis also included looking at the Python implementations for small datasets. The project also involves looking at scope for improvements to enhance the flexibility of his method for larger datasets.
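For readers unfamiliar with Local Outlier Factor, here is a minimal single-machine Python sketch of the standard LOF calculation. It is not Arya’s distributed ECL implementation; the function name `local_outlier_factor` and the sample points are hypothetical and chosen only for illustration. A score close to 1 means a point sits in a neighbourhood about as dense as those of its neighbours, while a score well above 1 suggests an outlier.

```python
import math


def local_outlier_factor(points, k):
    """Return one LOF score per point (pure-Python, O(n^2) illustration).

    Assumption: fewer than k + 1 points share the same coordinates,
    so the average reachability distance is never zero.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # Pairwise Euclidean distances between all points.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # For each point: its k nearest neighbours (excluding itself)
    # and the distance to the k-th nearest neighbour (the k-distance).
    neighbours = []
    k_distance = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / k))

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] / lrd[i] for j in neighbours[i]) / k for i in range(n)]


# Hypothetical sample: four points in a tight square plus one distant point.
sample_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
sample_scores = local_outlier_factor(sample_points, k=2)
```

In this sample, the distant point receives by far the highest score. A distributed version has to solve the harder problem of finding nearest neighbours across nodes, which this sketch does not attempt.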
Arya’s mentor is Lili Xu (Software Engineer III, LexisNexis Risk Solutions Group) and more information about his project is available in his blog journal.
Sarvesh Prabhu
Lambert High School, Georgia, USA
A Comparative Study of Neural Networks and Tree Based Deep Learning Methods in the Image Classification of Colorectal Medical Imagery
Sarvesh joined the 2022 HPCC Systems Intern Program as a junior in high school. His project was also his own suggestion, focusing on researching how to use HPCC Systems, ECL and our Machine Learning Library to provide an effective diagnosis and prognosis for colorectal cancer. The first task was to standardise the images and perform classification/feature extraction using the HPCC Systems GNN Bundle. The next stage is to use the HPCC Systems GNN library with Keras/TensorFlow to train a GNN model with two fully connected layers.
Following on from this, Sarvesh will use the convolutional autoencoders produced to extract image features, with final classification by a tree-based model such as Random Forest, Gradient Boosted Forest etc. Finally, he will conduct a comparative analysis by performing a series of tests looking at model performance, the complexity of hyperparameter tuning, the danger of overfitting, accuracy of prediction on a new dataset as well as implementation and maintenance.
Sarvesh’s mentor is Bob Foreman (Software Engineer Lead, LexisNexis Risk Solutions Group) and more information about his project is available in his blog journal.
HPCC Systems Platform New Features and Enhancement Projects
Jack Del Vecchio
Bachelor of Computer Engineering, Miami of Ohio University, USA
Interfacing MongoDB into ECL
We have a number of projects available for students to choose from with regard to adding support for additional embedded languages.
Jack chose MongoDB because he had been working on some projects in his own time using MongoDB, finding it simple and intuitive to use. He looked at the existing MySQL and Python plugins to get an idea of what might be required to create a similar plugin for MongoDB.
His project involves creating a number of functions for:
- Querying a non-sharded collection and returning the documents to the user
- Updating and removing documents from a collection using single threading
- Sorting the collection using a single threaded call
- Implementing MongoDB aggregations
- Sharding a collection
Jack also had to create a number of test cases for:
- Inserting various amounts and types of documents into the collections
- Aggregations (which allow for updating documents)
- Sharding a collection based on a range of the sharding key and for hashed sharding keys
- Multi-threaded inserts using a sharded collection
- Using the sharding strategies to create a way for a distributed Thor query to interact with individual shards through multiple threads
- Querying the sharded database using both single and multi-threaded methods
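To illustrate the difference between the range-based and hashed sharding strategies mentioned in the test cases above, here is a small conceptual Python sketch. It does not use the MongoDB API or Jack’s plugin, and it is not how MongoDB assigns chunks internally; the functions `range_shard` and `hashed_shard`, the split points and the choice of MD5 are all assumptions made purely for illustration.

```python
import bisect
import hashlib


def range_shard(key, split_points):
    """Pick a shard by key range.

    split_points is a sorted list of boundaries; each shard holds keys in
    [lower boundary, upper boundary), so nearby keys land on the same shard.
    """
    return bisect.bisect_right(split_points, key)


def hashed_shard(key, num_shards):
    """Pick a shard from a hash of the key.

    Hashing scatters nearby keys across shards, which spreads write load
    more evenly but makes range queries touch more shards.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical example: three shards, split at keys 10 and 20.
split_points = [10, 20]
by_range = {key: range_shard(key, split_points) for key in range(0, 30, 5)}
by_hash = {key: hashed_shard(key, 3) for key in range(0, 30, 5)}
```

With range-based placement, consecutive keys stay together, which suits range queries. With hashed placement, the same keys are spread across the shards, which is why both strategies are worth testing separately.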
This project also includes creating supporting documentation.
Jack’s mentor is Dan Camper (Enterprise/Lead Architect, LexisNexis Risk Solutions Group) and more information about his project is available in his blog journal.
Noah Seligson
Bachelor of Computer Science, University of Central Florida
Provide Test Code for Bundles with no Self Test
Noah’s project involves developing testing for the HPCC Systems Machine Learning Library files by creating test code with expected passing results.
These tests will be made available in our Overnight Build Test Suite (OBT), which means they will be tested automatically very regularly, removing the need for developers to set aside extra time to carry out manual tests.
The full list of available bundles is provided in the HPCC Systems ECL Bundle GitHub repository. Noah has focused specifically on the Machine Learning Bundles and the tests he has created are provided in the ecl folder for each of the following:
Noah’s mentor, Attila Vamos (Consulting Software Engineer, LexisNexis Risk Solutions Group), is clear about the value of this contribution to our open source platform. During his internship, Noah more than doubled the number of tests available, which is a significant contribution.
More information about Noah’s project is available in his blog journal.
Shivam Singhal
Master of Software Engineering, University of Oulu, Finland
ECL Code Document Generator Improvements
Shivam Singhal joined the program having already gained some experience of working with other open source technologies such as Mozilla and the Open Mainframe Project, where he worked as an intern on an analytical tool designed for the inspection and audit of data on a Blockchain ledger.
Shivam’s HPCC Systems intern project involves analyzing the weaknesses of the current system, recommending and implementing key improvements, providing testing, documentation and producing a supportable GitHub repository.
The existing documentation generator is written in Python, although there will be opportunities to get some exposure to the ECL language. Key tasks to be completed include:
- Performing a substantial transformation of the source data to create usable features
- Developing additional features through Shapelet Mining of the data series
- Applying Random Forest machine learning to classify cards as fraudulent/non-fraudulent
- Producing Test code demonstrating the correctness and performance of the algorithm
- Providing supporting documentation
Shivam’s mentor is Lili Xu (Software Engineer III, LexisNexis Risk Solutions Group) and more information about his project is available in his blog journal.
More to come on these projects
Some of our interns are already approaching their final weeks on the program which closes at the end of August 2022. There will be plenty of opportunities to hear more about individual projects in the coming months. Some interns will provide a blog that will be featured on our website, others will speak at our 2022 Community Day Summit in October.
All our interns are producing posters about their work which will be available alongside an abstract and 5 minute video presentation (see last year’s here).
Interns also present to the entire HPCC Systems team so we can all share in and celebrate their successes. The HPCC Systems Intern Program is a real team effort, from adding the project suggestions to our list right through to seeing and using the results.
We are all very proud of this program and the impressive results produced by the hardworking and extremely able students who join the team every year (see what previous interns have contributed here).