Sarthak Jain joined the HPCC Systems platform team as a student contributor during our involvement with the Google Summer of Code program in 2015. He added new statistics to the HPCC Systems Linear and Logistic Regression machine learning module. He returned to the team as part of the 2016 HPCC Systems summer intern program, again working on a machine learning related project, implementing Latent Semantic Analysis (LSA).
The LSA project was suggested to us as a useful addition to our machine learning library by our users. We prepared a project description which Sarthak studied and thought about, producing an excellent proposal. This is how the HPCC Systems intern program works. Students select a project in which they are interested from the list of available projects and produce a proposal showing how they would complete it.
So what was so great about Sarthak’s proposal? He showed clearly what the deliverables would be in terms of the tasks required to complete the implementation, including testing and any documentation required. He supplied a timeline showing what he would complete in each of the 10 weeks of the internship, adding, where relevant, information about possible challenges, the calculations to be used and links to any documents relevant to the implementation. In addition, he included a wish list of other tasks he thought would be great extensions to the project, which could be done either during the internship if time allowed, or later. This showed his awareness and understanding of the project and the development process. He also demonstrated a real commitment to the successful conclusion of the project. Our student wiki includes some useful information about how to prepare a great proposal and stand out from the crowd.
Latent Semantic Analysis is used to analyse documents to find the underlying meaning or concepts. Since languages like English contain many ambiguities, it can be difficult to draw conclusions without taking into account, for example, synonyms and words with multiple meanings and context. Writing styles can also differ between authors who may use different words to describe the same thing. LSA helps to filter out some of the ‘noise’ that can be introduced by the random choice of words used or those words which may be insignificant to the underlying concept. When searching on Google for a book title on a specified subject, it’s not terribly useful to see results where the only word in common between the matching titles identified is the word ‘the’ or ‘and’. But it is useful to filter out results that don’t match the concept you have specified, so if you are looking for results on a financial topic, you probably want to ignore hits referring to river banks!
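To make the idea a little more concrete, here is a minimal sketch of LSA in plain Python. It is not Sarthak’s implementation and does not use the HPCC Systems machine learning library. The stop-word list, the sample documents and all function names are hypothetical, chosen only for illustration. It builds a term-document count matrix, drops words such as ‘the’ and ‘and’, and then uses simple power iteration (standing in for a production-quality SVD routine) to place each document in a two-dimensional concept space.

```python
import math
import random

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "and", "a", "of", "on", "in", "to", "at", "after"}

# Hypothetical sample documents: two about finance, two about rivers.
DOCS = [
    "the bank approved the loan and the mortgage",
    "loan interest and mortgage rates at the bank",
    "the river bank flooded after the rain",
    "rain raised the river and flooded the bank",
]


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({w for tokens in token_lists for w in tokens})
    row = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, tokens in enumerate(token_lists):
        for w in tokens:
            matrix[row[w]][j] += 1.0
    return vocab, matrix


def gram_matrix(matrix):
    """Compute A^T A, a small documents-by-documents matrix."""
    n_docs = len(matrix[0])
    return [[sum(r[i] * r[j] for r in matrix) for j in range(n_docs)]
            for i in range(n_docs)]


def top_eigenpairs(sym, k, iterations=500, seed=0):
    """Top-k eigenpairs of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    n = len(sym)
    work = [r[:] for r in sym]
    pairs = []
    for _ in range(k):
        vec = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the unit vector found.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(n))
                    for i in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def document_concepts(docs, k=2):
    """Coordinates of each document in a k-dimensional concept space."""
    _, matrix = term_document_matrix(docs)
    pairs = top_eigenpairs(gram_matrix(matrix), k)
    # Eigenvalues of A^T A are the squared singular values of A, and its
    # eigenvectors are the right singular vectors, one entry per document.
    return [[math.sqrt(max(value, 0.0)) * vec[j] for value, vec in pairs]
            for j in range(len(docs))]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


coords = document_concepts(DOCS, k=2)
print("finance vs finance:", cosine(coords[0], coords[1]))
print("finance vs river:  ", cosine(coords[0], coords[2]))
```

Although all four sample documents contain the word ‘bank’, the two finance documents should score as more similar to each other in concept space than either does to the river documents. A real implementation would use a proper SVD and, for large collections, a distributed factorization rather than this toy loop.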
Sarthak was able to leverage open source implementations for factoring matrices, using PB-blas to extend a factorization that runs on a single node into one that can factor matrices too large for a single node. You can find out more details about Sarthak’s implementation by reading his blog.
Sarthak is a final-year student at Delhi Technological University, studying for a BTech in Computer Engineering. He worked remotely from India with his mentor, John Holt, who is the leader of the HPCC Systems machine learning library project. Most of the students who intern with us work remotely, maintaining regular contact with their mentor and checking in their code in exactly the same way as the developers on the HPCC Systems platform development team.
More information about our HPCC Systems interns of 2016…
- Read about Suk Hwan Hong and Column Level Security on HPCC Systems
- Read about Syed Rahman and the CSCS Machine Learning Algorithm
- Read about Lily Xu and the YinYang K-Means Clustering Machine Learning Algorithm
- Read about Vivek Nair and his machine learning regression suite and ML plugins for the Data Science Portal
- Read about Shweta Oak and Non-negative Matrix Factorization on HPCC Systems
More about internship opportunities…