
Lily Xu joined the team as part of the HPCC Systems intern program in the summer of 2016. Lily is a PhD student at Clemson University, studying Computer Science which includes options in machine learning, data mining and software architecture. Lily submitted a proposal to implement the Yinyang K-Means clustering algorithm in ECL as a new feature to be included in the HPCC Systems machine learning library.

The classic K-Means algorithm is one of the most widely used algorithms for cluster analysis in data mining, mainly because of its simplicity and general applicability. However, it doesn’t always scale well and can be slow. The Yinyang K-Means algorithm gets its name from the ancient Chinese philosophy of the same name, which describes how seemingly opposite or contradictory forces may work in a complementary way towards harmony. The key benefit of the Yinyang K-Means algorithm is its careful but efficient maintenance of two bounds for every point: an upper bound on the distance from the point to its assigned cluster centre, and a lower bound on the distance from the point to the other cluster centres. The interplay between these upper and lower bounds creates a two-level filter, allowing the Yinyang K-Means algorithm to avoid making unnecessary distance calculations. It has been shown that, on average, this algorithm is at least twice as fast as the classic K-Means algorithm.
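To make the bounds idea more concrete, here is a minimal Python sketch of bound-based filtering in K-Means. It is not Lily's ECL implementation and it is not the full Yinyang algorithm: it assumes a single global group of centres (so only one lower bound per point, closer to a Hamerly-style filter), at least two centres, and Euclidean distance. The names `filtered_kmeans`, `dist` and `mean` are made up for this illustration.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points, fallback):
    """Coordinate-wise mean of points; keep the old centre if the cluster is empty."""
    if not points:
        return fallback
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def filtered_kmeans(points, centers, max_iter=100):
    """K-Means with one upper and one lower bound per point (assumes k >= 2).

    Returns (centers, assignments, number of point-to-centre distances computed).
    """
    k = len(centers)
    centers = [tuple(c) for c in centers]
    assign, upper, lower = [], [], []
    n_dist = 0

    # Initial pass: exact distances give tight starting bounds.
    for p in points:
        d = [dist(p, c) for c in centers]
        n_dist += k
        a = min(range(k), key=d.__getitem__)
        assign.append(a)
        upper.append(d[a])                                   # distance to own centre
        lower.append(min(d[j] for j in range(k) if j != a))  # closest other centre

    for _ in range(max_iter):
        # Move each centre to the mean of its points and record how far it moved.
        new_centers = [
            mean([p for p, a in zip(points, assign) if a == j], centers[j])
            for j in range(k)
        ]
        drift = [dist(c, nc) for c, nc in zip(centers, new_centers)]
        centers = new_centers
        max_drift = max(drift)
        if max_drift == 0:
            break  # no centre moved, so no assignment can change

        for i, p in enumerate(points):
            # Loosen the bounds by the centre movement (triangle inequality).
            upper[i] += drift[assign[i]]
            lower[i] -= max_drift
            # First filter: the stale bounds already prove nothing changes.
            if upper[i] <= lower[i]:
                continue
            # Second filter: tighten the upper bound with one real distance.
            upper[i] = dist(p, centers[assign[i]])
            n_dist += 1
            if upper[i] <= lower[i]:
                continue
            # Both filters failed: fall back to the full distance computation.
            d = [dist(p, c) for c in centers]
            n_dist += k
            a = min(range(k), key=d.__getitem__)
            assign[i] = a
            upper[i] = d[a]
            lower[i] = min(d[j] for j in range(k) if j != a)

    return centers, assign, n_dist
```

Calling `filtered_kmeans(points, initial_centers)` with a list of coordinate tuples returns the final centres, the cluster index for each point, and a count of the point-to-centre distances that were actually computed. The full Yinyang algorithm goes further by splitting the centres into several groups and keeping one lower bound per group, which gives a tighter filter than the single global bound assumed here.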

Clustering algorithms are used to uncover categories, which can be extremely helpful to an industry sector like retail. A company that collects information about its customers and what they buy can group together customers who buy the same products and have a similar personal profile. This analysis may then be used, for example, to send targeted offers by email, or to help make product purchasing decisions based on what the company knows it can sell to a specific customer category.

HPCC Systems comes with its own proprietary programming language, ECL. So most of the students who apply to complete internships with us not only have to become familiar with the platform, but also get to learn a new programming language. We encourage students to take training courses and tutorials to help them get going quickly. But it’s important to note here that the internships only last for 10 weeks. It’s particularly impressive, and a testament to the ease of use of the ECL language, when students like Lily manage to get themselves up to speed and implement a very complex clustering algorithm in such a short space of time. If there are students reading this blog who are wondering what qualities we look for in our interns, Lily is a great example to follow. Be curious, interested and excited.

There is more to do on this project. It is working extremely well in Roxie and, although the internship ended in August, Lily is still working to ensure that this algorithm works just as well on Thor. This will establish a baseline for the next phase, which includes adding a grouping capability to her Yinyang implementation, optimizing performance and creating performance test cases to demonstrate the capabilities of the Yinyang K-Means algorithm. We hope that Lily will submit a proposal for this work in 2017. We’d love her to come back as a returning student to complete the great work she started in the summer of 2016.

Lily prepared a very professional poster outlining the work she completed as part of her internship, which she entered into our poster competition at the HPCC Systems Engineering Summit in October 2016. It was great to meet her in person, having heard so much about her and her work. If you want more detail about her project and intern experience, read her blog journal.

More information about our HPCC Systems interns of 2016...

  1. Read about Suk Hwan Hong and Column Level Security on HPCC Systems
  2. Read about Syed Rahman and the CSCS Machine Learning Algorithm 
  3. Read about Sarthak Jain and the Latent Semantic Analysis Machine Learning Algorithm
  4. Read about Vivek Nair and his machine learning regression suite and ML plugins for the Data Science Portal
  5. Read about Shweta Oak and Non-negative Matrix Factorization on HPCC Systems

More about internship opportunities...

  1. Find out about intern opportunities available with LexisNexis.
  2. Interested in a student internship involving coding, machine learning etc? Read about the HPCC Systems intern program.