Academic Program Spotlight – HSQL, Generative Adversarial Networks and the DBSCAN clustering algorithm

The Rashtreeya Vidyalaya College of Engineering (RV College of Engineering) in Bengaluru, India was established in 1963 and is one of the earliest self-financing engineering colleges in India. It is rated number one amongst the top ten self-financing engineering institutions in India, offering 12 undergraduate engineering programmes, 16 Master's degree programmes and Doctoral Studies. The RV College of Engineering utilizes its expertise in various disciplines to conduct research and development for industry and defence establishments in India.

We have been collaborating with the Computer Science and Engineering Department of the RV College of Engineering since 2017. During this time, students have worked on a number of research projects using HPCC Systems. We have also welcomed a number of students on to the HPCC Systems intern program in recent years.

The students are supported by Dr Shobha and Professor Shetty while working alongside our LexisNexis Risk Solutions colleagues, Arjuna Chala (Senior Director Operations) and Roger Dev (Senior Architect). Arjuna Chala is an alumnus of the RV College of Engineering and Roger Dev is the leader of our Machine Learning Library project.

Headshot of Dr Shobha G

Dr. G. Shobha is a Professor in the Computer Science and Engineering Department at the RV College of Engineering. Her areas of interest include artificial intelligence, machine learning, image processing and Natural Language Processing. She is also a guest faculty member for distance learning at the Birla Institute of Technology and Science (BITS Pilani). She has collaborated with many industries in executing projects such as an AI-driven user interface, handwritten sketch to code using Machine Learning, NLP to SQL, a virtual reality platform with a Recommender System for a supermarket, Object Tracking and Recognition, sentiment analysis and more.

Photo of Professor Shetty

Jyoti Shetty is an Assistant Professor in the Computer Science and Engineering Department at the RV College of Engineering. In collaboration with students, she has executed several projects on HPCC Systems, including implementing a distributed DBSCAN, providing evaluation metrics for a clustering algorithm, an IoT plugin for HPCC Systems, an OpenCV interface for HPCC Systems and more. She finds HPCC Systems a simple and powerful open source platform for tackling complex real world problems.

In 2020, Dr Shobha, Professor Shetty and their team are working on three different projects. Let’s meet the teams and find out a bit more about each project.

Design and development of HSQL for HPCC Systems

This project focuses on implementing a new SQL-like language that simplifies the usage of the HPCC Systems Platform. It is designed to work in conjunction with ECL, which is the primary programming language for HPCC Systems. The intention is that HSQL should be an easy-to-use and robust language for general purpose analysis and basic machine learning workflows. The syntax has been designed to be simple and SQL-like, while still allowing the use of useful ECL-specific features such as modules.

Image showing the translation to ECL process

The team has been working on the initial creation of this language, laying out its grammar and designing a compiler which can convert HSQL to ECL, using generated LL(*) parsers. So far, most of the HPCC Systems Machine Learning Library is available for use from within HSQL and various methods of processing such data have also been added, while keeping a strong emphasis on distributed computing. HSQL has been made to be highly extensible and fully interoperable with ECL, so that beginners can get started with HPCC Systems very quickly.
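To give a feel for what translating a SQL-like statement into ECL involves, here is a toy sketch in Python. It is not the HSQL compiler: the input is plain SQL chosen for illustration rather than actual HSQL grammar, a single regular expression stands in for the generated LL(*) parser, and the function name translate_select is hypothetical. The emitted text follows the general ECL pattern of filtering a dataset with ds(condition) and projecting fields with TABLE; the ECL that HSQL really generates may look different.

```python
import re

# One regular expression stands in for a real parser (assumption: the actual
# HSQL compiler uses generated LL(*) parsers, which this does not reproduce).
_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<cond>.+?))?\s*;?\s*$",
    re.IGNORECASE,
)


def translate_select(statement: str) -> str:
    """Translate a simple SELECT statement into an ECL-style OUTPUT action."""
    match = _SELECT.match(statement)
    if match is None:
        raise ValueError(f"unsupported statement: {statement!r}")

    cols = [c.strip() for c in match.group("cols").split(",")]
    table = match.group("table")
    cond = match.group("cond")

    # In ECL, a filter is written by following the dataset name with the
    # condition in parentheses.
    source = f"{table}({cond})" if cond else table

    if cols == ["*"]:
        return f"OUTPUT({source});"
    # TABLE with an inline record layout keeps only the listed fields.
    return f"OUTPUT(TABLE({source}, {{{', '.join(cols)}}}));"


ecl_filtered = translate_select("SELECT name, age FROM persons WHERE age > 30;")
ecl_all = translate_select("select * from persons")
print(ecl_filtered)
print(ecl_all)
```

A real compiler would build a syntax tree and check types and dataset layouts before emitting code; the sketch only shows the source-to-source idea.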

The team is currently testing ways of integrating the language into existing IDEs and looking at how HSQL can be used to simplify the whole process of using HPCC Systems for data analysis.

So far a great amount of progress has been made and this project should make HPCC Systems easier to adopt.

Meet the HSQL Team

Photo of Anurag Singh Bhadauria

Anurag Singh Bhadauria is a 2nd year undergraduate student at the RV College of Engineering. Anurag has interests in compiler design and machine learning applications and says he has found HPCC Systems ‘a great entry point for being able to handle big data’. He is also interested in the study of various formal languages.

Photo of Atreya Bain

Atreya Bain is a 2nd year undergraduate student at the RV College of Engineering. Atreya has keen interests in distributed computing and compiler design. Atreya says that he finds HPCC Systems ‘a great open source solution for distributed computing and machine learning’.

GAN Bundle on HPCC Systems

A Generative Adversarial Network (GAN) is an architecture that pits two “adversarial” neural networks against one another in a virtual arms race. Given a dataset of samples, such as images of cats, the “generator” network tries to produce new images while the “discriminator” network attempts to spot those fakes. As training progresses, the generator learns to produce undetectable forgeries, while the discriminator becomes very good at spotting fakes. This has been used to train a generator that can produce unique art works that can’t be discerned from a genuine article. The trained discriminator can be used, for example, to detect unnatural fraudulent transactions.

Bringing GANs to the HPCC Systems Platform is greatly beneficial, because the platform manages large volumes of data and parallelizes training across multiple nodes, which helps when training very deep networks.

GANs were implemented using the ECL language and the HPCC Systems Generalised Neural Network (GNN) module, which provides an ECL interface into TensorFlow.

Image showing how the GAN works

GANs have been implemented in ECL by transferring weights between the generator, the discriminator and the combined model, to connect them and enable the right kind of training for GANs. The implementation has been tested successfully on the MNIST dataset to generate handwritten numbers and the results of that testing may be seen below:

Results from the testing on the MNIST Dataset
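The weight transfer between the generator, the discriminator and the combined model can be pictured with a small Python sketch. This only illustrates the bookkeeping and is not the ECL/GNN implementation: TinyModel, fake_train_step and gan_round are hypothetical names, and the "training" step is a placeholder that adds 1.0 to each trainable weight so that the flow of weights is easy to follow.

```python
import copy


class TinyModel:
    """Hypothetical stand-in for a network: a dict of named weight lists."""

    def __init__(self, weights):
        self.weights = copy.deepcopy(weights)

    def get_weights(self):
        return copy.deepcopy(self.weights)

    def set_weights(self, new_weights):
        # Only the layers named in new_weights are overwritten.
        self.weights.update(copy.deepcopy(new_weights))


def fake_train_step(model, trainable):
    """Placeholder for real training: nudge only the trainable layers."""
    for name in trainable:
        model.weights[name] = [w + 1.0 for w in model.weights[name]]


def gan_round(generator, discriminator, combined):
    """One round of alternating GAN training with explicit weight transfer."""
    # 1. Train the discriminator on its own (real and generated samples).
    fake_train_step(discriminator, trainable=["d"])
    # 2. Copy the updated discriminator weights into the combined model.
    combined.set_weights(discriminator.get_weights())
    # 3. Train the combined model with the discriminator frozen, so only
    #    the generator layers change.
    fake_train_step(combined, trainable=["g"])
    # 4. Copy the updated generator weights back to the standalone generator.
    generator.set_weights({"g": combined.get_weights()["g"]})


generator = TinyModel({"g": [0.0, 0.0]})
discriminator = TinyModel({"d": [0.0]})
combined = TinyModel({"g": [0.0, 0.0], "d": [0.0]})

for _ in range(2):
    gan_round(generator, discriminator, combined)

print(generator.weights, discriminator.weights, combined.weights)
```

After each round the three models agree on the layers they share, which is the property the weight transfer is there to maintain.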

There is also a further implementation that uses the discriminator to classify data, building on the adversarial training it has already received during generation. With more training, this could help classify data at a much higher accuracy, because the discriminator keeps training on more and more generated fake data.

Using neural networks in risk solutions would be very useful, as they tend to be very accurate when trained efficiently. For example, GANs can be used to predict fraudulent transactions in real time if trained with an appropriate dataset. With further implementation of various architectures, HPCC Systems will be able to move forward with many more applications of this kind.

Meet the GAN Team

Photo of Ambu Karthik

Ambu Karthik is a 2nd year undergraduate student at the RV College of Engineering. Ambu has worked on robotics, data analysis and networking in previous projects. He is curious and very keen to understand data science and integrate it with his areas of interest.

He finds that HPCC Systems provides a great combination of networking and big data analysis alongside machine learning, which gives him a great opportunity to expand his knowledge within his areas of interest.

Photo of Rohit Sadavarte

Rohit Sadavarte is a 2nd year undergraduate student at the RV College of Engineering. Rohit has a keen interest in machine learning along with the maths behind the metrics. He is a keen explorer of new techniques and methods which could perform better than known algorithms.

His interest in HPCC Systems grew when he discovered that the cluster computing capability added more value to his brainstorming and made the whole process of building algorithms much more fun.

Adaptive DBSCAN for Big Data Analytics

Clustering is an efficient unsupervised learning technique. There are a variety of clustering algorithms, such as K-Means, BIRCH and DBSCAN. Of these, the RVCE team believes that DBSCAN is the preferred method because it provides a way to deal with clusters of various shapes and sizes. It has applications in various fields such as anomaly detection, clustering of satellite images, X-ray crystallography and scientific literature. It is limited, however, by the threshold used to distinguish dense regions, which works poorly for datasets with regions of varying density. This project aims to provide an adaptive approach to finding the threshold, so that areas of rapidly changing density can be identified to bound clusters.

Our adaptive approach uses a Gaussian kernel, with the threshold found by grid search. Here, the threshold is the lower bound on the similarity between points of the dataset, and the similarity between points is computed using the Gaussian kernel. The grid search algorithm tries each candidate value of the parameter and chooses the best one based on silhouette score. The approach is able to find clusters of varying density in an accurate and efficient manner.
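The combination of Gaussian kernel similarity, a similarity threshold and a grid search scored by silhouette can be sketched in a few lines of single-machine Python. This is an illustration under stated assumptions, not the team's distributed ECL implementation: all function names are hypothetical, the kernel width sigma and min_pts are fixed by hand, and noise points are left out of the silhouette score, which is a choice made here rather than something the project description specifies.

```python
import math


def gaussian_similarity(p, q, sigma=1.0):
    """Gaussian kernel: 1.0 for identical points, falling towards 0 with distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2 * sigma ** 2))


def similarity_dbscan(points, threshold, min_pts, sigma=1.0):
    """DBSCAN where neighbours are points with similarity >= threshold."""
    n = len(points)
    neighbours = [
        [j for j in range(n)
         if gaussian_similarity(points[i], points[j], sigma) >= threshold]
        for i in range(n)
    ]
    labels = [None] * n  # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
    return labels


def silhouette_score(points, labels):
    """Mean silhouette over clustered points (assumption: noise is ignored)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return -1.0  # silhouette is undefined for fewer than two clusters
    scores = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            a = sum(math.dist(points[i], points[j])
                    for j in members if j != i) / (len(members) - 1)
            b = min(
                sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab
            )
            scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


def grid_search_threshold(points, thresholds, min_pts=3, sigma=1.0):
    """Try each candidate threshold and keep the best silhouette score."""
    best = None
    for t in thresholds:
        labels = similarity_dbscan(points, t, min_pts, sigma)
        score = silhouette_score(points, labels)
        if best is None or score > best[1]:
            best = (t, score, labels)
    return best


# A tight cluster and a looser one: two regions of different density.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.5, 5.0), (5.0, 5.5), (5.5, 5.5)]
best_threshold, best_score, best_labels = grid_search_threshold(
    points, thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
print(best_threshold, round(best_score, 3), best_labels)
```

With the strictest candidate threshold the looser cluster is rejected as noise, while the lower thresholds recover both clusters, which is the behaviour the grid search is meant to detect. Because noise is excluded from the score in this sketch, a production version would also need to guard against thresholds that score well only by discarding most of the data.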

Image showing Adaptive DBSCAN

Meet the DBSCAN Team

Photo of Varsha R Jenni

Varsha R Jenni is a 2nd year undergraduate student at the RV College of Engineering.

Varsha has interests in machine learning and distributed computing and says she has found HPCC Systems to be ‘a great open source platform which makes data processing analysis easier and faster’.

Photo of Akhil Dua

Akhil Dua is a 2nd year undergraduate student at the RV College of Engineering.

Akhil has interests in machine learning applications and data analysis and has discovered that HPCC Systems is a great open source platform for building learning models efficiently and accurately.

The HSQL, GAN and DBSCAN projects are ongoing and more information about their progress will be available later in the year. I’m looking forward to hearing about their achievements and seeing the results of these hard working students.

Projects completed in previous years

Thanks to students from the RV College of Engineering who have contributed to the HPCC Systems Open Source Project in previous years:

RVCE Student interns – Class of 2019

Photo of Akshar Prasad

Akshar Prasad’s project focused on detecting fraud in stored value cards by applying CNN and Random Forest machine learning models to transactional data, to classify a transaction as fraudulent or non-fraudulent. His work involved carrying out a comparison of these methods. Akshar was mentored by Roger Dev (Senior Architect, LexisNexis Risk Solutions), who is the leader of the HPCC Systems Machine Learning Library project.

Find out more about Akshar’s project by listening to his Tech Talk presentation and looking at the poster he entered into our 2019 Technical Poster Contest (Watch Recording / View Slides / View Poster).

Photo of A Suryanarayanan

A Suryanarayanan carried out an evaluation of our machine learning algorithms. His work involved running comparisons with existing benchmarks, the addition of new evaluation metrics and the enhancement of performance checking. Surya was mentored by Arjuna Chala (Senior Director Operations, LexisNexis Risk Solutions) and Lili Xu (Software Engineer III, LexisNexis Risk Solutions).

Find out more about Surya’s project by listening to his Tech Talk presentation. (Watch Recording / View Slides).

Photo of Sathvik K R

Sathvik K R’s project focused on allowing the embedding of Octave database queries within ECL code, using simple wrapper classes to handle scalar values and structured data, including multi-threaded access from the ECL side. Sathvik was mentored by Dan Camper (Senior Architect, LexisNexis Risk Solutions).

Find out more about Sathvik’s project by listening to his Tech Talk presentation and looking at the poster he entered into our 2019 Technical Poster Contest (Watch Recording / View Slides / View Poster).

RVCE Student Intern – Class of 2018

Photo of Jayashree Ukkinagatti

Jayashree Ukkinagatti completed a project which involved implementing the continuous integration of ROXIE queries and data deployments using Jenkins. She was mentored by Anthony Fishbeck (Senior Architect, LexisNexis Risk Solutions) and Rodrigo Pastrana (Architect, LexisNexis Risk Solutions).

Find out more about this project by listening to Jayashree present about her work in one of our Tech Talk webcasts (Watch Recording / View Slides).

A number of research papers have been published by students from RVCE. Our Academic Publications page includes links to these papers and all other research papers linked with HPCC Systems related projects and collaborations that we are aware of to date. Let us know if you have or know of an HPCC Systems related research paper we should add to this growing list of notable achievements.

RVCE hosted the 4th IEEE International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS 2019). A number of students, including some of our 2019 HPCC Systems interns, presented papers about HPCC Systems related projects. Read about this event and the 45 papers that were presented across subjects such as social media, IoT, image processing, Biotech, Cloud computing and more.

Dr Shobha presented a webinar on Improving the Efficiency of Machine Learning Algorithms using the HPCC Systems Platform at the IEEE Computer Society in June 2020. View the poster promoting this event and listen to the recording of Dr Shobha’s presentation (Watch Recording / View Slides).