Cruising the ML World with HPCC Systems and ECL
The 2021 HPCC Systems Virtual Community Day Summit included a series of three one-hour workshops providing hands-on, tutorial-style ECL training sessions focusing on some of our Machine Learning Library bundles. All sessions were recorded, so you can take each one in sequence to complete the full course.
Join our trainers, Bob Foreman and Hugo Watanuki, as they take you on a journey into the world of using machine learning with HPCC Systems. Follow along as they demonstrate how to use our DBSCAN, K-Means, Logistic and Linear Regression, Generalized Neural Networks and Learning Trees machine learning bundles.
Bob Foreman
Senior Software Engineer
LexisNexis Risk Solutions Group
Bob has worked with the HPCC Systems big data analytics platform and the ECL programming language for over 10 years and has been a technical trainer for over 25 years. He is the developer and designer of the HPCC Systems Online Training Courses and is the Senior Instructor for all our classroom and Webex/Lync based training.
Learn more about his experience of teaching ECL and HPCC Systems and how our training materials and lessons have evolved to meet the needs of our open source community, helping businesses solve real world data problems.
Hugo Watanuki
Senior Software Engineer
LexisNexis Risk Solutions Group
Hugo has been supporting the development and delivery of training programs for the HPCC Systems platform in the Brazil region since 2019. Hugo has worked for over 14 years in various technical roles in the IT industry with a focus on High Performance Computing. He is also a part-time researcher in Information Systems and a member of the UK Academy for Information Systems.
Learn more about how the Brazil team are engaging with universities looking to collaborate with industry experts to teach students big data analytical skills, engage in research projects and obtain data science skills.
Join a Workshop Session
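Each workshop is one hour long. Recordings of these sessions are available on the HPCC Systems YouTube Channel and are best completed in the following order:
- Workshop 1 – Introduction/Unsupervised Learning
- Workshop 2 – Supervised Learning
- Workshop 3 – Neural Networks and Ensemble Learning, GNN and Boosted Trees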
While our machine learning bundles tend to be platform version independent, it is recommended that you start with the most recent HPCC Systems Client Tools version and you’ll also need Git for Windows. Our ML bundles also require the installation of our ML-Core and PBblas bundles.
Details of the recommended Client installation prerequisites (such as ECL IDE, Git, Machine Learning Bundles to install, code examples and data source details) are available in the HPCC Systems Community Workshops GitHub Repository.
Workshop 3 introduces Neural Networks and Ensemble Learning, which require the installation of Tensorflow and Python. Details of how to do this are provided in the workshop, but if you want to prepare in advance, the instructions are provided in the relevant section below.
Please Note: To get the most out of these sessions, taking our Introduction to ECL – Part 1 and Part 2 courses beforehand is highly recommended. You will need to have a basic familiarity with the ECL language, including query building and data handling (Extract Transform and Load).
Workshop 1 – Introduction/Unsupervised Learning
This workshop starts with an introduction to the HPCC Systems Machine Learning Library, focusing specifically on the unsupervised learning clustering algorithms DBSCAN and K-Means.
There is a lot of specific terminology associated with machine learning, so this session sets the scene, introducing a number of concepts illustrated by an example, such as:
- Creating a training set and inference
- Choosing and building a model
- Generalisation and overfitting
- Prediction error
- Quantitative and qualitative models
Find out how all this translates into the world of HPCC Systems and the ECL language, learn how to install the machine learning bundles and discover some possible use cases for clustering algorithms like DBSCAN and K-Means. The following learning points are covered:
- Preparing the data
- Training a model
- Assessing the model and analysing it using the Silhouette Coefficient
- Comparison of K-Means and DBSCAN
A demonstration of both bundles in action is provided, showing how the concepts covered can be applied to real world problems.
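To make the Silhouette Coefficient mentioned above more concrete, here is a minimal sketch in plain Python rather than ECL. It is not the implementation used by the HPCC Systems bundles; the function names `euclidean` and `silhouette_score` and the sample points are assumptions chosen purely for illustration. For each point, it compares the mean distance to the other members of its own cluster (a) with the smallest mean distance to any other cluster (b), and scores the point as (b - a) / max(a, b).

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative only)."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical sample data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately poor assignment
print(f"sensible: {good:.3f}, poor: {bad:.3f}")
```

The sensible assignment scores close to 1, which indicates compact, well separated clusters. The poor assignment scores below 0, which signals that points sit closer to another cluster than to their own.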
Workshop 2 – Supervised Learning
Supervised learning is the most common type of machine learning. It involves training a system to perform a task using record sets for which the target output patterns are provided. In this workshop, you will use the HPCC Systems Logistic Regression and Linear Regression machine learning bundles.
This session starts with a definition of each algorithm and an explanation of how they work. Each bundle contains a number of functions, so whether you are already familiar with Logistic and Linear Regression or are a new user, the full list of functions available is provided with an explanation of the purpose of each one.
An analysis of the advantages and disadvantages of using each bundle is provided as well as a demonstration of use.
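As a rough illustration of the difference between the two algorithms, here is a minimal sketch in plain Python. It does not use the HPCC Systems bundles or reflect their function names; `fit_line`, `sigmoid`, `predict_probability` and the sample values are assumptions made for this example. Linear regression fits a straight line to predict a number, while logistic regression passes a linear combination through the sigmoid function to produce a probability between 0 and 1.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight, bias):
    """Logistic model: probability that x belongs to the positive class."""
    return sigmoid(weight * x + bias)


# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Hypothetical, hand-picked coefficients (not learned from data).
print(predict_probability(0.0, weight=1.5, bias=0.0))
```

In practice the logistic coefficients are learned from labelled training data rather than chosen by hand; they are fixed here only to keep the sketch short.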
Workshop 3 – Neural Networks and Ensemble Learning, GNN and Boosted Trees
This workshop focuses on using the Generalized Neural Networks Bundle (GNN) and the Boosted Forests algorithm from the Learning Trees Bundle available in the HPCC Systems Machine Learning Library.
Neural Networks and using the HPCC Systems GNN Bundle
Deep Learning is an area of machine learning that involves using artificial neural networks and algorithms that are inspired by how the human brain learns as it absorbs and analyses large amounts of data.
This section of the workshop is designed to provide the basics to get you started using Neural Networks with HPCC Systems and ECL, helping you to lay a firm foundation to build on in the future. Learn about the advantages and disadvantages of using Neural Networks and three of the most popular types used.
The installation of Tensorflow and Python is required for use with the GNN Bundle. If you are administering or building your own cluster, here are the steps for installation:
- On Ubuntu, first refresh the APT (Advanced Package Tool) repository:
sudo apt update
- Install Python3 if not already installed:
sudo apt install python3
- Install pip3 (Python3 package installer). This will take a few minutes.
sudo apt install python3-pip
- Install tensorflow for all users. This is the recommended approach, since it needs to be available to the hpcc user as well as the current user. The -H option to sudo is necessary in order to have it installed globally:
sudo -H pip3 install tensorflow
- Finally, the setuptest.ecl file found in the Test directory of the GNN bundle will verify that Python3 and Tensorflow are correctly installed on each Thor node.
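Before running setuptest.ecl, you may want a quick local check on a single machine. The following is a minimal sketch, not part of the GNN bundle; the function name `check_prerequisites` is an assumption made for this example. It only inspects the Python interpreter that runs it, so it does not replace the per-node verification that setuptest.ecl performs.

```python
import importlib.util
import sys


def check_prerequisites(modules=("tensorflow",), min_python=(3, 0)):
    """Return a dict mapping each check name to True or False."""
    results = {"python": sys.version_info >= min_python}
    for name in modules:
        # find_spec locates a top-level package without importing it,
        # so the check stays fast even for a large library.
        results[name] = importlib.util.find_spec(name) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Because Tensorflow needs to be visible to the hpcc user as well as your own, it is worth running the script under both accounts.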
In a cloud configuration using Docker with Kubernetes and Helm, the installation of GNN support is as simple as selecting the appropriate Helm chart, for example:
helm install myhpcc hpcc/hpcc --version "8.2.0-2" --set global.image.version=8.2.0-2 --set global.image.name=platform-gnn
Discover how to use the GNN Tensor Module to:
- Build ECL Tensors
- Define the Keras Layers used to create the Tensorflow model
- Define the settings used to compile the Keras model and configure the training process
The full list of functions available in the GNN bundle is provided with a description of use and a glossary of terms, as well as examples of use and unit test programs to help evaluate accuracy and performance. Walk through some custom examples and interpret the results.
Some background reading about using Neural Networks and the HPCC Systems GNN Bundle is available in our blog, Analyze images, videos, time-series and more with the Generalized Neural Network bundle (GNN).
Ensemble Methods – Boosted Forest and Trees
Ensemble Learning methods use multiple learning algorithms to achieve better predictive performance than might be achieved using any of the algorithms individually. This workshop introduces the HPCC Systems Learning Trees Bundle, which includes Decision Trees, Random Forest and Gradient Boosted Trees algorithms.
Learn how the different types of Decision Tree algorithms work, how they compare in scalability and prediction accuracy, and why they are considered to be among the easiest algorithms to use.
A demonstration showing the use of the BoostedRegForest algorithm (Regression using Boosted Forest) is provided. Boosted Forests are a combination of the Gradient Boosted Trees and Random Forest algorithms, taking advantage of their accuracy and ease of use respectively.
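To give a feel for the boosting half of this combination, here is a minimal sketch of gradient boosting for regression in plain Python, using one-split trees (stumps) on a single feature. It is not the BoostedRegForest implementation from the Learning Trees Bundle and it leaves out the Random Forest half entirely; the names `fit_stump` and `boost`, the default settings and the sample data are all assumptions made for illustration. Each round fits a new stump to the errors left by the model so far and adds a scaled-down version of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Return a predict function built from `rounds` boosted stumps."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        # Each new stump is trained on what the ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical data with a single step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
model = boost(xs, ys)
print([round(model(x), 3) for x in xs])
```

A smaller learning rate makes each stump contribute less, which usually calls for more rounds but tends to reduce overfitting.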
Some background reading about using the HPCC Systems Learning Trees Bundle is available in our blog, Learning Trees – A Guide to Decision Tree Based Machine Learning.
Other Training Opportunities
A full suite of online training courses is available on the HPCC Systems website, providing a range of learning opportunities. Digital badges are awarded to all who complete our online training courses. An additional badge is awarded to those who complete all the core competency courses and the Ultimate Master badge goes to those who successfully complete the entire suite.
Note: The advanced ECL and machine learning courses require the completion of our introductory ECL courses as a prerequisite.
ECL Core Classes
These online classes provide access to both our introductory and advanced ECL language classes, as well as classes on using ROXIE queries and some optional courses for those wanting to extend their knowledge further:
- Introduction to ECL Part 1 – Concepts and Queries
- Introduction to ECL Part 2 – The Extract Transform and Load (ETL) Process
- Advanced ECL Part 1 – Working with Relational Data
- Advanced ECL Part 2 – Superfiles, working with XML and free form text parsing
- ROXIE ECL Part 1 – Indexes and Queries
- ROXIE ECL Part 2 – Complex Query Development
- Applied ECL – ECL Code Generation Tools (Optional)
- Applied ECL – Special Projects (Optional)
Machine Learning
This online class covers much of the information included in the workshops above, although it does go into a little more depth about the Myriad Interface.
Managers Summary
The Introduction to HPCC Systems for Managers provides a basic familiarity with HPCC Systems and shows how the ECL language can be used to build powerful data queries.
Administration
These courses provide a suite of sessions for those managing HPCC Systems environments. The suite starts with an architectural overview, routine maintenance and best practises to observe, before moving on to the specific needs of a Thor and/or ROXIE cluster:
- HPCC Systems Administration
- Introduction to HPCC Systems Administration – Thor Clusters
- Advanced HPCC Systems Administration – ROXIE Clusters
There are a number of How To Videos for those interested in quick tips on using HPCC Systems and specific ECL language features as well as a variety of presentations from our developers and trainers.
Many more learning opportunities are available on the HPCC Systems YouTube Channel, including presentations from our conferences on a variety of topics and use cases that may be relevant to your own project.
Getting Help and Keeping in Touch
If you have questions or would like to connect with others working on HPCC Systems projects, use our Community forums to post comments and questions. There are a number of different forums focusing on specific areas:
- General Forum – Announcements, jobs and student programs
- Developer Forum – All things ECL
- Data Scientists – Machine Learning and data analytics
- Administrators and Operations
- Tools and Plugins – VS Code, Java, R etc.
Report issues using the HPCC Systems JIRA issue tracker and keep in touch with development news by reading our blog and subscribing to our newsletter.
If you are new to HPCC Systems, find out more about us here.