HPCC Systems Machine Learning Library

The HPCC Systems Machine Learning Library provides a wide range of Machine Learning algorithms accessible from ECL, and designed to utilize the parallel computing capabilities of HPCC Systems.

Getting Started

If you are new to Machine Learning or would like a refresher on basic concepts and terminology, please see Machine Learning Demystified.

For a tutorial on installing and using the ML bundles, see Using HPCC Systems Machine Learning.

Overview

The HPCC Systems Machine Learning Library is provided as a set of independently installable HPCC Systems bundles. The HPCC Systems bundle capability provides an easy-to-use mechanism for packaging and installing ECL feature packages. It also supports prerequisites and will notify you if the required version of a prerequisite bundle is missing.

There are several core bundles that are utilized by the various ML algorithms and one bundle for each supported family of Machine Learning algorithms.

Common Features

Each ML algorithm provides a mechanism for learning from provided training data and retrieving a model (GetModel). That model, the encapsulation of the learning, can then be used to predict values for new data (Prediction / Classification). Furthermore, each algorithm provides methods to assess the predictive power of the learned model (Assessment) in ways that are appropriate for that algorithm.

Each bundle supports the ‘myriad’ interface, which is a way to perform many similar actions on different sets of data with a single invocation. For example, you may want to create a separate model for each city and then predict data for each city using its own model. The myriad interface lets you process those activities in parallel. For more detail and a tutorial, please see our Myriad Interface Tutorial.

Available Bundles

Core Bundles

  • ML_Core
    Provides the core data definitions and attributes for machine learning. ML_Core is a prerequisite for all HPCC Systems production machine learning bundles.
  • PBblas – Parallel Block Linear Algebra Subsystem
    Provides distributed, scalable matrix operations used by several of the other bundles. Can also be used directly whenever matrix operations are in order. This is a dependency for several of our production machine learning bundles (as shown below).

Supervised Learning Bundles

  • LinearRegression
    Ordinary Least Squares Linear Regression for use as an ML algorithm or for other uses such as data analysis.
  • GaussianProcessRegression
    Random Fourier Features accelerated Gaussian Process Regression.
  • LogisticRegression
    Classification using Logistic Regression methods, both Binomial (two classes) and Multinomial (multiple classes). In spite of the name, Logistic Regression is a Classification method, not a Regression method.
  • GLM
    Generalized Linear Model. Provides Regression and Classification algorithms for situations in which your data does not match the assumptions of LinearRegression or LogisticRegression. Handles a variety of data distribution assumptions.
  • SupportVectorMachines
    SVM implementation for Classification and Regression using the popular LibSVM under the hood.
  • LearningTrees
    Decision Tree based learning module. Includes Decision Trees, Random Forest, Gradient Boosted Trees, and Boosted Forest capabilities.
  • Generalized Neural Networks (GNN)
    Parallelized interface to Keras / TensorFlow supporting arbitrarily complex Neural Networks for processing multimedia data types such as Image, Video, and Time-series.

Unsupervised Learning Bundles

  • K-Means
    Unsupervised Clustering Algorithm. Assigns data points to one of K clusters based on Euclidean Distance.
  • DBSCAN
    Unsupervised Density-based Clustering Algorithm. Detects cluster boundaries based on areas of low density. Produces a variable number of clusters based on density variations.

Natural Language Processing Bundles

  • TextVectors
    Unsupervised vectorization of words, phrases, and sentences. Converts plain text into numeric vectors that can be compared directly or used as features for other Machine Learning algorithms.

Causal Analytics Bundles

  • HPCC_Causality
    Causal Analysis bundle supporting Causal Discovery, Causal Model Validation, Causal Inference, and Causal Metrics. Also includes general purpose Synthetic Dataset Generation and Probability modules.

Bundle Details

Note that in order to install or use any of the bundles, you will need to have installed HPCC Systems Client Tools on your local machine.
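Because several bundles depend on others, it can help to install a bundle's prerequisites first. The following is a minimal sketch, not an official installer: the function names prereqs_for and install_with_prereqs and the DRY_RUN variable are hypothetical, and the prerequisite table is simply copied from the Bundle Details on this page. Bundles are identified by their repository names, so DBSCAN appears as dbscan. By default the script only prints the commands it would run; set DRY_RUN=0 to execute them, which assumes the ecl client tool is on your PATH.

```shell
#!/usr/bin/env bash
# Sketch: install an ML bundle after its prerequisites.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
BASE_URL="https://github.com/hpcc-systems"

# Print the prerequisites of a bundle, in install order.
# The table mirrors the "Prerequisites" entries listed on this page.
prereqs_for() {
  case "$1" in
    ML_Core) : ;;                       # no prerequisites
    PBblas) echo "ML_Core" ;;
    LinearRegression|LogisticRegression|GLM) echo "ML_Core PBblas" ;;
    GaussianProcessRegression|SupportVectorMachines|LearningTrees|GNN|KMeans|dbscan|TextVectors|HPCC_Causality)
      echo "ML_Core" ;;
    *) return 1 ;;                      # not a bundle named on this page
  esac
}

# Install the prerequisites of a bundle, then the bundle itself.
install_with_prereqs() {
  local target="$1" deps name
  deps="$(prereqs_for "$target")" || {
    echo "unknown bundle: $target" >&2
    return 1
  }
  for name in $deps "$target"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "ecl bundle install ${BASE_URL}/${name}.git"
    else
      ecl bundle install "${BASE_URL}/${name}.git"
    fi
  done
}

install_with_prereqs LinearRegression
```

In dry-run mode this prints one install command each for ML_Core, PBblas, and LinearRegression, in that order. PC users still need an elevated command prompt (see Note 1) before running the commands for real.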

ML_Core

  Content Summary: Common data definitions, Common functions, Data preparation functions.

  Prerequisites: None

  Tutorial: Using HPCC Systems Machine Learning, Understanding the Myriad Interface

  Documentation: ML_Core Documentation

  Source code: HPCC Systems ML_Core repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/ML_Core.git [PC users see Note 1]

PBblas

  Content Summary: Scalable Linear Algebra / Matrix Operations

  Prerequisites: ML_Core

  Tutorial: Introduction to PBblas

  Documentation: PBblas Documentation

  Source code: HPCC Systems PBblas repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/PBblas.git [PC users see Note 1]

LinearRegression

  Content Summary: Ordinary Least Squares Linear Regression (multivariate) with analytics.

  Prerequisites: ML_Core, PBblas

  Documentation: LinearRegression Documentation

  Source code: LinearRegression repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/LinearRegression.git [PC users see Note 1]

GaussianProcessRegression

  Content Summary: Random Fourier Features (RFF) accelerated Gaussian Process Regression.

  Prerequisites: ML_Core

  Documentation: GaussianProcessRegression Documentation

  Source code: GaussianProcessRegression repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/GaussianProcessRegression.git [PC users see Note 1]

Requires Python 3 on each cluster server.

LogisticRegression

  Content Summary: Binomial and Multinomial classification with Logistic Regression.

  Prerequisites: ML_Core, PBblas

  Documentation: LogisticRegression Documentation

  Source code: LogisticRegression repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/LogisticRegression.git [PC users see Note 1]

GLM

  Content Summary: Generalized Linear Model for regression and classification

  Prerequisites: ML_Core, PBblas

  Documentation: GLM Documentation

  Source code: GLM repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/GLM.git [PC users see Note 1]

SupportVectorMachines

  Content Summary: SVM classification and regression with automatic grid search for parameters.

  Prerequisites: ML_Core

  Documentation: SupportVectorMachines Documentation

  Source code: SVM repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/SupportVectorMachines.git [PC users see Note 1]

LearningTrees

  Content Summary: Random Forest, Gradient Boosted Trees, and Gradient Boosted Forest.

  Prerequisites: ML_Core

  Tutorial: LearningTrees Tutorial

  Documentation: LearningTrees Documentation

  Source code: LearningTrees repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/LearningTrees.git [PC users see Note 1]

GNN

  Content Summary: Generalized Interface to Keras / TensorFlow for Neural Networks

  Prerequisites: ML_Core

  Tutorial: GNN Tutorial

  Documentation: GNN Documentation

  Source code: GNN repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/GNN.git [PC users see Note 1]

Requires TensorFlow to be installed on each cluster server.
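Some bundles on this page need Python modules on the cluster servers (TensorFlow for GNN, for example). As a rough aid, the sketch below checks whether a given module can be located by python3 on the machine where it runs. The helper names has_python_module and report_module are hypothetical, the check covers only the local node, and it would have to be repeated on every cluster server. It looks the module up without importing it, so it does not prove the module actually loads.

```shell
#!/usr/bin/env bash
# Sketch: report whether a Python module is visible to python3 on this node.

# Return 0 if the module is found, 1 if it is not, 2 if python3 is absent.
has_python_module() {
  command -v python3 >/dev/null 2>&1 || return 2
  python3 -c 'import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1" 2>/dev/null
}

# Print a one-line, human-readable status for the module.
report_module() {
  local rc=0
  has_python_module "$1" || rc=$?
  case "$rc" in
    0) echo "$1: found" ;;
    2) echo "$1: python3 not on PATH" ;;
    *) echo "$1: missing" ;;
  esac
}

# 'tensorflow' is the import name of the TensorFlow package.
report_module tensorflow
```

The same helper can be pointed at any other module by passing its import name as the argument.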

K-Means

  Content Summary: K-Means Unsupervised Clustering

  Prerequisites: ML_Core

  Tutorial: KMeans Tutorial

  Documentation: KMeans Documentation

  Source code: KMeans repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/KMeans.git [PC users see Note 1]

DBSCAN

  Content Summary: DBSCAN Unsupervised Clustering

  Prerequisites: ML_Core

  Tutorial: DBSCAN Tutorial

  Documentation: DBSCAN Documentation

  Source code: DBSCAN repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/dbscan.git [PC users see Note 1]

TextVectors

  Content Summary: Text Vectorization for words, phrases, and sentences

  Prerequisites: ML_Core

  Tutorial: Text Vectors Tutorial

  Documentation: TextVectors Documentation

  Source code: TextVectors repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/TextVectors.git [PC users see Note 1]

HPCC_Causality

  Content Summary: Causal Analysis Toolkit

  Prerequisites: ML_Core

  Tutorial: Causality Tutorial

  Documentation: Causality Documentation

  Source code: Causality Toolkit repository on GitHub

  Installation:

 ecl bundle install https://github.com/hpcc-systems/HPCC_Causality.git [PC users see Note 1]

Requires Python module “Because” on each cluster server.

Notes:

[1] When installing bundles on a PC, the command prompt must be run as Administrator. Right-click the command prompt icon on the Start menu and select “Run as administrator”.