The HPCC Systems Machine Learning Library provides a wide range of Machine Learning algorithms accessible from ECL, and designed to utilize the parallel computing capabilities of HPCC Systems.
Two levels of algorithm are provided:
- The ML Production Bundles include a range of proven, parallelized ML capabilities that are fully supported, documented, validated, and performance tested.
- The ecl-ml package provides a wide range of beta-level and experimental machine learning capabilities, which are at various stages of quality, performance optimization, and documentation support.
We recommend that you start with the ML Production Bundles. If you need an algorithm that is not yet available as a Production Bundle, use the ecl-ml package.
The ecl-ml package and its documentation are accessible here. The remainder of this page provides an introduction to the ML Production Bundles. The Bundle Details section below provides links to the source code and documentation for each bundle.
ML Production Bundles
If you are new to Machine Learning or would like a refresher on basic concepts and terminology, please see Machine Learning Demystified.
For a tutorial on installing and using the ML bundles, see Using HPCC Systems Machine Learning.
The ML Production Bundles are provided as a set of independently installable HPCC Systems bundles. The HPCC Systems bundle capability provides an easy-to-use mechanism for packaging and installing ECL feature packages. It also supports prerequisites and will notify you if you are missing the required version of a prerequisite bundle.
There are several core bundles that are used by the various ML algorithms, and one bundle for each supported family of ML algorithms.
Each ML algorithm provides a mechanism for learning from provided training data and retrieving a model (GetModel). That model, the encapsulation of the learning, can then be used to predict values for new data (Prediction / Classification). Furthermore, each algorithm provides methods to assess the predictive power of the learned model (Assessment) in ways that are appropriate for that algorithm.
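The learn, predict, and assess cycle described above can be illustrated with a small, language-neutral sketch. The following is plain Python, not ECL, and it does not use or reproduce the bundle API. The names `LinearModel`, `get_model`, `predict`, and `r_squared` are hypothetical and chosen only to mirror the three stages (GetModel, Prediction, Assessment), using single-variable least squares as the example algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearModel:
    """The 'encapsulation of the learning': y is approximated by intercept + slope * x."""
    intercept: float
    slope: float


def get_model(xs: List[float], ys: List[float]) -> LinearModel:
    """Learn from training data (hypothetical stand-in for a GetModel step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return LinearModel(intercept=mean_y - slope * mean_x, slope=slope)


def predict(model: LinearModel, xs: List[float]) -> List[float]:
    """Apply a previously learned model to new data."""
    return [model.intercept + model.slope * x for x in xs]


def r_squared(model: LinearModel, xs: List[float], ys: List[float]) -> float:
    """Assess predictive power with the coefficient of determination."""
    mean_y = sum(ys) / len(ys)
    preds = predict(model, xs)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8]
    model = get_model(train_x, train_y)           # learn once
    print(predict(model, [5.0, 6.0]))             # reuse the model on new data
    print(r_squared(model, train_x, train_y))     # assess the fit
```

The point of the sketch is the separation of concerns: the model is a value returned by the learning step, so it can be stored and reused for prediction and assessment without retraining.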
Each bundle supports the 'myriad' interface, which is a way to perform many similar actions on different sets of data, with a single invocation. For example, you may want to create a separate model for each city and then use that set of models to predict data for each city using its own unique model. The myriad interface lets you process those activities in parallel. For more detail and a tutorial, please see our Myriad Interface Tutorial.
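The idea behind the myriad interface can be sketched as follows. This is a conceptual illustration in plain Python, not ECL and not the bundle API. It assumes each record is tagged with a work-item key (here a city name), and the helper names `fit_line`, `get_models`, and `predict` are hypothetical. The sketch processes the work items one after another, whereas the myriad interface described above processes them in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Model = Tuple[float, float]  # (intercept, slope)


def fit_line(points: List[Tuple[float, float]]) -> Model:
    """Single-variable least-squares fit for one work item."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)


def get_models(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, Model]:
    """One call, many models: rows are (work_item, x, y) tuples."""
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for work_item, x, y in rows:
        grouped[work_item].append((x, y))
    # Each work item gets its own independent model.
    return {work_item: fit_line(points) for work_item, points in grouped.items()}


def predict(models: Dict[str, Model],
            rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Predict each (work_item, x) row with the model learned for that work item."""
    results = []
    for work_item, x in rows:
        intercept, slope = models[work_item]
        results.append((work_item, x, intercept + slope * x))
    return results


if __name__ == "__main__":
    training = [
        ("Atlanta", 0.0, 1.0), ("Atlanta", 1.0, 3.0),
        ("Boston", 0.0, 10.0), ("Boston", 2.0, 6.0),
    ]
    models = get_models(training)
    print(predict(models, [("Atlanta", 2.0), ("Boston", 1.0)]))
```

Because the work items share no state, the per-city fits are independent of one another, which is what makes this pattern suitable for parallel execution.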
- ML_Core
Provides the core data definitions and attributes for machine learning. ML_Core is a prerequisite for all HPCC Systems production machine learning bundles.
- PBblas - Parallel Block Linear Algebra Subsystem
Provides distributed, scalable matrix operations used by several of the other bundles. Can also be used directly whenever matrix operations are in order. This is a dependency for several of our production machine learning bundles (as shown below).
- LinearRegression
Ordinary Least Squares Linear Regression for use as an ML algorithm or for other purposes such as data analysis.
- LogisticRegression
Classification using Logistic Regression methods, both Binomial (two classes) and Multinomial (multiple classes). Despite the name, Logistic Regression is a Classification method, not a Regression method.
- GLM
General Linear Model. Provides Regression and Classification algorithms for situations in which your data does not match the assumptions of LinearRegression or LogisticRegression. Handles a variety of data distribution assumptions.
- SupportVectorMachines
Support Vector Machine (SVM) implementation for Classification and Regression, built on the popular LibSVM library.
- LearningTrees
Decision Tree based learning module. Includes Decision Trees, Random Forest, Gradient Boosted Trees, and Boosted Forest capabilities.
Note that to install or use any of the bundles, you must first install HPCC Systems Client Tools on your local machine.
Content Summary: Common data definitions, Common functions, Data preparation functions.
Documentation: ML_Core Documentation
Source code: HPCC Systems ML_Core repository on GitHub
ecl bundle install https://github.com/hpcc-systems/ML_Core.git
Content Summary: Scalable Linear Algebra / Matrix Operations
Documentation: PBblas Documentation
Source code: HPCC Systems PBblas repository on GitHub
ecl bundle install https://github.com/hpcc-systems/PBblas.git
Content Summary: Ordinary Least Squares Linear Regression (multi-variate) with analytics.
Documentation: LinearRegression Documentation
Source code: LinearRegression repository on GitHub
ecl bundle install https://github.com/hpcc-systems/LinearRegression.git
Content Summary: Binomial and Multinomial classification with Logistic Regression.
Documentation: LogisticRegression Documentation
Source code: LogisticRegression repository on GitHub
ecl bundle install https://github.com/hpcc-systems/LogisticRegression.git
Content Summary: General Linear Model for regression and classification
Documentation: GLM Documentation
Source code: GLM repository on GitHub
ecl bundle install https://github.com/hpcc-systems/GLM.git
Content Summary: SVM classification and regression with automatic grid search for parameters.
Documentation: SupportVectorMachines Documentation
Source code: SVM repository on GitHub
ecl bundle install https://github.com/hpcc-systems/SupportVectorMachines.git
Content Summary: Random Forest Regression and Classification, Feature importance metric.
Documentation: LearningTrees Documentation
Source code: LearningTrees repository on GitHub
ecl bundle install https://github.com/hpcc-systems/LearningTrees.git