One of the many improvements coming your way this year is a complete restructuring of the HPCC Systems Machine Learning Library. Our ML library may be used alongside the HPCC Systems platform, ECL IDE and the growing number of embedded language plugins and third party modules that you can use to tailor HPCC Systems to meet your specific needs.
The restructuring is an ongoing process which is likely to run on into 2018, so this won’t be the only post you read about it. Once complete, the HPCC Systems Machine Learning Library will perform better, be easier to use and be supported by more extensive documentation and examples.
The current HPCC Systems Machine Learning Library can be found on GitHub in its own repository as a collection of algorithms. It’s still in use and will remain available while we work on the new, improved library. The supporting documentation can be found on the HPCC Systems website, along with details of how to install the platform, ECL IDE and any embedded language plugins you may need.
The new restructured HPCC Systems Machine Learning Library will provide families of related machine learning algorithms implemented as individual bundles, each one stored in its own GitHub repository. Each bundle will have a dependency on both the PBblas and ML_Core bundles which, in turn, require HPCC Systems platform version 6.2.0 or later. All new machine learning bundles also have a dependency on the HPCC Systems ECL Standard Library, which is included as part of the regular installation process.
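As a sketch of how a restructured bundle and its shared dependencies might be pulled into ECL code once installed (the LinearRegression bundle name is one of the bundles mentioned later in this post; treat the exact names as illustrative until the bundles ship):

```ecl
// Bundles are installed once per environment with the ecl command
// line tool, e.g.:
//   ecl bundle install <bundle-repository-url>
// After that, an ECL query simply IMPORTs what it needs.
IMPORT ML_Core;            // shared core types and utilities
IMPORT PBblas;             // parallel block BLAS, used by the ML bundles
IMPORT LinearRegression;   // an individual ML bundle (illustrative)
```

Because each bundle declares a minimum platform version, the `ecl bundle install` step is also where an incompatible platform would be flagged.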
We are also packaging validation tests to be included in each bundle. Code comments providing information about the attributes included in the bundles will also be processed into a new wiki. We’d like to package performance tests for each bundle too, so we are investigating how we might do this for alternative implementations, which naturally run on a completely different platform.
We have opted to use the combination of bundles and our own ECL Standard Library because we can more easily leverage the preferred ECL distribution approach of bundles and maintain stability for the critical platform-sensitive attributes. The latter is achieved by allowing the bundle author to specify a minimum platform version that can be used with the bundle, and this requirement is communicated when the bundle is installed.
One of the problems for users of Machine Learning attributes is that it isn’t always possible to determine, in advance, the performance characteristics of a particular capability. To make this evaluation and decision making process easier, all restructured HPCC Systems ML attributes will be classified into one of four performance profiles:
| # | Performance profile | Description | Performance requirement |
|---|---|---|---|
| 1 | Proof of concept | Feature is known to work but may not be suitable for exploitation at scale | No performance guarantees |
| 2 | Suitable for large problems | Feature is suitable for problems that are beyond the capabilities of a solution on a single machine. What counts as large depends on the nature of the feature, e.g. a training dataset that is too large to be effectively processed by a single machine would be considered a large problem | Must provide a speed-up that increases linearly with the number of nodes, as the number of nodes increases from 10 to 100. A performance floor is acceptable |
| 3 | Suitable for large numbers of small problems | Feature is suitable for cases where there are many problem instances to solve, but each instance can be readily solved with the resources of a single machine | The elapsed time is expected to vary linearly with both the number of problem instances and the number of machines, between 10 and 100 machines |
| 4 | Suitable for both large problems and large numbers of small problems | Must satisfy the descriptions of 2 and 3 above | Must satisfy the requirements of 2 and 3 above |
Note that “suitable for large problems” is not purely a data size consideration. A complex network problem may have data that is small enough to fit on a single machine, but the amount of computation on that data makes the problem intractable on a single machine.
For linear algebra based applications, one difficulty is determining an appropriate partitioning scheme, which can significantly affect performance. Several attributes have been developed to automate the determination of the partitioning scheme. The PBblas attributes will consume and produce matrices as record sets of individual cell entry records, and each attribute will determine the partitioning scheme itself.
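As an illustration of that cell-based matrix interface, here is a hedged sketch of building a small matrix as a record set of cell entries (the `Layout_Cell` record and its field names reflect our reading of the PBblas bundle and should be checked against the published interface):

```ecl
IMPORT PBblas;

// Each cell record carries a work-item id (wi_id), the row (x),
// the column (y) and the value (v) of one non-empty matrix cell.
Cell := PBblas.Types.Layout_Cell;

// A 2 x 2 identity matrix for work item 1, expressed as cells.
// Zero cells are simply omitted.
A := DATASET([{1, 1, 1, 1.0},
              {1, 2, 2, 1.0}], Cell);
```

The caller never specifies a partition size: the PBblas attributes choose the partitioning internally from the shape of the data.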
The BLAS functions used by PBblas have been added to our ECL Standard Library (as of HPCC Systems 6.2.x).
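For single-node linear algebra those BLAS functions can also be called directly from the Standard Library. A minimal sketch is below; the `dgemm` parameter order shown is our assumption of the 6.2.x interface, so verify it against the Standard Library Reference before relying on it:

```ecl
IMPORT Std;

// Std.BLAS works on matrix_t values: a SET OF REAL8 holding the
// matrix contents in column-major order.
A := [1.0, 0.0,
      0.0, 1.0];        // 2 x 2 identity
B := [3.0, 4.0,
      5.0, 6.0];        // 2 x 2 matrix

// Assumed signature: dgemm(transposeA, transposeB, M, N, K, alpha, A, B)
// computes alpha * A * B. Since A is the identity, C should equal B.
C := Std.BLAS.dgemm(FALSE, FALSE, 2, 2, 2, 1.0, A, B);
OUTPUT(C);
```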
An attribute which is capable of running a multitude of small problems must provide the means for keeping the problems, parameters and results separated into distinct groupings. We are working on an approach that will allow interoperability between the HPCC Systems machine learning features.
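One way this grouping shows up in the restructured interfaces is a work-item field carried on every record, so many independent problems can travel through a single attribute call and come back still separated. A sketch using the ML_Core `NumericField` layout (the field names are our understanding of the bundle and should be confirmed against it):

```ecl
IMPORT ML_Core;

// {wi, id, number, value}: work-item id, observation id,
// field number and field value.
NF := ML_Core.Types.NumericField;

// Two independent one-feature problems, kept separate by the
// work-item id (wi). Results are returned tagged with the same wi,
// so problem 1 and problem 2 never mix.
X := DATASET([{1, 1, 1, 2.0},   // problem 1, observation 1, field 1
              {1, 2, 1, 4.0},   // problem 1, observation 2, field 1
              {2, 1, 1, 1.5},   // problem 2, observation 1, field 1
              {2, 2, 1, 3.0}], NF);
```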
Easier to use and contribute to
We’re putting in place a better system for processing the comments supplied by contributors to generate the documentation, using the codebase as the source. Comments will be required as part of the acceptance criteria for contributions and should be added as JavaDoc-style comments in the ECL code. On processing the pull request, the comments will be processed to build HTML documents with a searchable index, helping you to find what you are looking for easily. Once you’ve found it, you’ll have everything you need to know, including a description, the parameters (with any distribution or sequence requirements where relevant), the result, a description of any exceptions and the performance profile details. We also plan to extend the documentation to include test descriptions and performance graphs for attributes, since we will require validation tests as a condition for the acceptance of a pull request.
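To make the expectation concrete, here is a sketch of the kind of JavaDoc-style header a contributed attribute would carry (the attribute itself is hypothetical; only the comment format is the point):

```ecl
IMPORT ML_Core;
NumericField := ML_Core.Types.NumericField;

/**
  * Doubles each value in a NumericField dataset.
  * Hypothetical attribute, shown only to illustrate the
  * JavaDoc-style comment format required for contributions.
  *
  * @param ds  The input dataset in NumericField format.
  * @return    A dataset with each value field doubled.
  */
EXPORT DoubleValues(DATASET(NumericField) ds) :=
  PROJECT(ds,
          TRANSFORM(NumericField,
                    SELF.value := LEFT.value * 2,
                    SELF := LEFT));
```

The description, `@param` and `@return` entries are what the documentation build harvests into the searchable HTML pages.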
The first two new machine learning bundles available for use will be the Linear Regression and Logistic Regression bundles. Both will be available as of HPCC Systems 6.4.0 and details will be provided in a separate blog.
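As a preview, here is a hedged sketch of what fitting a model with the Linear Regression bundle may look like. The `ToField` macro, `OLS` module and `GetModel` attribute reflect our expectation of the 6.4.0 interface; treat them as assumptions until the bundle and its blog post are published.

```ecl
IMPORT ML_Core;
IMPORT LinearRegression AS LR;

// A tiny training set in a conventional row-oriented layout.
raw := DATASET([{1, 1.0, 2.1},
                {2, 2.0, 3.9},
                {3, 3.0, 6.2}],
               {UNSIGNED id, REAL x, REAL y});

// Convert to the cell-oriented NumericField layout used by the bundles;
// each non-id field becomes a numbered field per observation.
ML_Core.ToField(raw, fielded);

X := fielded(number = 1);                        // independent variable
Y := PROJECT(fielded(number = 2),                // dependent variable,
             TRANSFORM(ML_Core.Types.NumericField,
                       SELF.number := 1,         // renumbered to 1
                       SELF := LEFT));

model := LR.OLS(X, Y).GetModel;  // fit ordinary least squares
OUTPUT(model);                   // fitted coefficients as a model dataset
```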
The addition of three more machine learning bundles is planned for HPCC Systems 7.0.0 (targeted for the end of 2017): Stepwise Linear Regression, Stepwise Logistic Regression and LibSVM.
The next blog in this series will focus on performance characteristics using Matrix Multiply and Solve.