Sailing through machine learning on a PaperBoat
While the ECL-ML (ECL Machine Learning) libraries currently support a variety of prevalent machine learning algorithms, there will always be a need for one that has not been added just yet. And the fact that ECL-ML provides a distributed linear algebra library, which greatly simplifies distributed vectorized implementations, is a blessing, but adding new algorithms still requires some coding in ECL.
Fortunately, the folks over at Ismion, Inc., and particularly Nick Vasiloglou, have done something about it. They have ported their highly optimized PaperBoat library to HPCC, and made the integration so seamless (through some very clever abstraction layer based on ECL Macros) that PaperBoat functions are available as native ECL definitions.
The most interesting aspects of PaperBoat revolve around efficiency, and I couldn’t say it more clearly than the authors themselves:
PaperBoat has been developed based on C++ template metaprogramming principles. This approach makes PaperBoat easily configurable and efficient. For example, the data are always stored at the minimum precision needed: every column is stored in the precision the user specifies, and columns with the same precision are stored next to each other. This triggers the vectorization speedups offered by any modern processor, and libraries like BLAS/LAPACK/FLAME can further speed up vector operations. Templatization also avoids the virtual function overhead and allows the compiler to do extensive optimizations, since all the code is available at compile time. Our experiments showed a 4x speedup over an implementation of the library with virtual functions. Another feature of PaperBoat is threading: all fundamental algorithms are tasks that are executed asynchronously, and synchronization of tasks is based on a data availability model inspired by Datalog. Another advantage of PaperBoat is its multidimensional indexing structures, which can speed up machine learning algorithms by orders of magnitude. Multidimensional trees can speed things up in two ways: either by clever stratified sampling algorithms or by clever geometric tricks that lead to efficient branch-and-bound pruning.
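The templates-versus-virtual-functions point is worth unpacking. Here is a minimal C++ sketch (my own illustration, not PaperBoat code) contrasting dynamic dispatch, where every call goes through the vtable, with a templated version where the concrete type is known at compile time, so the compiler can inline the call and vectorize the loop:

```cpp
#include <cassert>

// Dynamic dispatch: each apply() call goes through the vtable,
// which blocks inlining and many compile-time optimizations.
struct Kernel {
    virtual double apply(double x) const = 0;
    virtual ~Kernel() = default;
};

struct SquareKernel : Kernel {
    double apply(double x) const override { return x * x; }
};

double sum_virtual(const Kernel& k, const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += k.apply(v[i]);
    return s;
}

// Static dispatch: the kernel type is a template parameter, so
// all the code is visible at compile time and apply() can be
// inlined, letting the compiler vectorize the loop.
template <typename K>
double sum_template(const K& k, const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += k.apply(v[i]);
    return s;
}
```

Both functions compute the same result; the difference is purely in what the optimizer is allowed to see, which is where speedups like the quoted 4x come from.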
The list of algorithms currently supported by PaperBoat is also enticing. While there is some overlap with what is currently available in ECL-ML, and ECL-ML may be preferable for massive amounts of data with a large number of features, PaperBoat could be the choice for less extreme cases. And the beauty of it is that the user gets to choose by changing a single line of ECL (or, why not, even run both and compare the results?).
One interesting algorithm that is available under PaperBoat and not ECL-ML yet, for example, is LASSO, a method for regression shrinkage and selection in linear models, which is favored by some people in the scoring and analytics industry (and if you’re curious about LASSO, you can check out the original paper here: http://www-stat.stanford.edu/~tibs/lasso/lasso.pdf).
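For reference, LASSO (as described in Tibshirani's paper linked above) fits a linear model under an L1 penalty, which in its penalized form reads:

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```

The L1 penalty drives some coefficients exactly to zero as the regularization weight grows, which is what gives LASSO its combination of shrinkage and automatic variable selection.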
I hope that, by now, I have piqued your interest, so don’t waste any more time: head over to the HPCC Systems portal, and then check out http://ismion.com/documentation/ecl-pb/index.html for a good tutorial on PaperBoat and ECL.