New release of our ECL Machine Learning libraries (ECL-ML)

A lot has happened since the version 1.0 release of our Machine Learning libraries. As you can see by checking out our ML portal (http://hpccsystems.com/ML), there are a ton of new algorithms and significant improvements to existing ones.

Logistic regression, for example, has received a much-needed revamp. This is worth highlighting because discriminative classification methods tend to be more widely used than generative methods, thanks to their lower asymptotic error (if you're curious about this, have a look at this classic paper from Andrew Ng: http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf). If you are a generative methods fan, however, check out our Naive Bayes classifier implementation too.

But possibly more interesting is the fact that all our classifiers, including the perceptron, logistic regression, and Naive Bayes, now sit behind a unified classifier interface, which allows them to be swapped in and out very easily. This is extremely convenient when you need to see which classifier is the best choice for a particular problem, or when you want to learn multiple models at once.
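
To make that concrete, here is a minimal sketch of the unified interface in action. The toy dataset is invented, and the attribute names (ML.ToField, ML.Discretize.ByRounding, ML.Classify.NaiveBayes and Logistic, LearnD/ClassifyD) reflect my reading of the ecl-ml sources, so verify them against the current repository:

    IMPORT ML;

    // Toy training data: two measurements plus a class label per record.
    value_record := RECORD
      UNSIGNED  rid;     // record id (first field, used as the ML record id)
      REAL      age;
      REAL      height;
      UNSIGNED1 species; // the class we want to predict
    END;
    d := DATASET([{1, 2.0, 30, 1}, {2, 2.5, 35, 1},
                  {3, 6.0, 60, 2}, {4, 7.0, 65, 2}], value_record);

    // Convert to the field-oriented layout the library operates on.
    ML.ToField(d, fields);

    // LearnD expects discrete fields, so discretize both sides.
    indep := ML.Discretize.ByRounding(fields(number <= 2)); // features
    dep   := ML.Discretize.ByRounding(fields(number = 3));  // labels

    // Every classifier exposes the same interface, so trying a
    // different one is a one-line change:
    classifier := ML.Classify.NaiveBayes;
    // classifier := ML.Classify.Logistic(); // swap in logistic regression

    model   := classifier.LearnD(indep, dep);
    results := classifier.ClassifyD(indep, model);
    OUTPUT(results);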

I can't leave the topic of classifiers without mentioning classical decision trees, which are also part of this release.

Clustering now includes K-D Trees (http://en.wikipedia.org/wiki/K-d_tree), a very cool data structure widely used for efficient multi-dimensional nearest neighbor searches and for GIS storage and retrieval, among other applications.
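
The K-D Tree attributes are the newcomers here, but the quickest way to get a feel for the Cluster module they join is the existing K-Means attribute. A minimal sketch, assuming the signature I remember from the ecl-ml sources (points, initial centroids, an iteration cap, and a convergence threshold); the data is made up:

    IMPORT ML;

    // Points and initial centroids, both in the {id, number, value}
    // NumericField layout; two obvious groups around (1,1) and (8,8).
    points := DATASET([{1, 1, 1.0}, {1, 2, 1.1},
                       {2, 1, 1.2}, {2, 2, 0.9},
                       {3, 1, 8.0}, {3, 2, 8.3},
                       {4, 1, 8.2}, {4, 2, 7.9}], ML.Types.NumericField);
    centroids := DATASET([{1, 1, 2.0}, {1, 2, 2.0},
                          {2, 1, 7.0}, {2, 2, 7.0}], ML.Types.NumericField);

    // Up to 30 iterations, stopping early once movement drops below 0.3.
    clustering := ML.Cluster.KMeans(points, centroids, 30, 0.3);
    OUTPUT(clustering.Convergence); // iterations needed to converge
    OUTPUT(clustering.Result());    // final centroid positions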

The discretizers, which provide an easy way to convert continuous values into the discrete buckets a classifier can consume, have also been subject to substantial review and improvement, and now sport a much better, easier-to-use interface.
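
As a quick taste, here are three of the bucketing strategies as I understand them from the ecl-ml sources (ByRounding, ByBucketing, ByTiling); the values are invented and the exact signatures may differ slightly between versions:

    IMPORT ML;

    // One continuous field in the {id, number, value} NumericField layout.
    raw := DATASET([{1, 1, 23.7}, {2, 1, 41.2},
                    {3, 1, 67.9}, {4, 1, 18.4}], ML.Types.NumericField);

    rounded  := ML.Discretize.ByRounding(raw);     // round to the nearest integer
    bucketed := ML.Discretize.ByBucketing(raw, 4); // 4 equal-width buckets
    tiled    := ML.Discretize.ByTiling(raw, 4);    // 4 equal-population tiles
    OUTPUT(bucketed);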

On the document n-gram extraction functions, the inclusion of a Porter stemmer (http://tartarus.org/~martin/PorterStemmer/) is a very welcome addition, ensuring that English word endings and inflections don't get in the way of NLP-related tasks.
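
To show where stemming fits, here is a rough sketch of a document tokenization pipeline. The ML.Docs.Tokenize attributes (Clean, Split, Lexicon) reflect my reading of the ecl-ml sources, while the commented-out stemming call is purely an assumption standing in for however the Porter stemmer is actually exposed, so treat this as a sketch to be checked against the repository:

    IMPORT ML;

    // Raw documents: {id, text}.
    docs := DATASET([{1, 'the walking dogs were barking loudly'}],
                    ML.Docs.Types.Raw);

    // Normalize the text, then break it into a word stream.
    words := ML.Docs.Tokenize.Split(ML.Docs.Tokenize.Clean(docs));

    // A stemming pass would slot in here, mapping 'walking' -> 'walk',
    // 'barking' -> 'bark', and so on, before the lexicon is built:
    // stemmed := ML.Docs.Porter.Stem(words); // attribute name is an assumption

    // Assign word ids and build the vocabulary.
    lexicon := ML.Docs.Tokenize.Lexicon(words);
    OUTPUT(lexicon);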

A couple of other interesting newcomers are Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), both for dimensionality reduction, which comes in handy, for example, when building anomaly detection systems. SVD is also useful for dealing with certain linguistic ambiguity problems and, if you're interested in this particular topic, this general tutorial should help: http://www.minerazzi.com/tutorials/singular-value-decomposition-fast-track-tutorial.pdf.
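
For those who haven't run into it, the one-line intuition for why SVD reduces dimensionality: it factors a data matrix into orthogonal components ordered by decreasing importance, so truncating to the top k singular values yields the best rank-k approximation of the original matrix; PCA amounts to doing the same thing on mean-centered data. In symbols:

    X   = U \Sigma V^T        (full decomposition of the m-by-n data matrix X)
    X_k = U_k \Sigma_k V_k^T  (best rank-k approximation: keep only the k
                               largest singular values)

where U and V have orthonormal columns and \Sigma is diagonal, holding the singular values in decreasing order.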

One area that saw significant development is visualization. With the addition of visualization components to ECL, collectively included within our VL sub-tree, HPCC now provides a simple, straightforward way to display graphical charts of tabular results, making it very convenient to quickly spot interesting patterns in the data. I can't emphasize enough the ease of use, and the fact that it doesn't require resorting to any external tools (batteries included here, too!).

In sum, if you are a machine learning professional, or even if you just have some interest in highly scalable distributed machine learning implementations, head over to http://hpccsystems.com/ML and take a look. You won't be disappointed.

Flavio Villanustre