by Flavio Villanustre on 05/10/2012
It is not uncommon to find situations where a classification model needs to be trained on a very large amount of historical data, yet new data must be classified in real time. There are many examples of this need, from real time sentiment analysis of tweets or news, to anomaly detection for fraud or fault identification. The common theme in all these cases is that the value of a real time data feed decays steeply over time, and delayed decisions taken on this data are significantly less effective.
When faced with this challenge, traditional platforms tend to fall short of expectations. Platforms that can handle significant amounts of historical data and a very large number of features to create classification models (Hadoop is an example of such a platform) have no good option for real time classification using these models. This type of problem is quite common in, for example, text classification. In these cases, people usually need to resort to different tools, and even to homegrown systems built with Python and a myriad of other tools, to cope with this real time need.
The problem with these homegrown tools is that they need to meet all the concurrency and availability requirements that real time systems impose, as these online systems usually fulfill critical internal or external roles for the business (the one anomaly that you just missed because your real time classifier didn't work properly could represent significant losses for the business).
What makes this even more challenging is the fact that, many times, it is desirable to retrieve and compare specific examples from the training set used to create the model, in real time too. And while developing a system that can classify data in real time using a pre-existing model may be quite doable, being able to also retrieve analogous or related cases would certainly require coupling the system with a database of sorts (just another moving part that adds complexity and cost to the system and potentially reduces its overall reliability).
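To make the retrieval requirement concrete, here is a minimal sketch in plain Python (not ECL, and not anything taken from the HPCC Systems platform) of looking up related training records for a new observation. The function names `build_index` and `related_records`, the toy records, and the word-overlap ranking are all assumptions made for illustration; a production system would need a proper index and a better similarity measure.

```python
from collections import defaultdict


def build_index(training_set):
    """Map each word to the ids of the training records that contain it."""
    index = defaultdict(set)
    for record_id, (words, _label) in enumerate(training_set):
        for word in set(words):
            index[word].add(record_id)
    return index


def related_records(index, training_set, words, top_k=2):
    """Return up to top_k training records ranked by number of shared words."""
    overlap = defaultdict(int)
    for word in set(words):
        for record_id in index.get(word, ()):
            overlap[record_id] += 1
    # Most shared words first; ties broken by the lower record id.
    ranked = sorted(overlap, key=lambda rid: (-overlap[rid], rid))
    return [training_set[rid] for rid in ranked[:top_k]]


# Hypothetical toy training set: (words, label) pairs.
training_set = [
    (["card", "declined", "abroad"], "fraud"),
    (["card", "payment", "groceries"], "normal"),
    (["large", "transfer", "abroad"], "fraud"),
]
index = build_index(training_set)
matches = related_records(index, training_set, ["card", "used", "abroad"])
```

Even in this toy form, the index is a second data structure that has to be built, stored and kept in sync with the model, which is exactly the extra moving part described above.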
But look no more, as the HPCC Systems platform may be just what you have been looking for all along: a consistent and homogeneous platform that provides both functions, and a seamless workflow to move new and updated models from the system where they are developed (Thor) to the real time classifier (Roxie).
At this point, it's probably worth explaining a little bit about how Roxie works. Roxie is a distributed, highly concurrent and highly available programmable data delivery system. Data queries (in a way equivalent to the stored procedures in your legacy RDBMS) are coded in ECL, the same high level, data-oriented, declarative programming language that powers Thor. Roxie is built for the most stringent high availability requirements: the data and system redundancy factor is defined by the user at configuration time, with no single point of failure across the entire system. ECL code can be reused across both systems, Thor and Roxie.
A scenario like the one I described above can be easily implemented in the HPCC Systems platform by running your entire historical training set through one of the classifiers provided by the ECL-ML (ECL Machine Learning) modules on Thor. To make this even more compelling, all the classifiers in ECL-ML have been designed with a common interface in mind, so plugging in a different classifier (for example, switching from a generative to a discriminative model) is as simple as changing a single line of ECL code. After a model (or several) is created, it can be validated on a test and/or verification set, and then moved to Roxie for real time classification and matching. The entire training set can also be indexed and moved to Roxie, if real time retrieval of related records is required.
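The "common interface" idea is easier to see in code. The sketch below is plain Python, not ECL, and it does not reproduce the ECL-ML API; the class names `NaiveBayes` and `Perceptron`, the `fit`/`predict` methods and the toy data are assumptions chosen only to illustrate how two classifiers that share an interface can be swapped by editing one line.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Generative model: multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class Perceptron:
    """Discriminative model: multiclass perceptron over bag-of-words features."""

    def __init__(self, epochs=10):
        self.epochs = epochs

    def fit(self, docs, labels):
        self.labels = sorted(set(labels))
        self.weights = {label: defaultdict(float) for label in self.labels}
        for _ in range(self.epochs):
            for words, label in zip(docs, labels):
                guess = self.predict(words)
                if guess != label:
                    # Reward the true class, penalize the wrong guess.
                    for w in words:
                        self.weights[label][w] += 1.0
                        self.weights[guess][w] -= 1.0
        return self

    def predict(self, words):
        return max(self.labels,
                   key=lambda label: sum(self.weights[label][w] for w in words))


# Hypothetical toy training set.
train_docs = [["great", "movie"], ["loved", "it"],
              ["terrible", "movie"], ["hated", "it"]]
train_labels = ["pos", "pos", "neg", "neg"]

# The single line to change when switching models:
model = NaiveBayes().fit(train_docs, train_labels)
# model = Perceptron().fit(train_docs, train_labels)

prediction = model.predict(["loved", "movie"])
```

Because both classes expose the same `fit` and `predict` methods, everything after the line that constructs the model (validation, deployment, real time scoring) stays unchanged when the model is swapped.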
Powerful, simple, elegant, reliable. And every one of these components is available under an open source license, for you to play with.
For more information, head over to our HPCC Systems portal (http://hpccsystems.com).