As the end of the 2012 calendar year approaches, at least for a good chunk of the world (and may come to an end on 12/21 for a crazy bunch), some people start celebrating holidays in different cultures and countries. I consider this season a good time to go over things to come in the HPCC Systems platform arena.
Our 3.8 release is out (3.8.6-4 can be downloaded from here) and, while there may still be minor bug fixes (no, there are never bugs, just "under-appreciated features" that we may want to get rid of), 3.10 is well in the works. 3.10, which should be released in the next few weeks, changes the open source license to Apache 2.0 and brings a number of enhancements to the platform in different areas, as you can see from the commit history in our GitHub source code repository.
4.0, the next major release, is already in the plans and will bring a number of exciting new features, including improvements to our ECL Playground (if you haven’t played with it, I strongly recommend it), significant improvements to ECL watch, support for very fast linear algebra through PB-BLAS (not to be confused with PBLAS) and some other interesting developments, such as more thorough integration with R and reporting tools through external connectors, improved SQL/JDBC connectivity on dynamic Roxie queries for interactive reporting tools, and improvements around documentation and usage examples.
In the next few weeks, also expect to see improvements to our Portal, with the addition of a Wiki for collaborative documentation and a description of our general HPCC Systems roadmap and ongoing projects, to help community members decide if they’d like to join any of these efforts. As part of this move, we are planning to include specific projects that could be good starters for some community members, so please let us know if you would like to tackle any of those.
During 2013, we will be actively working to continue raising awareness for the HPCC Systems platform, and will be specifically focused on community building activities (details coming later). And, from the Exciting Training Department, we are currently working on creating a significant amount of materials for our upcoming MOOC (Massive Open Online Courses), which will help you learn everything that you ever wanted to know about ECL, but were afraid to ask.
And now the shameless plug: Trish and I are tasked with the organization of the first Big Data track as part of the 2013 Symposium on Collaborative Technologies, in cooperation with ACM and IEEE, to happen in May, in San Diego, so please feel free, and a little bit compelled? :) to submit papers, present posters and let us know of any other way that you may want to help.
Now go and enjoy with your families and have a great holiday season! Happy Hacking!
I often get asked about comparing the HPCC Systems platform and Hadoop. As many of you probably know already, there are a number of substantial differences between them, and several of these differences are described here.
In a few words, HPCC and Hadoop are both open source projects released under an Apache 2.0 license, and are free to use, with both leveraging commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across this architecture. But this is where most of the similarities end.
From a timeline perspective, HPCC was originally designed and developed about 12 years ago (1999-2000); our first patent around HPCC technology was even filed back in 2002, and HPCC was in production across our systems back in 2002. To put things in perspective, it wasn’t until December 2004 that the two researchers from Google described the distributed computing model based on Map and Reduce. The Hadoop project didn’t start until 2005, if I remember correctly, and it was around 2006 when it split from Nutch to become its own top level project.
This doesn’t necessarily mean that you couldn’t say that certain HPCC operations don’t use an scatter and gather model (equivalent to Map and Reduce), as applicable, but HPCC was designed under a different paradigm to provide for a comprehensive and consistent high-level and concise declarative dataflow oriented programming model, represented by the ECL language used throughout it. What this really means, is that you can express data workflows and data queries in a very high level manner, avoiding the complexities of the underlying architecture of the system. While Hadoop has two scripting languages which allow for some abstractions (Pig and Hive), they don’t compare with the formal aspects, sophistication and maturity of the ECL language which provides for a number of benefits such as data and code encapsulation, the absence of side effects, the flexibility and extensibility through macros, functional macros and functions, and the libraries of production ready high level algorithms available.
One of the significant limitations of the strict MapReduce model utilized by Hadoop, is the fact that internode communication is left to the Shuffle phase, which makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute (as they need to go through multiple phases of Map, Shuffle and Reduce, each one of these representing a barrier operation that forces the serialization of the long tails of execution). In contrast, the HPCC Systems platform provide for direct inter-node communication at all times, which is leveraged by many of the high level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds for overhead from the JVM; in contrast, HPCC and ECL are compiled into C++, which executes natively on top of the Operating System, lending to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware).
The HPCC Systems platform, as you probably saw, has two components: a back-end batch oriented data workflow processing and analytics system called Thor (equivalent to Hadoop MapReduce), and a front-end real-time data querying and analytics system called Roxie (which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to store procedures in your traditional RDBMS). The closest to Roxie that you have with Hadoop is Hbase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie, on the other hand, allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.
But above all, the HPCC Systems platform presents the users with a homogeneous platform which is production ready and has been proven for many years in our own data services, from a company which has been in the Big Data Analytics business even before Big Data was called Big Data.
As I was preparing the Keynote that I delivered at World-Comp'12, about Machine Learning on the HPCC Systems platform, it occurred to me that it was important to remark that when dealing with big data and machine learning, most of the time and effort is usually spent on the data ETL (Extraction, Transformation and Loading) and feature extraction process, and not on the specific learning algorithm applied. The main reason is that while, for example, selecting a particular classifier over another could raise your F score by a few percentage points, not selecting the correct features, or failing to cleanse and normalize the data properly can decrease the overall effectiveness and increase the learning error dramatically.
This process can be especially challenging when the data used to train the model, in the case of supervised learning, or that needs to be subject to the clustering algorithm, in the case of, for example, a segmentation problem, is large. Profiling, parsing, cleansing, normalizing, standardizing and extracting features from large datasets can be extremely time consuming without the right tools. To make things worse, it can be very inefficient to move data during the process, just because the ETL portion is performed on a system different to the one executing the machine learning algorithms.
While all these operations can be parallelized across entire datasets to reduce the execution time, there don't seem to be many cohesive options available to the open source community. Most (or all) open source solutions tend to focus on one aspect of the process, and there are entire segments of it, such as data profiling, where there seem to be no options at all.
Fortunately, the HPCC Systems platform includes all these capabilities, together with a comprehensive data workflow management system. Dirty data ingested on Thor can be profiled, parsed, cleansed, normalized and standardized in place, using either ECL, or some of the higher level tools available, such as SALT (see this earlier post) and Pentaho Kettle (see this page). And the same tools provide for distributed feature extraction and several distributed machine learning algorithms, making the HPCC Systems platform the open source one stop shop for all your big data analytics needs.
If you want to know more, head over to our HPCC Systems Machine Learning page and take a look for yourself.
More than 12 years ago, back in 2000, LexisNexis was pushing the envelope on what could be done to process and analyze large amounts of data with commercially available solutions at the time. The overall data size, combined with the large number of records and the complexity of the processing required made existing solutions non-viable. As a result, LexisNexis invented, from the ground up, a data-intensive supercomputer based on a parallel share-nothing architecture running on commodity hardware, which ultimately became the HPCC Systems platform.
To put this in a time perspective, it wasn't until 2004 (several years later) that a pair of researchers from Google published a paper on the MapReduce processing model, which fueled Hadoop a few years later.
The HPCC Systems platform was originally designed, tested and refined to specifically address big data problems. It can perform complex processing of billions (or even trillions) of records, allowing users to run analytics in their entire data repository, without resorting to sampling and/or aggregates. Its real-time data delivery and analytics engine (Roxie) can handle thousands of simultaneous transactions, even on complex analytical models.
As part of the original design, the HPCC Systems platform can handle disparate data sources, with changing data formats, incomplete content, fuzzy matching and linking, etc., which are paramount to LexisNexis proprietary flagship linking technology known as LexID(sm).
But it is thanks to ECL, the high-level data-oriented declarative programming language powering the HPCC Systems platform, that this technology is truly unique. With advanced concepts such as data and code encapsulation, lazy evaluation, prevention of side effects, implicit parallelism and code reuse and extensibility, is that data scientists can focus on what needs to be done, rather on superfluous details around the specific implementation. These characteristics make the HPCC Systems platform significantly more efficient than anything else available in the marketplace.
Last June, almost a year ago, LexisNexis decided to release its supercomputing platform, under the HPCC Systems name, giving enterprises the benefit of an open source data intensive supercomputer that can solve large and complex data challenges. One year later, HPCC Systems has made a name for itself and built an impressive Community. Moreover, the HPCC Systems platform has been named one of the top five "start-ups" to watch and has been included in a recent Gartner 2012 Cool IT Vendors report.
LexisNexis has made an impact in the marketplace with its strategic decision to open source the HPCC Systems platform: a bold and innovative decision that can only arise from a Company which prides itself of being a thought leader, when it comes to Technology and Big Data analytics.
One of our community members recently asked about fraud detection using the HPCC Systems platform. The case that this person described involved identifying potentially fraudulent traders, who were performing a significant number of transactions over a relatively short time period. As I was responding to this post in our forums, and trying to keep the answer concise enough to fit in the forums format, I thought that it would be useful to have a slightly more extensive post, around ideas and concepts when designing an anomaly detection system on the HPCC Systems platform.
For this purpose I'll asume that, while it's possibly viable to come up with a certain number of rules to define how normal activity looks like even though the number of rules could be large, it's probably unfeasible to come up with rules that would describe every potential anomalous behavior (fraudsters can be very creative!). I will also assume that while, in certain cases, individual transactions could be flagged as anomalous due to characteristics in the particular data record, in the most common case, it is through aggregates and statistical deviations that an anomaly can be identified.
The first thing to define is the number of significant dimensions (or features) the data has. If there is one dimension (or very few dimensions), where most of the significant variability occurs, it could be conceivable to manually define rules that, for example, would mark transactions beyond 3 or 4 sigma (standard deviations from the mean for the particular dimension) as suspicious. Unfortunately, things are not always so simple.
Generally, there are multiple dimensions, and identifying by hand those that are the most relevant can be tricky. In other cases, performing some type of time series correlation can identify suspicious cases (for example, in the case of logs for a web application, seeing that someone has logged in from two locations a thousand miles apart in a short time frame could be a useful indicator). Fortunately, there are certain machine learning methodologies that can come to the rescue.
One way to tackle this problem is to assume that we can use historical data on good cases to train a statistical model (remember that bad cases are scarce and too variable). This is known as a semi-supervised learning technique, where you train your model only on the "normal" activity and expect to detect anomalous cases that exhibit characteristics which are different from the "norm". One specific method that can be used for this purpose is called PCA (Principal Components Analysis), which can automatically reduce the number of dimensions to those that present the largest significance (there is a loss of information as a consequence of this reduction, but this tends to be minimal compared to the value of reducing the computational complexity). KDA (Kernel Density Estimation) is another semi-supervised method to identify outliers. On the HPCC Systems Platform, PCA is supported through our ECL-ML machine learning module. KDA is currently available on HPCC through the ECL integration with Paperboat .
A possible more interesting approach is to use a completely unsupervised learning methodology. Using a clustering technique such as agglomerative hierarchical clustering, supported in HPCC, as part of the ECL-ML machine learning module, can help identify those events which don't clusterize easily. Other clustering method also available on ECL-ML, k-means, is less effective as it requires to define the number of centroids a priori, which could be very difficult. When using agglomerative hierarchical clustering, one of the aspects that could require some experimentation is to identify the number of iterations required to have the best effectiveness: too many iterations and there will be no outliers as all the data will be clusterized, too few iterations and many normal cases could still be outside of the clusters.
Beyond these specific techniques, the best possible approach probably includes a combination of methods. If there are clear rules that can quickly identify suspicious case, those could be used to validate or rule out results from statistical algorithms, and since a strictly rules based system would be ineffective to detect every possible outlier, using some of the machine learning methodologies described above too, would be highly recommended.
You probably thought that the HPCC Systems platform and Hadoop were two technologies that represented the opposite ends of a spectrum, and that choosing one would make attempting to use the other, unrealistic. If this is what you believed: think again (and keep reading).
The HPCC Systems platform has just released its Hadoop data integration connector. The HPCC/HDFS integration connector provides a way to seamlessly access data stored in your HDFS distributed filesystem from within the Thor component of HPCC. And, as an added bonus, it also allows you to write to HDFS from within Thor.
As you can see, this new feature enables several opportunities to leverage HPCC components from within your existing Hadoop cluster. One such application would be to plug the Roxie real-time distributed data analytics and delivery system, providing real time access to complex data queries and analytics, to data processed in your Hadoop cluster. It would also allow you to leverage the distributed machine learning and linear algebra libraries that the HPCC platform offers through its ECL-ML (ECL Machine Learning) module. And if you needed a highly efficient and highly reliable data workflow processing system, you could take advantage of the HPCC Systems platform and ECL, or even combine it with Pentaho Kettle/Spoon, to add a graphical interface to ETL and data integration.
So what does it take to use the HPCC/HDFS connector (or H2H, as we like to call it)? Not much! The H2H connector has been packaged to include all the necessary components, which are to be deployed to every HPCC node. HPCC can coexist with Hadoop, or run on a different set of nodes (which is normally recommended for performance reasons).
How did we do it? We leveraged the capabilities of ECL to pipe data in and out of a running workunit, through the ECLPipe command, and we created some clever ECL Macros (did I mention before that ECL Macros are awesome?) to provide for adequate data and function mappings from within an ECL program. Thanks to this, using H2H is transparent to the ECL software developer, and HDFS becomes just an option of a particular type of data repository.
What are the gotchas? Well, HDFS is not as efficient as the distributed filesystem used by HPCC, so this data read and write will not be any faster than HDFS allows (but it won't be sensibly slower either). Another caveat is that transparent access to compressed data (as it's normally provided by HPCC) is not available to data accessed from within HDFS (although decompression can be achieved easily in a following step, after the data is read).
I hope you are as excited as we are, about this HPCC/Hadoop data integration initiative. Please take a look at the H2H section of our HPCC Systems portal for more information: http://hpccsystems.com/H2H, and don't hesitate to send us your feedback. This HPCC/HDFS connector is still in beta stage, but we expect to have a 1.0 release very soon.
It is not uncommon to find situations where a classification model needs to be trained using a very large amount of historic data, but the ability to perform classification of new data in real time is required. There are many examples of this need, from real time sentiment analysis in tweets or news, to anomaly detection for fraud or fault identification. The common theme in all these cases is that the value of the real time data feeds has a steep decrease over time, and delayed decisions taken on this data are significantly less effective.
When faced with this challenge, traditional platforms tend to fall short of expectations. Those platforms that can deal with significant amounts of historical data and a very large number of features to create classification models (Hadoop is an example of such a platform), have no good option for real time classification using these models. This type of problems are quite common, for example, in text classification. In these cases, People usually need to resort to different tools, and even homegrown systems using Python and a myriad of other tools, to cope with this real time need.
The problem with these homegrown tools, is that they need to meet all the concurrency and availability requirements that real time systems impose, as these online systems are usually critical to fulfill important internal or external roles for the business (the one anomaly that you just missed because your real time classifier didn't work properly, could represent significant losses for the business).
What makes this even more challenging is the fact that, many times, it is desirable to retrieve and compare specific examples from the training set used to create the model, in real time too. And while developing a system that can classify data in real time using a pre-existing model may be quite doable, being able to also retrieve analogous or related cases would certainly require coupling the system with a database of sorts (just another moving part that adds complexity and cost to the system and potentially reduces its overall reliability).
But look no more, as the HPCC Systems platform may be just what you have been looking for all along: a consistent and homogeneous platform that provides for both functions, and a seamless workflow to move new and updated models, from the system where they are developed (Thor), to the real time classifier (Roxie).
At this point, it's probably worth explaining a little bit how Roxie works. Roxie is a distributed, highly concurrent and highly available, programmable data delivery system. Data queries (in a way equivalent to the stored procedures in your legacy RDBMS) are coded using ECL, which is the same high level data-oriented declarative programming language that powers Thor. Roxie is built for the most stringent high availability requirements, and the data and system redundancy factor is defined by the user at configuration time, with no single point of failure across the entire system. ECL code developed can be reused across both systems, Thor and Roxie.
A scenario like the one I described above, can be easily implemented in the HPCC Systems platform, using one of the classifiers provided by the ECL-ML (ECL Machine Learning) modules on Thor, and running your entire historical training set. To make this even more compelling, all the classifiers in ECL-ML have been designed with a common interface in mind, so plugging in a different classifier (for example, switching from a generative to a discriminative model) is as simple as changing a single line of ECL code. After a model (or several) is created, it can be tested on a test and/or verification set to validate it, and moved to Roxie for real time classification and matching. The entire training set can also be indexed and moved to Roxie, if real time retrieval of related records is required.
Powerful, simple, elegant, reliable. And every one of these components are available under an open source license, for you to play with.
For more information, head over to our HPCC Systems portal (http://hpccsystems.com).
At HPCC Systems we have been very busy finding better ways to communicate with our Community. As a result of this, we have just released the first edition of our official HPCC Systems podcast, in which the Host and our Program Manager, Trish McCall, has a conversation with our senior trainer Bob Foreman around different aspects of the HPCC Systems platform, the ECL data-intensive programming language and some other topics that we hope you will find interesting.
In upcoming editions, we plan on having guests (Hint, hint! Let us know if you would like to be one of them!) covering new developments and the roadmap for HPCC Systems, discussions on specific capabilities around Machine Learning and Natural Language Processing, some coverage on SALT, our Scalable Automated Linking Technology, and much more.
For this first edition, Trish and Bob tried hard to keep the content under 30 minutes, which is just about perfect for a medium sized commute.
Don't waste a minute and head over to our podcasts page, or find it in iTunes. Please send us feedback and don't forget to rate it in iTunes, if you like it.
Don't be surprised by the title: I'm not trying to play down the link between high blood pressure and a diet rich in Sodium. In the HPCC Systems platform world, SALT has a completely different meaning.
SALT is an acronym for Scalable Automated Linking Technology, and it's a programming environment support tool which functions as an ECL code generator to automatically produce ECL code for a variety of data integration applications, addressing common data processes based on a small configuration file of user-defined specification statements.
SALT has been designed from the ground up, as a system for effective record linking and clustering, but it also has capabilities around data ingest, data profiling, data hygiene, data source consistency monitoring and data quality, and data update management. In addition to this, SALT can generate the inverted data records required for Boolean Search Engine applications.
SALT offers many advantages when developing a new data-oriented application. SALT encapsulates a significant amount of ECL programming knowledge, experience, and best practices for the types of applications supported and can result in significant increases in developer productivity. It affords significant reductions in implementation time and cost over a hand-coded approach.
In case you wonder how this all works in practice, the SALT process begins with a user defined specification file. This is a text file with statements and parameters that define the data file and fields to be processed, and the associated processing options such as the module into which the generated code will be imported. The SALT command line interface reads the specification file, and based on various command line options, generates an output file in ECL, which can be easily imported into an existing ECL project. The resulting file includes attributes which can be executed on a Data Refinery (THOR) to perform the process for which the code has been generated, such as linking the records in a file. Depending on the process, SALT also generates code to define and build appropriate key files and queries for deployment to the Rapid Data Delivery Engine (Roxie).
In general, record linking fits into a general class of data processing known as data integration, which can be defined as the problem of combining information from multiple heterogeneous data sources. Data integration can include data preparation steps such as parsing, profiling, cleansing, normalization, and parsing and standardization of the raw input data prior to record linkage to improve the quality of the input data and to make the data more consistent and comparable (these data preparation steps are sometimes referred to as ETL or extract, transform, load). SALT provides data profiling and data hygiene applications to support the data preparation process. In addition SALT provides a general data ingest application which allows input files to be combined or merged with an existing base file. You can also use SALT to generate a parsing and classification engine for unstructured data which can be use for data preparation. The data preparation steps are usually followed by the actual record linking or clustering process. SALT provides applications for several different types of record linking including internal, external, and remote.
Data profiling, data hygiene and data source consistency checking, while key components of the record linking process, have their own value within the data integration process. All of these are also supported by SALT and can be leveraged even when record linking is not a necessary part of a particular data work unit.
SALT uses advanced concepts such as term specificity to determine the relevance/weight of a particular field in the scope of the linking process, and a mathematical model based on the input data, rather than the need for
hand coded user rules, which is key to the overall efficiency of the method.
For more information on SALT, head over to our HPCC Systems portal and take a look for yourself.
While the ECL-ML (ECL Machine Learning) libraries currently support a variety of prevalent algorithms in machine learning, there could always be the need for the one that has not been added just yet. And, the fact that ECL-ML provides a distributed linear algebra library, which greatly simplifies distributed vectorized implementations, is a blessing, but it still requires some coding in ECL to add new algorithms.
Fortunately, the folks over at Ismion, Inc., and particularly Nick Vasiloglou, have done something about it. They have ported their highly optimized PaperBoat library to HPCC, and made the integration so seamless (through some very clever abstraction layer based on ECL Macros) that PaperBoat functions are available as native ECL definitions.
The most interesting aspects of PaperBoat are around efficiency, and I couldn't say it more clearly than the authors themselves:
PaperBoat has been developed based on the C++ template metaprogramming principles. This approach makes PaperBoat easily configurable and efficient. For example the data are stored always on the minimum data precision needed.Every column is stored in the precision the user specifies. Columns with the same precision are stored next to each other. This triggers vectorization speedups offered in any modern preprocessor. Also libraries like BLAS/LAPACK/FLAME can speed up vector operations. Templatization also avoids the virtual function overhead and it allows the compiler to do extensive optimizations, since all the code is available at compile time. Our experiments showed 4x speedup over an implementation of the library with virtual functions. Another feature of PaperBoat is threading. All fundamental algorithms are tasks that are executed asynchronously. Synchronization of tasks is based on a data availability model inspired by datalog. Another advantage of the PaperBoat is the multidimensional indexing structures that can speed up orders of magnitude machine learning algorithms. Multidimensional trees can speed up things in two ways, either by clever stratified sampling algorithms or by clever geometric tricks, that lead to efficient branch and bound pruning.
The list of algorithms currently supported by PaperBoat is also enticing. While there is some overlap with what is currently available on ECL-ML, and ECL-ML could be preferable for massive amounts of data with large number of features, PaperBoat could be a choice for less extreme cases. And the beauty of it, is that the user gets to choose by changing a single line of ECL (or, why not, even run both and compare the results?).
One interesting algorithm that is available under PaperBoat and not ECL-ML yet, for example, is LASSO, a method for regression shrinkage and selection in linear models, which is favored by some people in the scoring and analytics industry (and if you're curious about LASSO, you can check out the original paper here: http://www-stat.stanford.edu/~tibs/lasso/lasso.pdf).
I hope that, by now, I picked up your interest, so don't waste any more time and head over to the HPCC Systems portal, and then check out http://ismion.com/documentation/ecl-pb/index.html for a good tutorial on PaperBoat and ECL.