Last week, close to 300 people met in Delray Beach, Florida, to follow an intensive and densely packed agenda full of technology content on the HPCC Systems platform. It was our third annual HPCC Summit event and doubled its attendance from the previous year. Besides the great food, party and accommodations (yeah, I thought I should mention that too), there were many plenary and break-out sessions covering a broad range of areas including HPCC roadmap, new developments, recent enhancements, applications, integration with third party platforms and more.
Some of the presentations were plain brilliant, providing material that would take weeks to fully digest, and many of the attendees will be watching the video recordings of them for months to come (in case you were wondering, we also recorded most of the presentations). And some very funny videos entered our video competition, with the winner implementing an HHPCC (Human HPCC): worth a watch if you're up for some light humor.
An interesting aspect of the conference was the contest on ECL Bundles, which got the competitive juices flowing and brought very good submissions from several community members both, internal to LexisNexis and external too. Being one of the judges proved quite difficult as it was hard to define a single winner and we ended up giving prices (neat iPad minis) to both, the winner and the runner up. We also ended up disqualifying a submission from a clever to-remain-unnamed contestant who decided that it was a good idea to submit work done for his/her regular job as an entry to this contest (btw, the specific piece of work is quite sophisticated and very useful, and you will see it permeate into the platform in our upcoming 4.0.2 release).
And speaking of contributions, as one of the ideas floated within the conference, there is an ongoing effort to create entries of ECL code samples for the Rosetta Code project, so if you have some time to kill and or a neat idea on how to implement one of the code examples in Rosetta Code in ECL, feel free to head over to their site and submit it to their side. These entries will surely be useful to people trying to get started in ECL and/or trying to learn new ECL coding tricks.
We'll continue to make some of the material used during the HPCC Summit 2013 publicly available over the next few weeks, but if you are particularly interested in something, please do not hesitate to ask.
After almost two years of continuous development, version 4.0 of the HPCC Systems platform has finally been released, with an impressive number of exciting new features and capabilities!
But the changes don't stop in the underlying platform, and face lifts have been given to the user interfaces, including a production release of the Eclipse IDE ECL plugin, and a technology preview of the next generation ECL Watch, using the latest web technologies to deliver a more streamlined and consistent user interface.
On the Roxie front, support for JSON queries (in addition to the existing SOAP access) and improvements to the Roxie "Packages" file management make it easier than ever to support Roxie queries.
And the new "ECL Bundles" functionality deserves an honorific mention: now it's very easy to package and distribute bundles of ECL capabilities (modules, functions, etc.) in a consistent manner, supporting encapsulation, versioning, dependencies, updates, licensing information and more. This should facilitate ECL code sharing both, internally and externally, and a public repository of these ECL bundles is already in the works.
4.0 just went out, and I'm already looking forward to what 5.0 will bring!
As the end of the 2012 calendar year approaches, at least for a good chunk of the world (and may come to an end on 12/21 for a crazy bunch), some people start celebrating holidays in different cultures and countries. I consider this season a good time to go over things to come in the HPCC Systems platform arena.
Our 3.8 release is out (3.8.6-4 can be downloaded from here) and, while there may still be minor bug fixes (no, there are never bugs, just "under-appreciated features" that we may want to get rid of), 3.10 is well in the works. 3.10, which should be released in the next few weeks, changes the open source license to Apache 2.0 and brings a number of enhancements to the platform in different areas, as you can see from the commit history in our GitHub source code repository.
4.0, the next major release, is already in the plans and will bring a number of exciting new features, including improvements to our ECL Playground (if you haven’t played with it, I strongly recommend it), significant improvements to ECL watch, support for very fast linear algebra through PB-BLAS (not to be confused with PBLAS) and some other interesting developments, such as more thorough integration with R and reporting tools through external connectors, improved SQL/JDBC connectivity on dynamic Roxie queries for interactive reporting tools, and improvements around documentation and usage examples.
In the next few weeks, also expect to see improvements to our Portal, with the addition of a Wiki for collaborative documentation and a description of our general HPCC Systems roadmap and ongoing projects, to help community members decide if they’d like to join any of these efforts. As part of this move, we are planning to include specific projects that could be good starters for some community members, so please let us know if you would like to tackle any of those.
During 2013, we will be actively working to continue raising awareness for the HPCC Systems platform, and will be specifically focused on community building activities (details coming later). And, from the Exciting Training Department, we are currently working on creating a significant amount of materials for our upcoming MOOC (Massive Open Online Courses), which will help you learn everything that you ever wanted to know about ECL, but were afraid to ask.
And now the shameless plug: Trish and I are tasked with the organization of the first Big Data track as part of the 2013 Symposium on Collaborative Technologies, in cooperation with ACM and IEEE, to happen in May, in San Diego, so please feel free, and a little bit compelled? :) to submit papers, present posters and let us know of any other way that you may want to help.
Now go and enjoy with your families and have a great holiday season! Happy Hacking!
I often get asked about comparing the HPCC Systems platform and Hadoop. As many of you probably know already, there are a number of substantial differences between them, and several of these differences are described here.
In a few words, HPCC and Hadoop are both open source projects released under an Apache 2.0 license, and are free to use, with both leveraging commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across this architecture. But this is where most of the similarities end.
From a timeline perspective, HPCC was originally designed and developed about 12 years ago (1999-2000); our first patent around HPCC technology was even filed back in 2002, and HPCC was in production across our systems back in 2002. To put things in perspective, it wasn’t until December 2004 that the two researchers from Google described the distributed computing model based on Map and Reduce. The Hadoop project didn’t start until 2005, if I remember correctly, and it was around 2006 when it split from Nutch to become its own top level project.
This doesn’t necessarily mean that you couldn’t say that certain HPCC operations don’t use an scatter and gather model (equivalent to Map and Reduce), as applicable, but HPCC was designed under a different paradigm to provide for a comprehensive and consistent high-level and concise declarative dataflow oriented programming model, represented by the ECL language used throughout it. What this really means, is that you can express data workflows and data queries in a very high level manner, avoiding the complexities of the underlying architecture of the system. While Hadoop has two scripting languages which allow for some abstractions (Pig and Hive), they don’t compare with the formal aspects, sophistication and maturity of the ECL language which provides for a number of benefits such as data and code encapsulation, the absence of side effects, the flexibility and extensibility through macros, functional macros and functions, and the libraries of production ready high level algorithms available.
One of the significant limitations of the strict MapReduce model utilized by Hadoop, is the fact that internode communication is left to the Shuffle phase, which makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute (as they need to go through multiple phases of Map, Shuffle and Reduce, each one of these representing a barrier operation that forces the serialization of the long tails of execution). In contrast, the HPCC Systems platform provide for direct inter-node communication at all times, which is leveraged by many of the high level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds for overhead from the JVM; in contrast, HPCC and ECL are compiled into C++, which executes natively on top of the Operating System, lending to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware).
The HPCC Systems platform, as you probably saw, has two components: a back-end batch oriented data workflow processing and analytics system called Thor (equivalent to Hadoop MapReduce), and a front-end real-time data querying and analytics system called Roxie (which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to store procedures in your traditional RDBMS). The closest to Roxie that you have with Hadoop is Hbase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie, on the other hand, allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.
But above all, the HPCC Systems platform presents the users with a homogeneous platform which is production ready and has been proven for many years in our own data services, from a company which has been in the Big Data Analytics business even before Big Data was called Big Data.
As I was preparing the Keynote that I delivered at World-Comp'12, about Machine Learning on the HPCC Systems platform, it occurred to me that it was important to remark that when dealing with big data and machine learning, most of the time and effort is usually spent on the data ETL (Extraction, Transformation and Loading) and feature extraction process, and not on the specific learning algorithm applied. The main reason is that while, for example, selecting a particular classifier over another could raise your F score by a few percentage points, not selecting the correct features, or failing to cleanse and normalize the data properly can decrease the overall effectiveness and increase the learning error dramatically.
This process can be especially challenging when the data used to train the model, in the case of supervised learning, or that needs to be subject to the clustering algorithm, in the case of, for example, a segmentation problem, is large. Profiling, parsing, cleansing, normalizing, standardizing and extracting features from large datasets can be extremely time consuming without the right tools. To make things worse, it can be very inefficient to move data during the process, just because the ETL portion is performed on a system different to the one executing the machine learning algorithms.
While all these operations can be parallelized across entire datasets to reduce the execution time, there don't seem to be many cohesive options available to the open source community. Most (or all) open source solutions tend to focus on one aspect of the process, and there are entire segments of it, such as data profiling, where there seem to be no options at all.
Fortunately, the HPCC Systems platform includes all these capabilities, together with a comprehensive data workflow management system. Dirty data ingested on Thor can be profiled, parsed, cleansed, normalized and standardized in place, using either ECL, or some of the higher level tools available, such as SALT (see this earlier post) and Pentaho Kettle (see this page). And the same tools provide for distributed feature extraction and several distributed machine learning algorithms, making the HPCC Systems platform the open source one stop shop for all your big data analytics needs.
If you want to know more, head over to our HPCC Systems Machine Learning page and take a look for yourself.
More than 12 years ago, back in 2000, LexisNexis was pushing the envelope on what could be done to process and analyze large amounts of data with commercially available solutions at the time. The overall data size, combined with the large number of records and the complexity of the processing required made existing solutions non-viable. As a result, LexisNexis invented, from the ground up, a data-intensive supercomputer based on a parallel share-nothing architecture running on commodity hardware, which ultimately became the HPCC Systems platform.
To put this in a time perspective, it wasn't until 2004 (several years later) that a pair of researchers from Google published a paper on the MapReduce processing model, which fueled Hadoop a few years later.
The HPCC Systems platform was originally designed, tested and refined to specifically address big data problems. It can perform complex processing of billions (or even trillions) of records, allowing users to run analytics in their entire data repository, without resorting to sampling and/or aggregates. Its real-time data delivery and analytics engine (Roxie) can handle thousands of simultaneous transactions, even on complex analytical models.
As part of the original design, the HPCC Systems platform can handle disparate data sources, with changing data formats, incomplete content, fuzzy matching and linking, etc., which are paramount to LexisNexis proprietary flagship linking technology known as LexID(sm).
But it is thanks to ECL, the high-level data-oriented declarative programming language powering the HPCC Systems platform, that this technology is truly unique. With advanced concepts such as data and code encapsulation, lazy evaluation, prevention of side effects, implicit parallelism and code reuse and extensibility, is that data scientists can focus on what needs to be done, rather on superfluous details around the specific implementation. These characteristics make the HPCC Systems platform significantly more efficient than anything else available in the marketplace.
Last June, almost a year ago, LexisNexis decided to release its supercomputing platform, under the HPCC Systems name, giving enterprises the benefit of an open source data intensive supercomputer that can solve large and complex data challenges. One year later, HPCC Systems has made a name for itself and built an impressive Community. Moreover, the HPCC Systems platform has been named one of the top five "start-ups" to watch and has been included in a recent Gartner 2012 Cool IT Vendors report.
LexisNexis has made an impact in the marketplace with its strategic decision to open source the HPCC Systems platform: a bold and innovative decision that can only arise from a Company which prides itself of being a thought leader, when it comes to Technology and Big Data analytics.
One of our community members recently asked about fraud detection using the HPCC Systems platform. The case that this person described involved identifying potentially fraudulent traders, who were performing a significant number of transactions over a relatively short time period. As I was responding to this post in our forums, and trying to keep the answer concise enough to fit in the forums format, I thought that it would be useful to have a slightly more extensive post, around ideas and concepts when designing an anomaly detection system on the HPCC Systems platform.
For this purpose I'll asume that, while it's possibly viable to come up with a certain number of rules to define how normal activity looks like even though the number of rules could be large, it's probably unfeasible to come up with rules that would describe every potential anomalous behavior (fraudsters can be very creative!). I will also assume that while, in certain cases, individual transactions could be flagged as anomalous due to characteristics in the particular data record, in the most common case, it is through aggregates and statistical deviations that an anomaly can be identified.
The first thing to define is the number of significant dimensions (or features) the data has. If there is one dimension (or very few dimensions), where most of the significant variability occurs, it could be conceivable to manually define rules that, for example, would mark transactions beyond 3 or 4 sigma (standard deviations from the mean for the particular dimension) as suspicious. Unfortunately, things are not always so simple.
Generally, there are multiple dimensions, and identifying by hand those that are the most relevant can be tricky. In other cases, performing some type of time series correlation can identify suspicious cases (for example, in the case of logs for a web application, seeing that someone has logged in from two locations a thousand miles apart in a short time frame could be a useful indicator). Fortunately, there are certain machine learning methodologies that can come to the rescue.
One way to tackle this problem is to assume that we can use historical data on good cases to train a statistical model (remember that bad cases are scarce and too variable). This is known as a semi-supervised learning technique, where you train your model only on the "normal" activity and expect to detect anomalous cases that exhibit characteristics which are different from the "norm". One specific method that can be used for this purpose is called PCA (Principal Components Analysis), which can automatically reduce the number of dimensions to those that present the largest significance (there is a loss of information as a consequence of this reduction, but this tends to be minimal compared to the value of reducing the computational complexity). KDA (Kernel Density Estimation) is another semi-supervised method to identify outliers. On the HPCC Systems Platform, PCA is supported through our ECL-ML machine learning module. KDA is currently available on HPCC through the ECL integration with Paperboat .
A possible more interesting approach is to use a completely unsupervised learning methodology. Using a clustering technique such as agglomerative hierarchical clustering, supported in HPCC, as part of the ECL-ML machine learning module, can help identify those events which don't clusterize easily. Other clustering method also available on ECL-ML, k-means, is less effective as it requires to define the number of centroids a priori, which could be very difficult. When using agglomerative hierarchical clustering, one of the aspects that could require some experimentation is to identify the number of iterations required to have the best effectiveness: too many iterations and there will be no outliers as all the data will be clusterized, too few iterations and many normal cases could still be outside of the clusters.
Beyond these specific techniques, the best possible approach probably includes a combination of methods. If there are clear rules that can quickly identify suspicious case, those could be used to validate or rule out results from statistical algorithms, and since a strictly rules based system would be ineffective to detect every possible outlier, using some of the machine learning methodologies described above too, would be highly recommended.
You probably thought that the HPCC Systems platform and Hadoop were two technologies that represented the opposite ends of a spectrum, and that choosing one would make attempting to use the other, unrealistic. If this is what you believed: think again (and keep reading).
The HPCC Systems platform has just released its Hadoop data integration connector. The HPCC/HDFS integration connector provides a way to seamlessly access data stored in your HDFS distributed filesystem from within the Thor component of HPCC. And, as an added bonus, it also allows you to write to HDFS from within Thor.
As you can see, this new feature enables several opportunities to leverage HPCC components from within your existing Hadoop cluster. One such application would be to plug the Roxie real-time distributed data analytics and delivery system, providing real time access to complex data queries and analytics, to data processed in your Hadoop cluster. It would also allow you to leverage the distributed machine learning and linear algebra libraries that the HPCC platform offers through its ECL-ML (ECL Machine Learning) module. And if you needed a highly efficient and highly reliable data workflow processing system, you could take advantage of the HPCC Systems platform and ECL, or even combine it with Pentaho Kettle/Spoon, to add a graphical interface to ETL and data integration.
So what does it take to use the HPCC/HDFS connector (or H2H, as we like to call it)? Not much! The H2H connector has been packaged to include all the necessary components, which are to be deployed to every HPCC node. HPCC can coexist with Hadoop, or run on a different set of nodes (which is normally recommended for performance reasons).
How did we do it? We leveraged the capabilities of ECL to pipe data in and out of a running workunit, through the ECLPipe command, and we created some clever ECL Macros (did I mention before that ECL Macros are awesome?) to provide for adequate data and function mappings from within an ECL program. Thanks to this, using H2H is transparent to the ECL software developer, and HDFS becomes just an option of a particular type of data repository.
What are the gotchas? Well, HDFS is not as efficient as the distributed filesystem used by HPCC, so this data read and write will not be any faster than HDFS allows (but it won't be sensibly slower either). Another caveat is that transparent access to compressed data (as it's normally provided by HPCC) is not available to data accessed from within HDFS (although decompression can be achieved easily in a following step, after the data is read).
I hope you are as excited as we are, about this HPCC/Hadoop data integration initiative. Please take a look at the H2H section of our HPCC Systems portal for more information: http://hpccsystems.com/H2H, and don't hesitate to send us your feedback. This HPCC/HDFS connector is still in beta stage, but we expect to have a 1.0 release very soon.
It is not uncommon to find situations where a classification model needs to be trained using a very large amount of historic data, but the ability to perform classification of new data in real time is required. There are many examples of this need, from real time sentiment analysis in tweets or news, to anomaly detection for fraud or fault identification. The common theme in all these cases is that the value of the real time data feeds has a steep decrease over time, and delayed decisions taken on this data are significantly less effective.
When faced with this challenge, traditional platforms tend to fall short of expectations. Those platforms that can deal with significant amounts of historical data and a very large number of features to create classification models (Hadoop is an example of such a platform), have no good option for real time classification using these models. This type of problems are quite common, for example, in text classification. In these cases, People usually need to resort to different tools, and even homegrown systems using Python and a myriad of other tools, to cope with this real time need.
The problem with these homegrown tools, is that they need to meet all the concurrency and availability requirements that real time systems impose, as these online systems are usually critical to fulfill important internal or external roles for the business (the one anomaly that you just missed because your real time classifier didn't work properly, could represent significant losses for the business).
What makes this even more challenging is the fact that, many times, it is desirable to retrieve and compare specific examples from the training set used to create the model, in real time too. And while developing a system that can classify data in real time using a pre-existing model may be quite doable, being able to also retrieve analogous or related cases would certainly require coupling the system with a database of sorts (just another moving part that adds complexity and cost to the system and potentially reduces its overall reliability).
But look no more, as the HPCC Systems platform may be just what you have been looking for all along: a consistent and homogeneous platform that provides for both functions, and a seamless workflow to move new and updated models, from the system where they are developed (Thor), to the real time classifier (Roxie).
At this point, it's probably worth explaining a little bit how Roxie works. Roxie is a distributed, highly concurrent and highly available, programmable data delivery system. Data queries (in a way equivalent to the stored procedures in your legacy RDBMS) are coded using ECL, which is the same high level data-oriented declarative programming language that powers Thor. Roxie is built for the most stringent high availability requirements, and the data and system redundancy factor is defined by the user at configuration time, with no single point of failure across the entire system. ECL code developed can be reused across both systems, Thor and Roxie.
A scenario like the one I described above, can be easily implemented in the HPCC Systems platform, using one of the classifiers provided by the ECL-ML (ECL Machine Learning) modules on Thor, and running your entire historical training set. To make this even more compelling, all the classifiers in ECL-ML have been designed with a common interface in mind, so plugging in a different classifier (for example, switching from a generative to a discriminative model) is as simple as changing a single line of ECL code. After a model (or several) is created, it can be tested on a test and/or verification set to validate it, and moved to Roxie for real time classification and matching. The entire training set can also be indexed and moved to Roxie, if real time retrieval of related records is required.
Powerful, simple, elegant, reliable. And every one of these components are available under an open source license, for you to play with.
For more information, head over to our HPCC Systems portal (http://hpccsystems.com).
At HPCC Systems we have been very busy finding better ways to communicate with our Community. As a result of this, we have just released the first edition of our official HPCC Systems podcast, in which the Host and our Program Manager, Trish McCall, has a conversation with our senior trainer Bob Foreman around different aspects of the HPCC Systems platform, the ECL data-intensive programming language and some other topics that we hope you will find interesting.
In upcoming editions, we plan on having guests (Hint, hint! Let us know if you would like to be one of them!) covering new developments and the roadmap for HPCC Systems, discussions on specific capabilities around Machine Learning and Natural Language Processing, some coverage on SALT, our Scalable Automated Linking Technology, and much more.
For this first edition, Trish and Bob tried hard to keep the content under 30 minutes, which is just about perfect for a medium sized commute.
Don't waste a minute and head over to our podcasts page, or find it in iTunes. Please send us feedback and don't forget to rate it in iTunes, if you like it.