HPCC Systems & Hadoop – a contrast of paradigms
I am often asked how the HPCC Systems platform compares with Hadoop. As many of you probably know already, there are a number of substantial differences between them, and several of them are described here.
In a few words, HPCC and Hadoop are both open source projects released under the Apache 2.0 license and are free to use. Both leverage commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across this architecture. But this is where most of the similarities end.
From a timeline perspective, HPCC was originally designed and developed about 12 years ago (1999-2000); our first patent around HPCC technology was filed back in 2002, and HPCC was already in production across our systems that same year. To put things in perspective, it wasn’t until December 2004 that two researchers from Google described the distributed computing model based on Map and Reduce. The Hadoop project didn’t start until 2005, if I remember correctly, and it was around 2006 when it split from Nutch to become its own project (it became a top-level Apache project in 2008).
This doesn’t mean that certain HPCC operations don’t use a scatter and gather model (equivalent to Map and Reduce) where applicable. But HPCC was designed under a different paradigm: a comprehensive, consistent and concise high-level declarative, dataflow-oriented programming model, represented by the ECL language used throughout the platform. What this really means is that you can express data workflows and data queries in a very high-level manner, avoiding the complexities of the underlying architecture of the system. While Hadoop has two scripting languages which allow for some abstractions (Pig and Hive), they don’t compare with the formal aspects, sophistication and maturity of the ECL language, which provides a number of benefits such as data and code encapsulation, the absence of side effects, flexibility and extensibility through macros, functional macros and functions, and libraries of production-ready high-level algorithms.
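To make the Map and Reduce comparison concrete, here is a minimal single-process Python sketch of the three phases (map, shuffle, reduce) applied to a word count. It is neither ECL nor Hadoop code; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Scatter: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group all values by key, as a shuffle would do across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Gather: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three phases; each one consumes the previous one's output."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Illustrative input only.
counts = word_count(["big data", "Big Data analytics"])
```

Even in this toy version the programmer has to spell out the three phases by hand; in a declarative model you would state the desired result and leave the distribution strategy to the platform.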
One of the significant limitations of the strict MapReduce model utilized by Hadoop is that internode communication is left to the Shuffle phase. This makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute: they need to go through multiple phases of Map, Shuffle and Reduce, and each of these is a barrier operation, so every phase must wait for its slowest tasks (the long tail of execution) before the next one can start. In contrast, the HPCC Systems platform provides for direct internode communication at all times, which is leveraged by many of the high-level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds overhead from the JVM. In contrast, ECL is compiled into C++ and, like the rest of the HPCC platform, runs as native code on top of the operating system, leading to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware).
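The cost of those barriers can be sketched with a little arithmetic. The Python below is a simplified model, not a benchmark: the task durations are invented for illustration, and `time_without_barriers` is an idealized lower bound that assumes each node can move on as soon as its own work is done, ignoring data dependencies between nodes. Real systems fall somewhere between the two figures.

```python
def time_with_barriers(phase_durations):
    """Each phase ends only when its slowest task ends, so the tails add up."""
    return sum(max(tasks) for tasks in phase_durations)


def time_without_barriers(phase_durations):
    """Idealized lower bound: each node runs its own tasks back to back."""
    per_node_totals = [sum(node_tasks) for node_tasks in zip(*phase_durations)]
    return max(per_node_totals)


# Hypothetical task durations in seconds: one row per phase, one column per node.
durations = [
    [10, 10, 30],  # Map: the third node is the straggler
    [25, 5, 5],    # Shuffle: the first node is the straggler
    [10, 40, 10],  # Reduce: the second node is the straggler
]

barrier_total = time_with_barriers(durations)        # 30 + 25 + 40
pipelined_total = time_without_barriers(durations)   # max(45, 55, 45)
```

With a different straggler in each phase, the barrier model pays for all three long tails, while the idealized model pays only for the busiest node. An iterative algorithm that repeats the Map, Shuffle and Reduce cycle many times pays that penalty on every iteration.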
The HPCC Systems platform, as you have probably seen, has two components: a back-end, batch-oriented data workflow processing and analytics system called Thor (equivalent to Hadoop MapReduce), and a front-end, real-time data querying and analytics system called Roxie (which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to stored procedures in your traditional RDBMS). The closest to Roxie that you have with Hadoop is HBase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie, on the other hand, allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.
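The difference between the two retrieval styles can be illustrated with a small Python sketch. This is not how Roxie or HBase is implemented and uses neither system's API; the records, the `kv_get`, `kv_prefix` and `total_for` helpers, and the key layout are all hypothetical. It only contrasts lookups by exact or partial key with a parameterized query that range-scans a compound key and aggregates the result.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records: (state, city, year, amount).
records = [
    ("FL", "Miami", 2010, 120),
    ("FL", "Miami", 2011, 80),
    ("FL", "Tampa", 2011, 50),
    ("GA", "Atlanta", 2011, 200),
]

# Key/value style: one opaque string key per record.
kv_store = {f"{s}|{c}|{y}": amount for s, c, y, amount in records}


def kv_get(key):
    """Exact-key lookup."""
    return kv_store.get(key)


def kv_prefix(prefix):
    """Partial-key lookup: values whose key starts with the given prefix."""
    return [v for k, v in sorted(kv_store.items()) if k.startswith(prefix)]


# Compound-key style: a sorted index on (state, city, year).
index = sorted(records)


def total_for(state, city, first_year, last_year):
    """Parameterized query: range-scan the compound key, then aggregate."""
    lo = bisect_left(index, (state, city, first_year))
    hi = bisect_right(index, (state, city, last_year, float("inf")))
    return sum(amount for _, _, _, amount in index[lo:hi])
```

For example, `kv_get("FL|Miami|2011")` fetches a single value, while `total_for("FL", "Miami", 2010, 2011)` filters on three key fields and sums the matches in one call, which is closer in spirit to a stored procedure than to a key lookup.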
But above all, the HPCC Systems platform presents users with a homogeneous platform which is production ready and has been proven for many years in our own data services, from a company which has been in the Big Data analytics business since before Big Data was called Big Data.