HPCC vs Hadoop Detailed Comparison


> The Road from Pig to ECL: The PigMix Benchmark on HPCC
> Four Key Factors Differentiating HPCC from Hadoop
> Why HPCC is a Superior Alternative to Hadoop

Read how the HPCC Platform compares to Hadoop

Hardware Type
Processing clusters using commodity off-the-shelf (COTS) hardware. Typically rack-mounted blade servers with Intel or AMD processors, local memory, and disk, connected to a high-speed communications switch (usually Gigabit Ethernet) or a hierarchy of switches depending on the total size of the cluster. Clusters are usually homogeneous (all processors are configured identically), but this is not a requirement.
Operating System
Linux and Windows
System Configurations
HPCC: HPCC clusters can be implemented in two configurations: the Data Refinery (Thor), analogous to the Hadoop MapReduce cluster, and the Data Delivery Engine (Roxie), which provides separate high-performance online query processing and data warehouse capabilities. Both configurations also function as distributed file systems, but are implemented differently based on the intended use to improve performance. HPCC environments typically consist of multiple clusters of both configuration types. Although the file system on each cluster is independent, a cluster can access files within a file system on any other cluster in the same environment.
Hadoop: Hadoop system software implements clusters using the MapReduce processing paradigm. The cluster also functions as a distributed file system running HDFS. Other capabilities are layered on top of the Hadoop MapReduce and HDFS system software, including HBase, Hive, etc.
Licensing & Maintenance Cost
HPCC: The Community Edition is free. Enterprise license fees currently depend on the size and type of system configuration.
Hadoop: Free, although paid maintenance offerings are available from multiple vendors.
Core Software
HPCC: For a Thor configuration, core software includes the operating system and various services installed on each node of the cluster to provide job execution and distributed file system access. A separate server, the Dali server, provides file system name services and manages workunits for jobs in the HPCC environment. A Thor cluster is configured with a master node and multiple slave nodes. A Roxie cluster is a peer-coupled cluster where each node runs Server and Agent tasks for query execution and key and file processing. The file system on a Roxie cluster uses a distributed B+Tree to store indexes and data and provides keyed access to the data. Additional middleware components are required for the operation of Thor and Roxie clusters.
Hadoop: Core software includes the operating system, the Hadoop MapReduce cluster software, and HDFS. Each slave node runs a Tasktracker service and a Datanode service. A master node runs a Jobtracker service, which can be configured on a separate hardware node or on one of the slave hardware nodes. Likewise, HDFS requires a master Namenode service to provide name services, which can run on one of the slave nodes or on a separate node.
Middleware Components
HPCC: Middleware components include an ECL code repository implemented on a MySQL server; an ECLServer for compiling ECL programs and queries; an ECLAgent, acting on behalf of a client program, which manages the execution of a job on a Thor cluster; an ESPServer (Enterprise Services Platform) providing authentication, logging, security, and other services for the job execution and Web services environment; and the Dali server, which functions as the system data store for job workunit information and provides naming services for the distributed file systems. The middleware components can run on anywhere from one to several nodes, and multiple copies of these servers can provide redundancy and improve performance.
Hadoop: None. Client software can submit jobs directly to the Jobtracker on the master node of the cluster. A Hadoop Workflow Scheduler (HWS), which will run as a server, is currently under development to manage jobs that require multiple MapReduce sequences.
System Tools
HPCC: HPCC includes a suite of client and operations tools for managing, maintaining, and monitoring HPCC configurations and environments. These include the ECL IDE program development environment, an Attribute Migration Tool, the Distributed File Utility (DFU), an Environment Configuration Utility, and a Roxie Configuration Utility; command-line versions are also available. ECLWatch is a Web-based utility for monitoring the HPCC environment that includes queue management, distributed file system management, job monitoring, and system performance monitoring tools. Additional tools are provided through Web services interfaces.
Hadoop: The dfsadmin tool provides information about the state of the file system; fsck is a utility for checking the health of files in HDFS; the Datanode block scanner periodically verifies all the blocks stored on a Datanode; and the balancer redistributes blocks from over-utilized Datanodes to under-utilized ones as needed. The MapReduce Web UI includes a Jobtracker page that displays information about running and completed jobs; drilling down into a specific job shows detailed information about that job. A Tasks page displays information about Map and Reduce tasks.
Ease of Deployment
HPCC: An environment configuration tool is provided. A Genesis server provides a central repository to distribute OS-level settings, services, and binaries to all net-booted nodes in a configuration.
Hadoop: Deployment is assisted by online tools provided by third parties using wizards; otherwise it requires manual RPM deployment.
Distributed File System
HPCC: The Thor DFS is record-oriented and uses the local Linux file system to store file parts. Files are initially loaded (sprayed) across nodes, and each node has a single file part, which can be empty, for each distributed file. Files are divided on even record/document boundaries specified by the user. Thor uses a master/slave architecture with name services and file mapping information stored on a separate server. Only one local file per node is required to represent a distributed file. Read/write access is supported between clusters configured in the same environment. Special adapters allow files from external databases such as MySQL to be accessed, so that transactional data can be integrated with DFS data and incorporated into batch jobs. The Roxie DFS uses distributed B+Tree index files containing key information and data stored in local files on each node.
Hadoop: HDFS is block-oriented and uses large blocks, 64 MB or 128 MB in most installations. Blocks are stored as independent units/local files in the local Unix/Linux file system of each node. Metadata for each block is stored in a separate file. HDFS uses a master/slave architecture with a single Namenode providing name services and block mapping, and multiple Datanodes. Files are divided into blocks and spread across nodes in the cluster. Multiple local files (one containing the block, one containing its metadata) are required on a node for each logical block of a distributed file.
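The contrast between record-oriented and block-oriented distribution can be sketched in a few lines. This is a conceptual illustration only, not HPCC or Hadoop code; the function names and sizes are illustrative assumptions.

```python
# Record-oriented distribution (Thor-style "spray"): records are divided
# on record boundaries, one file part per node, and a record is never split.
def spray_records(records, num_nodes):
    per_node = -(-len(records) // num_nodes)  # ceiling division
    parts = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        parts[i // per_node].append(rec)
    return parts  # some parts may be empty, as the text notes

# Block-oriented distribution (HDFS-style): the byte stream is cut into
# fixed-size blocks regardless of content, so a record can span blocks.
def block_split(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

print(spray_records([f"rec{i}" for i in range(10)], 4))
print(block_split(b"A" * 10, 4))  # the last block is a partial block
```

The practical consequence is that a record-oriented reader can process each file part independently, while a block-oriented reader must handle records that straddle block boundaries.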
Fault Resilience
HPCC: The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. The Thor system offers either automatic or manual node swap and warm start following a node failure; jobs are restarted from the last checkpoint or persist. Replicas are automatically used while data is copied to the new node. The Roxie system continues running following a node failure, with a reduced number of nodes.
Hadoop: HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery. The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
Job Execution Environment
HPCC: Thor uses a master/slave processing architecture. Processing steps defined in an ECL job can specify local operation (data is processed separately on each node) or global operation (data is processed across all nodes). Multiple processing steps for a procedure are executed automatically as part of a single job, based on an optimized execution graph for a compiled ECL dataflow program. A single Thor cluster can be configured to run multiple jobs concurrently, reducing latency if adequate CPU and memory resources are available on each node. Middleware components including the ECLAgent, ECLServer, and Dali server provide the client interface and manage execution of the job, which is packaged as a workunit. Roxie uses a multiple Server/Agent architecture to process ECL programs accessed by queries, with Server tasks acting as a manager for each query and multiple Agent tasks as needed to retrieve and process the data for the query.
Hadoop: Hadoop uses the MapReduce processing paradigm, with input data in key-value pairs and a master/slave processing architecture. A Jobtracker runs on the master node, and a Tasktracker runs on each of the slave nodes. Map tasks are assigned to input splits of the input file, usually one per block. The number of Reduce tasks is assigned by the user. Map processing is local to the assigned node. Following the Map phase, a shuffle and sort operation distributes and sorts the key-value pairs to Reduce tasks based on key regions, so that pairs with identical keys are processed by the same Reduce task. Most procedures require multiple MapReduce processing steps, which must be sequenced and chained separately by the user or by a language such as Pig.
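The map, shuffle/sort, and reduce phases described above can be simulated in memory with the classic word-count example. This is a minimal sketch of the paradigm only; real Hadoop jobs are distributed Java programs.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit one (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def shuffle_sort(pairs):
    # Sort by key and group, so pairs with identical keys
    # reach the same reduce call.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(key, values):
    # Sum the counts for one key.
    return key, sum(v for _, v in values)

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle_sort(pairs))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In ECL, by contrast, an equivalent computation would be expressed declaratively (for example with TABLE and COUNT), and the compiler would plan the distributed steps from the dataflow graph.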
Programming Languages
HPCC: ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and Roxie platforms. ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL. A Pipe interface allows external programs written in any language to be incorporated into jobs.
Hadoop: Hadoop MapReduce jobs are usually written in Java. Other languages are supported through a streaming or pipe interface. Other processing environments, such as HBase and Hive, execute on top of Hadoop MapReduce and have their own language interfaces. The Pig Latin language and Pig execution environment provide a high-level dataflow language that is mapped into multiple Java MapReduce jobs.
Integrated Program Development Environment
HPCC: The HPCC platform is provided with ECL IDE, a comprehensive IDE specifically for the ECL language. ECL IDE provides access to shared source code repositories and a complete development and testing environment for ECL dataflow programs. Access to the ECLWatch tool is built in, allowing developers to watch job graphs as they execute. Access to current and historical job workunits lets developers easily compare results from one job to the next during development cycles.
Hadoop: Hadoop MapReduce uses the Java programming language, and there are several excellent program development environments for Java, including NetBeans and Eclipse, which offer plug-ins for access to Hadoop clusters. The Pig environment does not have its own IDE, but instead uses Eclipse and other editing environments for syntax checking. A PigPen add-in for Eclipse provides access to Hadoop clusters to run Pig programs, along with additional development capabilities.
Database Capabilities
HPCC: The HPCC platform includes the capability to build multi-key, multi-field (compound) indexes on DFS files. These indexes can be used to improve performance and provide keyed access for batch jobs on a Thor system, or to support development of queries deployed to Roxie systems. Keyed access to data is supported directly in the ECL language.
Hadoop: The basic Hadoop MapReduce system does not provide any keyed-access indexed database capabilities. An add-on system for Hadoop called HBase provides a column-oriented database capability with keyed access. A custom script language and a Java interface are provided. Access to HBase is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
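The benefit of keyed access is that a sorted index (such as Roxie's distributed B+Tree index files) turns a full scan into a logarithmic lookup. The sketch below uses a plain sorted list and binary search as a stand-in; the names are illustrative assumptions, not HPCC APIs.

```python
import bisect

# Unordered records, as they might sit in file parts on disk.
records = [("ID003", "Carol"), ("ID001", "Alice"), ("ID002", "Bob")]

# Build a sorted (key, record-position) index once, up front.
index = sorted((key, pos) for pos, (key, _) in enumerate(records))
keys = [k for k, _ in index]

def keyed_lookup(key):
    # Binary search the index instead of scanning every record.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[index[i][1]]
    return None

print(keyed_lookup("ID002"))  # ('ID002', 'Bob')
```

A compound (multi-field) index works the same way, with tuples of fields as the sort key, which is why it can also serve range queries on leading fields.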
Online Query and Data Warehouse Capabilities
HPCC: The Roxie system configuration in the HPCC platform is specifically designed to provide data warehouse capabilities for structured queries and data analysis applications. Roxie is a high-performance platform capable of supporting thousands of users and providing sub-second response times, depending on the application.
Hadoop: The basic Hadoop MapReduce system does not provide any data warehouse capabilities. An add-on system for Hadoop called Hive provides data warehouse capabilities and allows HDFS data to be loaded into tables and accessed with an SQL-like language. Access to Hive is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
Scalability
HPCC: One to several thousand nodes. In practice, HPCC configurations require significantly fewer nodes to provide the same processing performance as a Hadoop cluster, although cluster sizing may depend on the overall storage requirements of the distributed file system.
Hadoop: One to thousands of nodes.
Performance
HPCC: The HPCC platform has demonstrated sorting 1 TB on a high-performance 400-node system in 102 seconds. In a recent head-to-head benchmark against Hadoop on another 400-node system, HPCC finished in 6 minutes 27 seconds versus 25 minutes 28 seconds for Hadoop, making HPCC 3.95 times faster than Hadoop on the same hardware configuration for this benchmark.
Hadoop: Currently the only available standard performance benchmarks are the sort benchmarks sponsored by http://sortbenchmark.org. Yahoo! has demonstrated sorting 1 TB on 1,460 nodes in 62 seconds, 100 TB using 3,452 nodes in 173 minutes, and 1 PB using 3,658 nodes in 975 minutes.
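The speedup figure quoted above follows directly from the two runtimes, which can be checked with a few lines of arithmetic:

```python
# Head-to-head sort benchmark on the same 400-node configuration,
# as reported above.
hpcc_seconds = 6 * 60 + 27     # 6 min 27 s = 387 s
hadoop_seconds = 25 * 60 + 28  # 25 min 28 s = 1528 s

speedup = hadoop_seconds / hpcc_seconds
print(round(speedup, 2))  # 3.95
```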
Training
HPCC: Basic and advanced training classes on ECL programming are offered monthly in several locations or can be conducted on customer premises; classes are also available online at http://learn.lexisnexis.com/hpcc. A system administration class is also offered and scheduled as needed. A free HPCC VM image, with a complete HPCC and ECL learning environment that can be used on a single PC or laptop, is also available.
Hadoop: Hadoop training is offered through third parties, with both beginning and advanced classes; the advanced class covers Hadoop add-ons, including HBase and Pig. Another third party provides a VMware-based learning environment that can be used on a standard laptop or PC. Online tutorials are also available.