|Hardware Type
||Processing clusters built from commodity off-the-shelf (COTS) hardware: typically rack-mounted blade servers with Intel or AMD processors and local memory and disk, connected by a high-speed communications switch (usually Gigabit Ethernet) or a hierarchy of switches, depending on the total size of the cluster. Clusters are usually homogeneous (all nodes are configured identically), but this is not a requirement.
|Operating Systems
||Linux and Windows
|System Configurations
||HPCC clusters can be implemented in two configurations: the Data Refinery (Thor) is analogous to the Hadoop MapReduce cluster, while the Data Delivery Engine (Roxie) provides separate high-performance online query processing and data warehouse capabilities. Both configurations also function as distributed file systems, but they are implemented differently based on the intended use to improve performance. HPCC environments typically consist of multiple clusters of both configuration types. Although the file systems on each cluster are independent, a cluster can access files within a file system on any other cluster in the same environment.
||Hadoop system software implements clusters using the MapReduce processing paradigm. The cluster also functions as a distributed file system running HDFS. Other capabilities, including HBase, Hive, and others, are layered on top of the Hadoop MapReduce and HDFS system software.
|Licensing & Maintenance Cost
||The Community Edition is free. Enterprise license fees currently depend on the size and type of system configuration.
||Free, although paid maintenance offerings are available from multiple vendors.
|Core Software
||For a Thor configuration, core software includes the operating system and various services installed on each node of the cluster to provide job execution and distributed file system access. A separate server called the Dali server provides file system name services and manages workunits for jobs in the HPCC environment. A Thor cluster is also configured with a master node and multiple slave nodes. A Roxie cluster is a peer-coupled cluster in which each node runs Server and Agent tasks for query execution and key and file processing. The file system on a Roxie cluster uses a distributed B+Tree to store index and data and provides keyed access to the data. Additional middleware components are required for the operation of Thor and Roxie clusters.
||Core software includes the operating system, Hadoop MapReduce cluster software, and HDFS software. Each slave node runs a TaskTracker service and a DataNode service. A master node runs a JobTracker service, which can be configured on a separate hardware node or on one of the slave hardware nodes. Likewise, for HDFS, a master NameNode service is required to provide name services and can be run on one of the slave nodes or on a separate node.
|Middleware Components
||Middleware components include an ECL code repository implemented on a MySQL server, an ECLServer for compiling ECL programs and queries, an ECLAgent acting on behalf of a client program to manage the execution of a job on a Thor cluster, an ESPServer (Enterprise Services Platform) providing authentication, logging, security, and other services for the job execution and Web services environment, and the Dali server, which functions as the system data store for job workunit information and provides naming services for the distributed file systems. Flexibility exists for running the middleware components on one to several nodes, and multiple copies of these servers can provide redundancy and improve performance.
||None. Client software can submit jobs directly to the JobTracker on the master node of the cluster. A Hadoop Workflow Scheduler (HWS), which will run as a server, is currently under development to manage jobs that require multiple MapReduce sequences.
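As a sketch of this direct-submission model, a client builds a JobConf and hands it straight to the JobTracker via JobClient, using the classic org.apache.hadoop.mapred API of the JobTracker era; the input and output paths are hypothetical, and the pass-through IdentityMapper/IdentityReducer stand in for a real job's Map and Reduce classes:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("identity-copy");
        // Pass-through mapper/reducer from the standard library; a real job
        // would plug in its own Map and Reduce implementations.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/input"));   // hypothetical paths
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        // No middleware: the client submits straight to the JobTracker named
        // in the cluster configuration and blocks until the job completes.
        JobClient.runJob(conf);
      }
    }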
|System Tools
||HPCC includes a suite of client and operation tools for managing, maintaining, and monitoring HPCC configurations and environments. These include ECL IDE (the program development environment), an Attribute Migration Tool, a Distributed File Utility (DFU), an Environment Configuration Utility, and a Roxie Configuration Utility. Command-line versions are also available. ECLWatch is a Web-based utility program for monitoring the HPCC environment that includes queue management, distributed file system management, job monitoring, and system performance monitoring tools. Additional tools are provided through Web services interfaces.
||The dfsadmin tool provides information about the state of the file system; fsck is a utility for checking the health of files in HDFS; the DataNode block scanner periodically verifies all the blocks stored on a DataNode; and the balancer redistributes blocks from over-utilized DataNodes to under-utilized ones as needed. The MapReduce Web UI includes the JobTracker page, which displays information about running and completed jobs; drilling down on a specific job displays detailed information about that job. There is also a Tasks page that displays information about Map and Reduce tasks.
|Ease of Deployment
||Deployment is managed through an environment configuration tool. A Genesis server provides a central repository to distribute OS-level settings, services, and binaries to all net-booted nodes in a configuration.
||Assisted by wizard-based online tools provided by a 3rd party; a manual installation is otherwise required.
|Distributed File System
||The Thor DFS is record-oriented and uses the local Linux file system to store file parts. Files are initially loaded (sprayed) across nodes, and each node has a single file part, which can be empty, for each distributed file. Files are divided on even record/document boundaries specified by the user. The DFS uses a Master/Slave architecture with name services and file mapping information stored on a separate server. Only one local file per node is required to represent a distributed file. Read/write access is supported between clusters configured in the same environment. Special adapters allow files from external databases such as MySQL to be accessed, allowing transactional data to be integrated with DFS data and incorporated into batch jobs. The Roxie DFS utilizes distributed B+Tree index files containing key information and data stored in local files on each node.
||Block-oriented; most installations use large blocks of 64MB or 128MB. Blocks are stored as independent units/local files in the local Unix/Linux file system for the node. Metadata information for each block is stored in a separate file. Master/Slave architecture with a single NameNode providing name services and block mapping, and multiple DataNodes. Files are divided into blocks and spread across nodes in the cluster. Multiple local files (one containing the block, one containing its metadata) are required for each logical block stored on a node to represent a distributed file.
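A minimal sketch of how this block structure looks from the client side: the API exposes one continuous stream per file, and the NameNode/DataNode block machinery stays hidden. The dfs.block.size key is the pre-YARN configuration name of this Hadoop era, and the file path is hypothetical:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size applies to newly written files: 128 MB here.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);  // resolves to HDFS via the cluster config
        InputStream in = fs.open(new Path("/data/example.txt"));  // hypothetical file
        try {
          // The stream crosses block boundaries transparently; the client never
          // sees the per-block local files or their separate metadata files.
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }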
|Fault Resilience
||The DFS for Thor and Roxie stores replicas of file parts on other nodes (configurable) to protect against disk and node failure. The Thor system offers either automatic or manual node swap and warm start following a node failure; jobs are restarted from the last checkpoint or persisted intermediate result. Replicas are automatically used while copying data to the new node. The Roxie system continues running following a node failure, with a reduced number of nodes.
||HDFS stores multiple replicas (user-specified) of data blocks on other nodes (configurable) to protect against disk and node failure, with automatic recovery. The MapReduce architecture includes speculative execution: when a slow or failed Map task is detected, additional Map tasks are started to recover from node failures.
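A short sketch of how these resilience knobs surface in client code; the configuration keys are the pre-YARN names of this Hadoop era, and the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ResilienceConfig {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);  // replicas for newly created files
        // Speculative execution launches backup attempts for slow Map tasks
        // (on by default; set here only to make the knob visible).
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing, hypothetical file so it
        // survives more simultaneous disk or node failures.
        fs.setReplication(new Path("/data/important.dat"), (short) 5);
      }
    }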
|Job Execution Environment
||Thor utilizes a Master/Slave processing architecture. Processing steps defined in an ECL job can specify local operation (data processed separately on each node) or global operation (data processed across all nodes). Multiple processing steps for a procedure are executed automatically as part of a single job, based on an optimized execution graph for a compiled ECL dataflow program. A single Thor cluster can be configured to run multiple jobs concurrently, reducing latency if adequate CPU and memory resources are available on each node. Middleware components including the ECLAgent, ECLServer, and Dali server provide the client interface and manage execution of the job, which is packaged as a workunit. Roxie utilizes a multiple Server/Agent architecture to process ECL programs accessed by queries, with Server tasks acting as a manager for each query and multiple Agent tasks retrieving and processing data for the query as needed.
||Uses the MapReduce processing paradigm with input data in key-value pairs. Master/Slave processing architecture: a JobTracker runs on the master node, and a TaskTracker runs on each of the slave nodes. Map tasks are assigned to input splits of the input file, usually one per block; the number of Reduce tasks is assigned by the user. Map processing is local to the assigned node. A shuffle and sort operation is performed following the Map phase to distribute and sort key-value pairs to Reduce tasks based on key regions, so that pairs with identical keys are processed by the same Reduce task. Multiple MapReduce processing steps are typically required for most procedures and must be sequenced and chained separately by the user or by a language such as Pig.
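As an illustrative sketch of the paradigm, consider a word-count job in the classic org.apache.hadoop.mapred API: the Map emits (word, 1) pairs, the shuffle and sort groups identical keys, and each Reduce task sums one key region. The input and output paths are hypothetical:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Map runs locally against one input split, usually one HDFS block.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            out.collect(word, ONE);          // emit the key-value pair (word, 1)
          }
        }
      }

      // The shuffle and sort delivers all pairs with the same key to one Reduce.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setNumReduceTasks(4);           // the user chooses the Reduce count
        FileInputFormat.setInputPaths(conf, new Path("/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        JobClient.runJob(conf);
      }
    }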
|Programming Languages
||ECL is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and Roxie platforms. ECL can include inline C++ code encapsulated in functions. External services can be written in any language and compiled into shared libraries of functions callable from ECL. A Pipe interface allows the execution of external programs written in any language to be incorporated into jobs.
||Hadoop MapReduce jobs are usually written in Java. Other languages are supported through a streaming or pipe interface. Other processing environments, such as HBase and Hive, execute on top of Hadoop MapReduce and have their own language interfaces. The Pig Latin language and Pig execution environment provide a high-level dataflow language which is then mapped into multiple Java MapReduce jobs.
|Integrated Program Development Environment
||The HPCC platform is provided with ECL IDE, a comprehensive IDE specifically for the ECL language. ECL IDE provides access to shared source code repositories and a complete development and testing environment for developing ECL dataflow programs. Access to the ECLWatch tool is built in, allowing developers to watch job graphs as they execute. Access to current and historical job workunits allows developers to easily compare results from one job to the next during development cycles.
||Hadoop MapReduce utilizes the Java programming language, and there are several excellent program development environments for Java, including NetBeans and Eclipse, which offer plug-ins for access to Hadoop clusters. The Pig environment does not have its own IDE, but instead uses Eclipse and other editing environments for syntax checking. A PigPen add-in for Eclipse provides access to Hadoop clusters to run Pig programs, along with additional development capabilities.
|Database Capabilities
||The HPCC platform includes the capability to build multi-key, multi-field (aka compound) indexes on DFS files. These indexes can be used to improve performance and provide keyed access for batch jobs on a Thor system, or to support the development of queries deployed to Roxie systems. Keyed access to data is supported directly in the ECL language.
||The basic Hadoop MapReduce system does not provide any keyed-access indexed database capabilities. An add-on system for Hadoop called HBase provides a column-oriented database capability with keyed access. A custom script language and a Java interface are provided. Access to HBase is not directly supported by the Pig environment and requires user-defined functions or separate MapReduce procedures.
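A brief sketch of HBase's Java interface for keyed access, using the 0.9x-era client API; the table name, column family, and row key are hypothetical, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyedAccess {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "people");     // assumes a 'people' table
        // Write one cell under row key "row-42" in column family "info".
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
        table.put(put);
        // Keyed read: fetch the row directly by its key, no scan required.
        Result result = table.get(new Get(Bytes.toBytes("row-42")));
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));
        table.close();
      }
    }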
|Online Query and Data Warehouse Capabilities
||The Roxie system configuration in the HPCC platform is specifically designed to
provide data warehouse capabilities for structured queries and
data analysis applications. Roxie is a high-performance platform
capable of supporting thousands of users and providing sub-second
response time depending on the application.
||The basic Hadoop MapReduce system does not provide any data warehouse
capabilities. An add-on system for Hadoop called Hive provides
data warehouse capabilities and allows HDFS data to be loaded
into tables and accessed with an SQL-like language. Access to
Hive is not directly supported by the Pig environment and
requires user-defined functions or separate MapReduce procedures.
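A small sketch of querying Hive's SQL-like language from Java over JDBC; the driver class and connection URL follow the HiveServer conventions of this era and may differ by installation, and the logs table is hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Register Hive's JDBC driver (HiveServer1-era class name).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // HiveQL: SQL-like syntax compiled down to MapReduce jobs over HDFS data.
        ResultSet rs = stmt.executeQuery(
            "SELECT page, COUNT(1) FROM logs GROUP BY page");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }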
|Scalability
||One to several thousand nodes. In practice, HPCC configurations require significantly fewer nodes to provide the same processing performance as a Hadoop cluster. Sizing of clusters may depend, however, on the overall storage requirements for the distributed file system.
||One to thousands of nodes.
|Performance
||The HPCC platform has demonstrated sorting 1 TB on a high-performance 400-node system in 102 seconds. In a recent head-to-head benchmark against Hadoop on another 400-node system, HPCC completed in 6 minutes 27 seconds while Hadoop took 25 minutes 28 seconds. This result on the same hardware configuration showed HPCC to be 3.95 times faster than Hadoop for this benchmark.
||Currently the only available standard performance benchmarks are the sort benchmarks sponsored by http://sortbenchmark.org. Yahoo! has demonstrated sorting 1 TB on 1460 nodes in 62 seconds, 100 TB using 3452 nodes in 173 minutes, and 1 PB using 3658 nodes in 975 minutes.
|Training
||Basic and advanced training classes on ECL programming are offered monthly in several locations or can be conducted on customer premises. Classes are also available online at http://learn.lexisnexis.com/hpcc. A system administration class is also offered and is scheduled as needed. A free HPCC VM image with a complete HPCC and ECL learning environment, usable on a single PC or laptop, is also available.
||Hadoop training is offered through 3rd parties. Both beginning and advanced classes are provided; the advanced class covers Hadoop add-ons including HBase and Pig. Another 3rd party provides a VMware-based learning environment which can be used on a standard laptop or PC. Online tutorials are also available.