

How to prove HPCC is truly parallel?

Topics related to the Hadoop Connector or migrating data from Hadoop

Thu Dec 08, 2011 1:53 pm

This question was submitted by a community member and is a great topic to add to this forum.


The Beyond MapReduce section, http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop/Components#beyondmapreduce, includes the following description:

Truly parallel: Unlike Hadoop, nodes of a datagraph can be processed in parallel as data seamlessly flows through them. In Hadoop MapReduce (Java, Pig, Hive, Cascading, etc.) almost every complex data transformation requires a series of MapReduce cycles; each of the phases for these cycles cannot be started until the previous phase has completed for every record, which contributes to the well-known “long tail problem” in Hadoop. HPCC effectively avoids this, which effectively results in higher and predictable performance.

It tells us that almost every complex data transformation requires a series of MapReduce cycles, but it doesn't say how HPCC avoids this issue.

Would you explain how HPCC avoids this issue with complex data transformations? Is there a diagram, like the ones for Hadoop MapReduce, showing HPCC's process flow? Is there an example comparing the two?
admin

Thu Dec 08, 2011 1:57 pm

The fundamental design concepts in HPCC are not based on the MapReduce paradigm postulated by Google in 2004. As a matter of fact, HPCC predates that paper by several years.

The idea behind the way data workflows are architected in HPCC is based on high-level data primitives (SORT, PROJECT, DISTRIBUTE, JOIN, etc.), exposed through the ECL language, and a powerful optimizer which, at ECL compile time, determines how these operations can be parallelized during execution and what the execution strategy should be to achieve the highest performance on the system.
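
For example, here is a minimal ECL sketch of these primitives in action (the logical filename and record layout are made up for illustration). The programmer declares what should happen to the data; the compiler and optimizer decide how and where it executes:

    // Hypothetical record layout and logical file name.
    PersonRec := RECORD
        STRING30  firstName;
        STRING30  lastName;
        UNSIGNED1 age;
    END;

    people := DATASET('~examples::people', PersonRec, THOR);

    adults      := people(age >= 18);                    // filter
    byLastName  := DISTRIBUTE(adults, HASH32(lastName)); // spread records across nodes
    sortedNames := SORT(byLastName, lastName, firstName, LOCAL); // each node sorts its share

    OUTPUT(sortedNames);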

ECL is a declarative language, so ideally the programmer doesn't need to define the control flow of the program. A large number of data operations are commutative in nature, and since transferring (big) data is normally very expensive, the optimizer can, for example, move a filter closer to the beginning of the workflow to reduce the amount of data carried into subsequent operations. Other optimizations, such as lazy execution, are also used to eliminate throwaway code and data structures.
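
As a concrete illustration, here is a hedged ECL sketch of that filter optimization (the filenames, layouts, and field names are all assumptions, not taken from any real example). The filter is written after the JOIN, but because ECL is declarative the optimizer is free to apply it to the customer file before the join, so far less data needs to move between nodes:

    // Hypothetical layouts and logical file names.
    CustomerRec := RECORD
        UNSIGNED4 id;
        STRING2   state;
    END;
    AcctRec := RECORD
        UNSIGNED4   custId;
        DECIMAL10_2 balance;
    END;

    customers := DATASET('~examples::customers', CustomerRec, THOR);
    accounts  := DATASET('~examples::accounts', AcctRec, THOR);

    OutRec := RECORD
        UNSIGNED4   id;
        STRING2     state;
        DECIMAL10_2 balance;
    END;

    OutRec joinPair(CustomerRec l, AcctRec r) := TRANSFORM
        SELF.balance := r.balance;
        SELF := l;  // id and state come from the customer record
    END;

    joined := JOIN(customers, accounts, LEFT.id = RIGHT.custId, joinPair);

    // Written last, but eligible to be pushed ahead of the JOIN by the optimizer.
    floridaOnly := joined(state = 'FL');
    OUTPUT(floridaOnly);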

The specific execution plan varies depending on what the particular data workflow (ECL program) looks like, and the system provides a graphical display of the exact execution plan that the optimizer determined to be the most appropriate for that workflow. Once you submit a workunit from the ECL IDE, you can visualize the execution plan for that workunit and even inspect key metrics at each intermediate step, including the number of data records processed, the data skew, and the specific operation represented. A complex execution graph is normally subdivided into multiple subgraphs, and many of those operations are parallelized when there is no need for a synchronization barrier (or unless the optimizer determines that excessive parallelization would hurt overall performance).

We recommend downloading the Virtual Machine and/or binaries of the platform, http://hpccsystems.com/download, and playing with some of the examples we provide in our portal to understand how this all works in practice. Although in real life you would never need to tinker with the platform itself, if you feel inclined to see how things work under the hood, feel free to download the C++ source code of the HPCC platform from our Git repository, https://github.com/hpcc-systems, and look at the implementation details of the platform and of the ECL compiler and optimizer.

Another useful reference is the PigMix benchmark on HPCC:
http://hpccsystems.com/Why-HPCC/HPCC-vs ... pigmix_ecl

Please post a reply if you need any help, or if you have any other questions.
admin

Thu Dec 08, 2011 3:18 pm

Follow-up question from a community member:

Regarding the statement "which contributes to the well-known 'long tail problem' in Hadoop":

Do you have any examples, diagrams, or descriptions explaining why HPCC doesn't have the long-tail problem?
admin

Thu Dec 08, 2011 3:19 pm

There are multiple causes of the "long tail" problem in Hadoop. Some of them, related for example to data skew and slow nodes, are amplified by the fact that multiple MapReduce cycles are normally serialized over a single data workflow (when, for example, performing a multi-way join, working through a graph traversal problem, or executing a clustering algorithm).

HPCC uses several mechanisms to minimize the effect of these long tails: the additional parallelization described in the previous post; a record-oriented filesystem, which ensures that each node receives an approximately equal load (in terms of the number of data records processed per node, even for variable-length and/or XML record layouts); and enough instrumentation to make the user aware of data skew levels at each step of the data workflow's execution graph.
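
To make the even-load idea concrete, here is a hedged ECL sketch (the filename and layout are assumptions): distributing on a hash of a well-spread key gives every node an approximately equal share of the records, and a LOCAL operation then runs independently on each node, so no single overloaded node becomes the long tail:

    // Hypothetical record layout and logical file name.
    LogRec := RECORD
        STRING36  sessionId;
        UNSIGNED8 eventTime;
    END;

    logs := DATASET('~examples::weblogs', LogRec, THOR);

    // Hash distribution evens out per-node record counts...
    balanced := DISTRIBUTE(logs, HASH32(sessionId));

    // ...so each node sorts a similar amount of data, independently.
    perNode := SORT(balanced, sessionId, eventTime, LOCAL);
    OUTPUT(perNode);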

Please let us know if you need more information.
admin

