Webinar Wrap-up: Baselines & Benchmarks – Making Open Source Big Data Analytics Easy
As most of us in big data know, bringing heterogeneous data into a homogeneous data warehouse environment is one of the most daunting aspects of any big data implementation. In our December webcast Baselines and Benchmarks, Arjuna Chala, Sr. Director of Special Projects for HPCC Systems®, addressed a question frequently asked by our customers and prospects – how do HPCC Systems and Apache Spark compare in terms of performance?
Even though Apache Spark and HPCC Systems Thor can be thought of as complementary, there is interest in comparing their performance with data analytics-related benchmarks, specifically transformation, cleaning, normalization, and aggregation. Arjuna explained how HPCC Systems Thor’s performance compares to Apache Spark utilizing standard benchmarking methodologies.
During the webcast, he also demonstrated how these benchmarks and HPCC Systems can help you establish new baselines that:
- Improve the speed and accuracy of the transformation, cleaning, normalization, and aggregation processes
- Enable efficient use of developer resources and development budgets
- Facilitate the use of standard hardware, operating systems, and protocols
Arjuna wrapped things up by walking through use cases and answering several interesting questions from listeners.
We invite you to listen to the full webcast at the link below. If you have any questions, please contact Arjuna at arjuna.chala@lexisnexisrisk.com.
Helpful Links:
- Webcast Baselines and Benchmarks
- Slide Presentation
- Technical Whitepaper: Comparison of HPCC Systems® Thor vs Apache Spark Performance on AWS
- GitHub information: Comparison of HPCC Systems® Thor vs Apache Spark Performance on AWS
If you prefer to read the information, the full webinar transcript follows:
Hello and thank you for joining us today. My name is Jessica Lorti and I’d like to welcome you to our webcast Baselines and Benchmarks – Making Open Source Big Data Analytics Easy. Our speaker today is Arjuna Chala. He’s the Senior Director of Special Projects here at HPCC Systems, and he’ll be walking you through some wonderful information about the HPCC Systems platform as well as some competitive benchmarks. All right Arjuna, I’m going to turn it over to you.
Thank you, Jessica. Good afternoon everyone. Thank you for joining the Baselines and Benchmarks webinar. Today we will address one of the questions most frequently asked by our customers and prospects: how do HPCC Systems and Apache Spark compare in terms of performance? Let us begin with our first poll of the day: are performance benchmarks important in selecting a platform? Please take some time to respond to the poll. In the meanwhile, let us continue. To me, it’s like selecting a football player in the NFL draft. He’ll only be considered for selection if he can run the 40-yard dash in under 5 seconds. It is the entry criterion, but it is not the only criterion. Therefore the answer is yes, benchmarks are important, but only as the entry criterion. Ease of use, functionality, reliability, scalability, and security could be some other important considerations. What are baselines? Establishing baselines is about making sure you’re not comparing apples to oranges. For example, it would be an apples-to-oranges comparison if we were to compare HPCC Systems or Spark to MongoDB; hence, understanding the baselines is important.
For example, HPCC Systems and Spark are similar in how they solve data problems, in their technology architecture, and in their programming model. We will address the details of each in the next few sections. Let us start with how HPCC Systems and Spark are used to solve data problems. For starters, let us review some of the common data problems: unclean data, like incorrect spelling or an incorrect age range; misclassified data, like incorrect gender classification, male vs. female; missing values, like the city or state in an address; duplication, like repeated customer names; data noise, where data that is not used to solve the problem is recorded; and extraction of data from complex content like HTML documents. And how do we deal with terabytes of data, which require redundancy, disaster recovery, etc.? How do we deal with many sources of data coming from subsidiary companies, CRM systems, ERP systems, product systems, logs, etc.? And how do we deal with complex technology solutions like Hadoop, Spark, and HPCC Systems? Do you agree?
Well, we have a slightly different perspective. While we agree that some of them are clearly problems, others are actually the solutions. For example, ingesting data from multiple sources might be a good thing. It might help with the currency of the data and eliminate inaccuracies. Let us consider an example: the case where a customer’s contact information is populated from three different sources. So what is the correct data? Maybe all of them; maybe the latest recorded address; maybe the one recorded by the most reliable of the sources. So you have more choices to make the decision if you consider more than one source of data. If you had relied on one source, you would have to stick with it even if it is incorrect. In addition, we can infer more information from multiple sources. For example, we could infer the migration path of a customer based on his address changes over time. Let us review a few examples of data problems and potential solutions. Here’s the next poll question: for every new source of data, how many of you execute a data profile report? Please take some time to respond to the poll. Meanwhile, let us continue. By the way, previous polls have indicated that less than 10% of developers profile their data. Our experience has shown that data profiling is a very important first step that helps you understand the data. Let us see how data profiling can be useful. We recently concluded a hackathon where we used simulated telematics data. The simulation covered a duration of 24 hours, simulating traffic in Cologne, Germany, with thousands of vehicles at a one-second rate. The data contained five attributes: relative time, vehicle ID, X and Y in meters, and speed in meters per second. When we run the built-in HPCC Systems profile on the data, the partial report shows the total records for each column in the data, the fill rate, the fill count, the cardinality, and the best type, in addition to the data patterns for each field.
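For readers who want to try this themselves, here is a minimal sketch of profiling data shaped like the telematics example. It assumes the DataPatterns bundle (bundled with recent HPCC Systems platform versions) and uses hypothetical file and field names; only the five attributes come from the webinar.

```ecl
IMPORT DataPatterns;

// Hypothetical layout for the simulated telematics data described above
TelematicsRec := RECORD
    UNSIGNED4 rel_time;    // relative time in seconds
    STRING    vehicle_id;  // vehicle identifier
    REAL8     x;           // X position in meters
    REAL8     y;           // Y position in meters
    REAL8     speed;       // speed in meters per second
END;

// Hypothetical logical file name for the ingested simulation data
telematics := DATASET('~demo::telematics::cologne', TelematicsRec, FLAT);

// Produces fill rate, fill count, cardinality, best type, and data patterns per field
profileReport := DataPatterns.Profile(telematics);
OUTPUT(profileReport, ALL, NAMED('telematics_profile'));
```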
From this report, we can deduce that we’re dealing with around 700,000 vehicles, that there are missing values for the Y attribute in two records, that there are unexpected type values for the Y and speed columns in some records, and that there is an incorrect speed value in one record. So as you can see, the data profiling report can provide a lot of valuable information about your incoming data. I hope the example shows why data profiling is very important. It’s poll time again: is deduplication hard? Previous polls have indicated that more than 60% of developers think it is not too hard. Let us consider an example: “Flavio Villanustre, Atlanta” and “Javio Villanustre, Atlanta.” These are two records provided to us to see if they match. Now, if you are literally matching “Flavio Villanustre, Atlanta” to “Javio Villanustre, Atlanta,” your result would be no match.
But is that correct? The correct result is that it is a match, because there is only one Flavio Villanustre in the US, and he is in Atlanta. Knowing the specificity of the name before the matching is executed is very important. Conversely, let us look at another example: “John Smith, Atlanta” and “John Smith, Atlanta.” If you were to match these in a literal sense, you would say both of them actually match. The correct answer is that it is not a match, because there are too many John Smiths in the Atlanta area, and with just the first name, last name, and the city we will not be able to determine whether two John Smiths in the Atlanta area are actually the same person.
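As an illustration only (this is not code from the webinar), one simple way to express that “specificity” idea in ECL is to count how many people share a name within a city before trusting a literal match; the layout, logical file, and field names below are hypothetical.

```ecl
// Hypothetical person layout and logical file, used only for illustration
PersonRec := RECORD
    STRING fname;
    STRING lname;
    STRING city;
END;

people := DATASET('~demo::person_master', PersonRec, FLAT);

// How many people share each (first name, last name, city) combination?
nameFrequency := TABLE(people,
                       {fname, lname, city, UNSIGNED4 cnt := COUNT(GROUP)},
                       fname, lname, city);

// A literal name/city match is only conclusive when the combination is rare:
// "Flavio Villanustre, Atlanta" (count 1) identifies a person; "John Smith, Atlanta" does not.
highSpecificity := nameFrequency(cnt = 1);
OUTPUT(highSpecificity, NAMED('high_specificity_names'));
```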
Now, here’s another poll: do you typically depend on a single source for your data? Please take some time to respond to the poll while we continue. The answer really depends, but in most cases a single source is not enough. Let us consider an example. Again, we have two records here: Marcia Marsupial and Karen Kangaroo are the records provided to see if they actually match; are they the same person? When we look at data from one source, we discover that the very first record, Marcia Marsupial, matches a record in our database, but with a slightly different address. In the same data source, we also notice there’s one more record for the same person, but with a different address. And slowly but surely you can see that the address that Karen Kangaroo has starts converging with what Marcia, the second record here, is indicating. When we compare against a few more sources of data, you can see that in the second source Marcia has actually changed her last name to Kangaroo and still continues to live at an address that is already familiar to us. And lo and behold, with a few more sources of data and over time, we can see how both records actually converge to being the same person.
So to answer the question, Marcia Marsupial and Karen Kangaroo are actually the same person. Now let us look at the data maturity process, which is used to convert raw data into intelligent data, where the intelligent data is used to enable end-user decisions. In HPCC Systems the data maturity process has three workflow steps. The first step is the collect step, which really involves collecting data from as many data sources as we can. You want the same data from different sources. Why do we do this? Because data from one source tends to contain inaccuracies and holes; combining data from multiple sources will enhance the data and remove these holes. This was also shown to you in the earlier example.
The second step is the learn step. The learn step is where the data is profiled, cleaned, and standardized, and then enhanced, deduped, and related. An example of enhancing the data would be to calculate a date field that did not exist in the original data. An example of relating is to associate two company records because one is a subsidiary of the other, hence creating a very rich, connected data graph. The third and final step is the decision step. The decision step is about accepting inquiry data and providing results. The results can be in real time or batch. The decision step really relies on the data that has been learned in the learn step. You’ll see a few use cases related to this later in the presentation.
Jumping into the technology, the second baseline relates to how similar the technologies are. Both HPCC Systems and Spark are designed on the divide-and-conquer principle: divide your data into partitions and process them independently and in parallel. Why? Because most data operations are inherently parallel. In this simple example, a single data file is broken down into four partitions. Why? Because there are four processors, or executors, available to process this data. Now, a little more detail on how all this applies to HPCC Systems and the workflow steps we talked about: the collect step, the learn step, and the decision step. HPCC Systems has two cluster components, Thor and ROXIE. Thor is specifically architected to accomplish the learn step. Thor is the batch-oriented system that is similar in functionality to Spark. Thor can scale from a single node to thousands of nodes. In contrast to Thor, ROXIE is designed to accomplish the decision step.
ROXIE provides a real-time interface. Similar to Thor, ROXIE can scale from one to hundreds of nodes. Both Thor and ROXIE are programmed using ECL, an open language for data processing. We will cover more on ECL later in the presentation. Now, going into a little more detail on the Thor cluster: Thor is based on a master-slave architecture. Each Thor slave process is responsible for processing a data partition. Data on Thor is partitioned by the number of available slave processes. The master process acts as a coordinator for the slave processes. Dali is the metadata server and is equivalent to the name node in the Spark and Hadoop world. The DFU server is responsible for importing data files and partitioning them.
Thor, as you can see, can run in a single-server or a multiple-server configuration. ROXIE, on the other hand, is based on a peer-to-peer architecture. ROXIE is a REST/SOAP service provider, so any service that is deployed to ROXIE can be exposed as a REST or SOAP service. ROXIE has primarily two components, a server and a worker. Just like a Thor slave, the worker is responsible for processing a partition of the data. The ROXIE server component, on the other hand, is responsible for processing the client requests and coordinating with the workers. Time to look at a few use cases. In the insurance fraud use case, we will demonstrate how criminals can be identified with big data analytics. To accomplish this use case, data is collected from thousands of sources, for example, courthouses, online public records repositories, address repositories, phone repositories, motor vehicle-related records, police records like accidents, etc.
In the learn phase, a rich, people- and business-centric social network graph is created. Once we have a rich social network graph, it helps us to solve fairly complex problems. Let us review an insurance fraud use case: an insurance company detected frequent accidents in and around a hospital. The orange circles indicate accidents. The smaller circles around the orange circles indicate the people in the accidents. The insurance company suspected fraud in these accidents because they were occurring very frequently and also very close to a hospital. Now, the insurance company was not able to pinpoint the perpetrators of the fraud in this use case, so they provided the data to LexisNexis. Once we took the data, we were able to lay it on top of what we had previously learned with regard to the population of the United States, because of the rich social network graph that was available through the learn process. The final outcome was that we learned that all the accidents were actually interconnected and were created by two families, and the fraudsters were identified.
Let’s look at yet another use case. Here we learn how big data analytics helps in agriculture. In the collect process, data is collected from farm management systems for seeding, harvesting, irrigation, soil conditioning, etc.; real-time data from sensors, drones, robots, and farm machinery; weather data; product data from manufacturers; and data from other experts. Again, similar to what we described earlier with regard to the learn process, we create a very rich graph of related data. The decision process determines better outcomes for farmers and consumers. Hence, the analytics creates a better ecosystem to monitor, regulate, and improve all outcomes in the farmers’ ecosystem.
The third baseline: both HPCC Systems and Spark provide a very similar programming model. Even though they use different programming languages, the basic principle is parallel processing on data partitions. ECL is based on the principles of a dataflow language. The ECL compiler converts the ECL program into a dataflow graph where each node in the graph is a data operation and the connectors represent the data. This is very similar to how SQL is compiled and executed. Functions in both ECL and Spark are designed to work on partitions. So we bring up this slide again to show you that, at the core, both HPCC Systems and Spark are based on a similar programming model. This brings up another frequently asked question: how does somebody who knows Spark, for example, learn ECL? To illustrate that, we created a quick mapping table.
For example, to transform every record to another record, in ECL you would use the PROJECT function. To simply execute a filter, in ECL it is just a dataset with a filter expression. To transform every record to one or more other records, you would use the ECL NORMALIZE function. GROUP BY is very straightforward in ECL as well; you can use functions like ROLLUP with GROUP, or DENORMALIZE with GROUP, and so forth. So literally, for every function that is there in Spark, there is an equivalent function in ECL, and many more options are available in ECL that are not in Spark. So this is a simple example that takes an incoming record, which is really just a string, extracts the information out of it, and produces another dataset that has a well-defined format of first name, last name, and age.
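The exact code from the webinar slide is not reproduced here, but a minimal sketch of that kind of string-to-record transformation with PROJECT might look like the following; the sample values, field names, and use of Std.Str.SplitWords are illustrative assumptions, not taken from the slides.

```ecl
IMPORT Std;

// Raw input: each record is a single comma-delimited string (sample values are made up)
RawRec := RECORD
    STRING line;
END;

rawDS := DATASET([{'John,Smith,42'},
                  {'Jane,Doe,35'}], RawRec);

// Target layout with a well-defined format: first name, last name, and age
PersonRec := RECORD
    STRING    fname;
    STRING    lname;
    UNSIGNED1 age;
END;

// One-record-in, one-record-out transform, the ECL equivalent of a Spark map
PersonRec extractPerson(RawRec L) := TRANSFORM
    parts      := Std.Str.SplitWords(L.line, ',');
    SELF.fname := parts[1];
    SELF.lname := parts[2];
    SELF.age   := (UNSIGNED1) parts[3];
END;

people := PROJECT(rawDS, extractPerson(LEFT));

// A simple filter is just the dataset with a condition
adults := people(age >= 18);

OUTPUT(adults, NAMED('extracted_people'));
```

A sketch like this can be pasted into the ECL Playground, which is what Arjuna demonstrates next.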
In fact, I’m going to show you how to execute this in the browser using the ECL Playground. Let me switch over to my browser; before that, I’m going to take the code – I hope you can see it. On the HPCC Systems page, there’s a Try Now button; if you click on that, it will take you to the playground area. Click on Try ECL and it will bring up the ECL Playground. Here I’m just going to paste this code and click on Submit. Very easy. So you can see the results straight away. Again, just to recap what we just did: we extracted information from records that were literally string records and converted it to a more standard form that contains first name, last name, and age. Going back to the presentation, the final benchmark is about performance, and this, I think, is the most interesting part of the presentation: how does HPCC Systems really compare to Spark?
So to accomplish this benchmark, we obviously put a lot of effort into making sure that all the baselines were met, and also that the assumptions about common hardware and software configurations were met. Getting into specifics as far as what hardware we used for both tests, whether for Spark or for HPCC, we used r3.2xlarge instances on AWS. Each has eight virtual CPUs per instance, where one virtual CPU is equal to one hardware thread, around 64 GB of memory per instance, and 160 GB of SSD storage. So this was a common hardware setup across both HPCC Systems and Spark. On the software setup side, we set up an HPCC Systems Thor cluster, which is really equivalent to the Spark cluster, with one master node and three slave nodes. We did the same setup with regard to Spark as well: one master node and three slave nodes.
On the data side, we used almost identical data sets, but we tested two workloads: one was a 100GB workload, and the second was a 200GB workload. Let’s look at the 100GB workload test. If you look at the results, you can clearly see that Spark outperforms HPCC Systems in the sort-by-key test; it looks like it is about twice as fast. But on the count filter, HPCC Systems and Spark are almost neck and neck. In the case of the sort-by-key-integer test, you can see the same neck-and-neck result as well. They’re very close; it looks like Spark is a little faster, but HPCC Systems is very close.
Whereas in the aggregate-by-key and aggregate-by-key-integer tests, you see HPCC Systems outperforms Spark. If you go to the next test, which was around 200GB, you’ll see that HPCC Systems and Spark are actually neck and neck on the sort by key, and HPCC Systems performs better on the count filter, sort by key integer, aggregate by key, and aggregate by key integer as well. So it seems like, as we increase the data, HPCC Systems scales much better than Spark, from the results that we’ve seen. By the way, these results and the test harness for them are available on our site for you to try and test as well. So it’s completely open; the links are available on the HPCC Systems site, and you can install the software yourself and try it out. Finally, I want to give a shout-out to Tim Humphreys from LexisNexis, Vivek Nair, a Ph.D. student from NC State, and James McMullan from LexisNexis, who were really instrumental in performing these benchmarks. Thank you.
All right, Arjuna, thank you very much. That was very informative. At this point, I’d like to let everybody know that we are ready to accept questions. Anyone who has any questions, please do submit them; there should be a question box at the bottom of the screen. In the meanwhile, I’d also like to tell people that there’s an attachment there; you can certainly download the presentation for future reference if you would like to. All right, Arjuna, it looks like we have our first question here. The question is: which parts of the HPCC Systems platform are open source?
HPCC Systems is completely free and open source. We also provide free training and in fact six months of free support as part of the package. So the short answer is everything is open source.
All right, it looks like we have another question here, Arjuna. You showed ECL Watch, and it looks like there was a lot of information on that screen. Would you be able to show us that again, and maybe show us what’s in ECL Watch and what we can do with it?
Sure. So here in this particular screenshot, what we’re seeing is actually the ECL Playground. This is the area where you can look at different functions in ECL and examples of those functions. For example, if I want to look at the JOIN function, it will populate a sample program showing how the JOIN function works; you can learn from what you see, execute it, and observe the results. So basically the ECL Playground is an area for you to play with ECL and also observe the results. Another thing that you can actually do here … let me try to find a simpler example. In my presentation, we also talked about how ECL is a dataflow programming language, in the sense that the code that you write is converted to a dataflow, which is shown here on the right-hand side, and that is what really gets executed. So each node in the dataflow is an operation.
And the data flows from one node to the other. This is actually very important when you are debugging complex problems, because by observing the dataflow you can look at information like skews, how your data is being optimized, and how many parallel steps there are. So it’s a very informative way of debugging your dataflow process. What I’m showing you, and what is available on the public playground, is the ECL Playground area. But ECL Watch by itself is much richer than what I’m showing you: it has the ability to monitor your workunits, to look at the files on your landing zone, to import files from your landing zone into the HPCC Systems clusters, to deploy code, etc. So ECL Watch provides the complete harness for developers to interact with the HPCC System. I hope that answers your question.
Yes, that’s perfect, Arjuna. It looks like we have another question from our audience: what are the developer productivity differences between Spark and HPCC? Can you show us a code comparison?
I don’t have a code comparison right here, but we can talk through some of that. The key aspect is really that in the case of Spark, the developer controls the code; it is an imperative programming paradigm. So, for example, when you have to parallelize your workloads in Spark, you would have to work on understanding how to divide the data and how the partitions are developed, and then you would also have to understand how many executors need to be allocated to execute these partitions. So the parallelization aspect is left to the developers, and it’s not a very easy task. Whereas in the case of HPCC Systems, the parallelization is determined based on how the data flows through the system, and this is not up to the developers. This parallelization map, or dataflow, is developed by the optimizer by looking at your ECL code and interpreting how the data flows through it. This benefits developers greatly, because they don’t have to worry about how the whole data task is parallelized; they can really focus on how to solve the problem. I hope that makes sense.
Thank you. It looks like we have another question here, Arjuna. Can you give examples of some things that are easy to do in ECL that are much harder in Spark? For example, text extraction or anything else?
Yes, the example I showed earlier, using the PROJECT function to extract data, was a fairly straightforward way of accomplishing the task of extracting data from text. So, as I said, ECL provides a well-defined set of functions that is easy to use for accomplishing your data workloads, whereas in Spark, most of the time you will have to develop complex functions to accomplish what has already been developed in ECL. That is because, from the ground up, the developers of ECL based the product on real-world problems, so every function solves a specific, useful pattern in the real world. Jessica, I also have a few other questions here.
Yes absolutely.
One of the questions was: can HPCC Systems integrate with HDFS? This is a very interesting question. We are actually in the process of working on this integration, so it should be available at the beginning of next quarter; I’m thinking it should be available in the February time frame. Related to that, there’s one more question: what are the resources that are available to learn ECL and HPCC? To answer this question, obviously there’s a wealth of information on the hpccsystems.com website. Also, you can enroll in classroom training. There is free online training, which really mimics the classroom setup but is completely online and free. We also do custom workshops based on requests. So there is a wealth of information there.
Right, that’s an excellent point, and to expand on what Arjuna said: we have a wealth of training materials, and a lot of them are available online, 24/7, for free. Please do check those out on our website.
There’s one more question that kind of relates to the HDFS integration part: can Spark consume data from HPCC Systems like it does with HDFS? Again, this is another connector that we are currently working on, and it will be available in the first quarter. Those are the questions I had seen on my side.
Arjuna, those are the same ones that I’ve seen. Would you be able to remind us one more time where this is located on GitHub? Which repository is it in?
It’s github.com/hpcc-systems.
All right, thank you very much. I don’t see any more questions coming in. However, please do download the presentation. We had a technical glitch at the very beginning and our slides didn’t advance appropriately, so we weren’t able to show Arjuna’s bio and all of his contact information, but you can see all of that information in the downloaded presentation. Arjuna can also be reached via his Twitter handle, which is @arjunachala, or you can certainly reach out to us through the HPCC Systems website via the Contact Us button. We will make certain that any comments or additional questions you have for Arjuna get to him, or he can also be reached at his email address, which is arjuna.chala@lexisnexisrisk.com.
Again, I apologize that we weren’t able to show that slide. Please don’t hesitate to reach out to us. Arjuna is focused on our special projects, so he has very deep technical expertise on the HPCC Systems platform as well as the special implementations that our customers are doing throughout the world. He has worked on some very amazing projects, and he can address any questions or concerns you have. That’s all we have for today, so please do visit us at hpccsystems.com. Thank you.