I’ve read a lot recently about how the IT industry would like to perpetuate the Moore’s Law principle that computer power doubles every 18 months. Most seem to agree, though, that it can’t continue in the same way, given the prohibitive cost of making chips with transistors that have shrunk so small you can’t see them with the naked eye.
It’s now commonplace to find that even your desktop computer has multiple CPUs and/or cores to get better performance than one chip on its own can provide. So it’s no coincidence that the HPCC Systems® 6.0.x series focusses on structuring our software to take advantage of the various multicore hardware technologies that are around at the moment.
We pride ourselves on the speed of HPCC Systems and we’re always looking for ways to ‘get more’. We pay close attention to what is going on in the hardware world now and also to what may be coming in the future. But to put all of this into perspective, it’s worth taking a short trip down memory lane.
Way back in what seems like the distant past (around 13 or so years ago), your 400-way Thor (Data Refinery) cluster would have consisted of 400 physical machines, each one running its own Thor slave process. Since then, we have made it possible to configure your 400-way Thor as, for example, 40 physical machines with each one running 10 independent Thor slaves, allowing you to take better advantage of multiple cores while saving on cost, space, and power requirements at the same time. One problem with this configuration is that the independent Thor slave processes aren’t aware that they are located on the same machine and so can’t take advantage of it. They don’t share any memory, and the same information may need to be sent to every slave process.
Now, however, you can have your cake and eat it with HPCC Systems 6.0.x.
Introducing the new virtual-slave Thor
You can now configure Thor clusters so that each physical node has a single Thor slave process, running multiple “virtual” slaves using what we are calling “channels”. The slave channels within a single process can communicate with each other at the speed of memory and can share a single copy of any global data (such as the right hand side of a lookup join) or caches. For some activities, such as the smart/lookup join, having access to all available memory provides significant performance improvements.
There is also the additional benefit that start up and management of the cluster is faster and simpler with this new configuration because there are fewer processes to monitor and maintain.
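The idea of channels sharing a single copy of the right hand side of a lookup join can be sketched in a few lines of Python. This is only an illustration of the concept, not HPCC Systems code: the names `RIGHT_SIDE`, `partition`, `lookup_join_channel` and `run_channels` are hypothetical, and threads stand in for the slave channels within one process.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical right-hand side of a lookup join, held ONCE in the process.
RIGHT_SIDE = {1: "alpha", 2: "beta", 3: "gamma"}

def partition(rows, channels):
    """Deal rows round-robin into one partition per channel."""
    return [rows[i::channels] for i in range(channels)]

def lookup_join_channel(left_rows, right_side):
    """Join one channel's partition against the shared right-hand side."""
    return [(key, payload, right_side[key])
            for key, payload in left_rows if key in right_side]

def run_channels(left_rows, right_side, channels=4):
    """Run each channel as a thread; every thread reads the same dict object."""
    parts = partition(left_rows, channels)
    with ThreadPoolExecutor(max_workers=channels) as pool:
        results = list(pool.map(
            lambda part: lookup_join_channel(part, right_side), parts))
    # Flatten the per-channel outputs into one result list.
    return [row for part in results for row in part]

left = [(1, "a"), (2, "b"), (4, "c"), (3, "d"), (1, "e")]
joined = run_channels(left, RIGHT_SIDE, channels=2)
```

With independent slave processes, each process would need its own copy of `RIGHT_SIDE`. Here every channel reads the same object, so the memory cost of the lookup table is paid once per process.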
Affinity support also included
A related (and complementary) new feature in HPCC Systems 6.0.x is the use of CPU affinity to allow Thor slave processes to be bound to specific cores on a machine. Typically, when CPU cores from the same physical CPU share access to memory, the synchronization overhead is a lot smaller than when CPU cores situated on different physical CPUs are sharing memory. On a machine with multiple physical CPU sockets, therefore, it is advantageous to run multiple physical Thor slave processes (one per physical CPU socket, using affinity to restrict it to just the CPU cores on that socket), with each process running multiple virtual slave channels (typically one per CPU core). This way, there is no need for cross-socket memory synchronization, but within a single CPU socket you get all the benefits of sharing resources as described above.
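As a rough illustration of that layout, the sketch below plans one slave process per socket with one channel per core. It is not how HPCC Systems implements affinity: the function names and the two-socket `layout` are assumptions made for the example, and `os.sched_setaffinity` is the standard Python call for binding a process to cores on Linux.

```python
import os

def plan_slaves(socket_cores):
    """One slave process per socket, one channel per core on that socket.

    socket_cores maps a socket id to the list of CPU core ids on that socket.
    """
    return [
        {"socket": socket, "cores": set(cores), "channels": len(cores)}
        for socket, cores in sorted(socket_cores.items())
    ]

def bind_current_process(cores):
    """Restrict the calling process to the given cores (Linux only).

    Returns False on platforms where os.sched_setaffinity is unavailable.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)  # pid 0 means the calling process
        return True
    return False

# Hypothetical machine: two sockets with four cores each.
layout = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
plan = plan_slaves(layout)
```

Each slave process would call `bind_current_process` with its own entry from `plan` at start-up, so that its channels only ever run on cores from a single socket.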
We ran some tests using a join dataset to get some metrics for you about the performance benefits of using affinity. We ran it using 2 slave processes, each with 4 channels, providing a total of 8 partitions. When affinity was disabled, total runtime was 278 seconds. When affinity was enabled, the same job ran 9% faster, taking 255 seconds. We also ran the same job again, but this time deliberately set it up so that each slave always allocated its memory on a remote socket rather than locally. The run time increased to 374 seconds. This shows the increase in performance when using affinity, and we expect the benefits to be more pronounced when there are multiple Thor instances running concurrently on the same hardware.
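For anyone who wants to check the arithmetic behind those figures, here is a small Python sketch. The function names are our own for this example; the only inputs are the three run times quoted above.

```python
def runtime_reduction(baseline_s, new_s):
    """Percentage of the baseline run time that was saved."""
    return (baseline_s - new_s) / baseline_s * 100

def speedup_percent(baseline_s, new_s):
    """How much faster the new run is, as a percentage of its own run time."""
    return (baseline_s / new_s - 1) * 100

AFFINITY_OFF = 278   # seconds
AFFINITY_ON = 255    # seconds
REMOTE_MEMORY = 374  # seconds

faster = speedup_percent(AFFINITY_OFF, AFFINITY_ON)     # about 9.0
saved = runtime_reduction(AFFINITY_OFF, AFFINITY_ON)    # about 8.3
penalty = speedup_percent(REMOTE_MEMORY, AFFINITY_OFF)  # about 34.5
```

The "9% faster" figure corresponds to the ratio 278 / 255, which is the same as saving roughly 8.3% of the original run time. Forcing remote memory allocation made the job take about 34.5% longer than the run with affinity disabled.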
We have enabled the use of affinity by default when running multiple physical slaves on the same machine.
Start using parallel activity execution
We’ve also implemented a new feature to allow activities to process multiple rows in parallel on both Thor and Roxie (Data Delivery Engine). There is a new ECL language keyword, PARALLEL, which you can use to execute part of a graph that would normally have been executed one row at a time, on separate ‘strands’ running in parallel. This is currently only available for the PROJECT activity (and also the PARSE and aggregate activities in Roxie), but we plan to extend this capability to other activities in the near future.
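To give a feel for what running a PROJECT on separate strands means, here is a conceptual sketch in Python rather than ECL. It does not show the PARALLEL keyword's syntax or the engine's internals: `transform`, `project_in_strands`, the block size and the choice to preserve input order are all assumptions made for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    """Stand-in for a PROJECT transform: derive a new field from each row."""
    return {**row, "total": row["price"] * row["qty"]}

def project_in_strands(rows, fn, strands=4, block_size=2):
    """Split rows into blocks and transform the blocks on parallel strands.

    pool.map returns results in submission order, so this sketch keeps
    the output rows in the same order as the input rows.
    """
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    with ThreadPoolExecutor(max_workers=strands) as pool:
        done = list(pool.map(lambda block: [fn(r) for r in block], blocks))
    return [row for block in done for row in block]

rows = [{"id": i, "price": i + 1, "qty": 2} for i in range(5)]
projected = project_in_strands(rows, transform)
```

Python threads are used here only to show the shape of the idea. Each row is transformed independently of the others, which is what makes an activity like PROJECT a natural candidate for this kind of parallel execution.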
There’s a very definite theme running all the way through these features. We want to make sure that you are getting the most out of all available CPUs to make HPCC Systems and your queries run faster. Today’s servers may have dozens of CPU cores available but tomorrow’s will have hundreds or thousands…
- The affinity metrics were obtained using the join dataset (04ef_joinl1s.ecl) from our Performance Test Suite, which is available as a bundle in our ECL-Bundles GitHub repository.
- Read more about the HPCC Systems multi-slave Thor, affinity support, and PARALLEL keyword.
- Information on other new features included in the HPCC Systems 6.0.x series can also be found on our website.
- Watch a video discussion about some of the highlights of HPCC Systems 6.0.x.
- Download the latest HPCC Systems 6.0.x release and read the supporting documentation.
- Download the VM and take the latest HPCC Systems 6.0.x release for a test drive.
- Find out more about using the new ECL PARALLEL keyword in the HPCC Systems ECL Language Reference.