HPCC Systems 6.0.x Feature Highlights Part 2
There’s so much to say about the HPCC Systems 6.0.x release that one blog simply isn’t enough.
If you haven’t read the first one, you can find it here. We have also written a blog specifically about the performance improvements which provide the ability to use various multicore technologies currently available.
You can expect the HPCC Systems 6.0.x series to perform even better and faster as a result of a number of performance improvements. You should also find it easier to use and there are many new features to help you get the most from HPCC Systems.
Take a look at the following performance improvements:
Optimized merge sort for large numbers of cores
This is a change to the underlying implementation used for sorting and requires no changes to your ECL code. The goal is for sorts to execute faster and make more efficient use of multi-core processors; in other words, sorts take less time and place a lower load on the system. It is particularly relevant for systems with large numbers of parallel threads of execution (e.g. the POWER8 architecture), but also shows significant improvements on Intel machines. Here are the improved timings for sorting rows in memory:
|System|qsort|New merge sort|
|Intel Xeon / 16 cores|||
|POWER8 / 160 execution threads|||
Note: The unstable quick sort used in this comparison is quicker than the sort previously used in the HPCC Systems 5.x.x series.
The new merge sort makes a significant difference, especially when very large numbers of cores are available. The POWER8 runs the large sorts faster than the Xeon despite its lower clock rate.
Factors other than the time taken to sort in memory also affect sorting performance. We are introducing improvements in disk reading and parallel execution in HPCC Systems 6.0.0 and will continue to do so in future releases (see below).
Parallel Activity Execution
There are many examples where ECL operations are CPU bound but could execute more quickly if they were spread over multiple CPU cores in parallel. Examples include complicated PROJECTs, parsing text and joining records. In HPCC Systems 6.0.0, the engines are being refactored to make it much easier for activities and sections of the graph to execute in parallel. This performance enhancement is available for Roxie in 6.0.0 and is coming soon for Thor.
Affinity support in Thor
If you are running more than one Thor slave process on a machine with multiple CPU sockets, you’re going to like this improvement. Binding each process to a single socket improves overall performance by reducing inter-socket overheads (especially those related to the L3 cache). If the Thor nodes execute slave processes for different Thor instances, this change also allows you to isolate the processes and ensure that one process cannot dominate the others.
New option in Roxie to bind queries to cores
This option has been added because we believe that Roxie will perform better in some circumstances if thread affinities are used to restrict a query’s threads to a subset of the cores on a machine. To use it, either set coresPerQuery in RoxieTopology.xml or set bindCores in the workunit debug values in the query XML. Either setting indicates that execution of the query should be bound to, at most, N of the cores that the Roxie process itself is bound to.
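Since workunit debug values can be set from ECL with #OPTION, a per-query setting might look like the following sketch (the value 4 is illustrative):

```ecl
// Restrict this query's execution to at most 4 of the cores that the
// Roxie process itself is bound to (bindCores is a workunit debug value).
#OPTION('bindCores', 4);
```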
LZ4 compression for temporary files
HPCC Systems 6.0.0 Beta 2 contains a new file compression algorithm, LZ4, which is an open-source, extremely fast, lossless compression algorithm with a fixed, byte-oriented encoding. More info is available from http://www.lz4.org.
LZ4 is enabled by default in Thor for temporary files, such as spills, replacing the RowDiff, LZW or FLZ methods. It should offer improved performance, especially for decompression, compared to the previous methods. The compression method used for spills can be overridden in ECL via an option parameter.
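As a sketch, the override would be set as a workunit option; the option name below is illustrative, so check the HPCC Systems documentation for the exact spelling on your release:

```ecl
// Assumed option name for illustration: fall back to the FLZ
// compressor for spill files instead of the default LZ4.
#OPTION('spillCompressorType', 'FLZ');
```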
Now let’s take a look at some of the additional work that has been done to make your use of HPCC Systems easier and more efficient:
Manage PackageMap files using multiple parts
Managing a packagemap on ROXIE has been improved to make the process not only easier, but also more manageable by a distributed team. While a single-file packagemap must be maintained and coordinated centrally, a multi-file packagemap allows its content to be aggregated from many smaller files. This new approach provides a number of benefits:
- The smaller package files can be added or removed individually.
- They can be organized locally.
- They can be managed by individual ECL developers or teams so they affect only those queries in their area of responsibility.
This simplifies the job for anyone tasked with managing packagemaps as well as making their use on a ROXIE that is shared by multiple developers or teams much more manageable.
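For example, parts could be managed with the ecl command line tool; the subcommand and file names below are illustrative, so check `ecl packagemap --help` on your release for the exact syntax:

```shell
# Publish a base packagemap, then let each team manage its own part.
ecl packagemap add roxie basepkg.pkg
ecl packagemap add-part roxie basepkg.pkg teamA-part.pkg
ecl packagemap remove-part roxie basepkg.pkg teamA-part.pkg
```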
Other usability improvements include the TRACE activity, init system improvements and security enhancements, which are discussed in my previous blog.
ECL IDE goes open source for the first time
From HPCC Systems 6.0.0, our very own ECL IDE will go open source for the first time! So now you can contribute to the sources for ECL IDE as well as the rest of the platform.
Finally, I want to mention some additional new features now available in HPCC Systems 6.0.0:
HPCC Visualization Framework
The ability to create visualizations from your data has existed in the HPCC Systems Platform since 5.x.x. From HPCC Systems 6.0.0, we have added a framework that makes it easier for you to place your visualization in a specific place on a web page. There is a framework equivalent for all the chart/graph types available in the platform, including bar, scatter, pie and histogram, but you can also use it for your own hand-coded visualizations.
The framework can also fetch and populate the data from an HPCC Systems Workunit or ROXIE query and what’s more, it’s really very easy to do this. The accompanying Dermatology page contains properties which you can use to try out different visualizations for your data.
We have some examples which demonstrate this which will be posted in a separate blog dedicated to the HPCC Visualization Framework. So watch this space!
RESTful ROXIE
RESTful ROXIE adds native ROXIE support for these additional REST access formats.
Currently, you can use these formats to access ROXIE, but only by going through a WsECL service. WsECL provides multiple ways to access ROXIE queries, including SOAP, JSON, HTTP-GET and Form-UrlEncoding. It also provides access control and built-in load balancing.
There is always some overhead in adding a mid-tier component like WsECL and there are times when an application calls for squeezing every bit of performance out of the system. Currently, if a client is developed using SOAP or JSON request formats, the WsECL middle tier can be completely removed and the given requests sent directly to ROXIE.
With RESTful ROXIE support, these additional REST request types, HTTP-GET and Form-UrlEncoded, will also be able to be sent to ROXIE directly.
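A direct HTTP-GET request might look like the following sketch; the host name, port, query name and parameters are all hypothetical, and the exact URL layout may differ on your release:

```shell
# HTTP-GET request sent straight to the ROXIE process, bypassing WsECL.
curl "http://myroxie:9876/roxie/myQuery?firstName=John&lastName=Doe"
```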
A new ECL keyword, CRITICAL, allows you to identify sections of ECL that will only be executed by one query at a time. Here’s a usage example:
// Create a new file containing keys not already in the current file.
IMPORT Std;

processInput(DATASET(rawRecord) inFile, STRING idFileName) := FUNCTION
    existingIds := DATASET(idFileName, taggedRecord, THOR);
    unmatchedKeys := JOIN(inFile, existingIds, LEFT.key = RIGHT.key, LEFT ONLY, LOOKUP);
    maxId := MAX(existingIds, id);
    // WORKUNIT makes the new file name unique to this workunit.
    newIdFileName := idFileName + WORKUNIT;
    newIds := PROJECT(unmatchedKeys,
                      TRANSFORM(taggedRecord, SELF.id := maxId + COUNTER; SELF := LEFT));
    tagged := JOIN(inFile, existingIds + newIds, LEFT.key = RIGHT.key,
                   TRANSFORM(taggedRecord, SELF.id := RIGHT.id, SELF := LEFT), LOOKUP);
    updateIds := OUTPUT(newIds,, newIdFileName);
    extendSuper := Std.File.AddSuperFile(idFileName, newIdFileName);
    // InitializeSuperFile is assumed to be defined elsewhere to create the
    // super file if it does not already exist.
    result := SEQUENTIAL(InitializeSuperFile, updateIds, extendSuper) : CRITICAL('critical_test');
    RETURN result;
END;
- ProcessInput needs to update a super file with new records from an incoming file.
- Each of the records from the new incoming file must be assigned a unique ‘id’. It does this by reading the super file and working out the largest ‘id’ number that’s currently in use (maxId).
- It then creates a new file with the newly generated ids and adds that file to the super file.
This works fine if there is at most one query carrying out a task like this. If more than one query tries to use the same super file as the basis for new ids, duplicate id numbers can be created:
- Query 1 reads the super file and works out that the last used id is, for example, 123944.
- Query 2, executing at the same time, reads the same super file and works out that the last used id is also 123944.
- Query 1 and Query 2 both generate new records starting at id 123945, which is a problem because each id must be unique.
The new CRITICAL keyword stops Query 1 and Query 2 executing the given section at the same time.
Security Manager plugin support
Using this new feature, software developers can create their own dynamic security manager plugin that conforms to the HPCC interface and make that plugin available to the HPCC Configuration Manager. Plugin developers must provide the security manager object as a dynamically loadable shared object built for the target platform, along with any dependencies and a configuration file that identifies any configuration parameters specific to that plugin. An HPCC Systems administrator can then select and configure a single security manager component to be deployed, loaded and invoked by the platform.
In HPCC Systems 6.0.0, we are providing the plugin framework. Any settings/information/values would be specific to the plugin. For example, an LDAP security plugin would require OU information and OpenLDAP libraries. An HTPASSWD plugin would require the HTPASSWD file and possibly an encryption library. A custom plugin provided by an ISV will be unique to that custom plugin.
Refresh Boolean option on persist
From HPCC Systems 6.0.0, the persist option is more flexible: HPCC Systems can read cached copies of persisted data without requiring that those copies be rebuilt whenever they are out of date.
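This is controlled by the REFRESH option on PERSIST. A minimal sketch, where CleanData, rawData and the persist name are hypothetical:

```ecl
// With REFRESH(FALSE), an existing persisted copy is reused even if it
// is out of date, rather than being rebuilt automatically.
cleaned := CleanData(rawData) : PERSIST('~temp::cleaned', REFRESH(FALSE));
```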
Apache Kafka plugin
We have added a new ECL plugin to access Apache Kafka, a publish-subscribe messaging system. ECL string data can be both published to and consumed from Apache Kafka brokers.
More information is available in this readme: https://github.com/hpcc-systems/HPCC-Platform/blob/master/plugins/kafka/README.md
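A minimal sketch of publishing and consuming, based on the module and method names in the plugin README; the topic name, broker address and message count are illustrative:

```ecl
IMPORT kafka;

// Publish a message to the 'test_topic' topic on a local broker.
p := kafka.KafkaPublisher('test_topic', 'localhost');
OUTPUT(p.PublishMessage('Hello from ECL'));

// Consume up to 1000 messages from the same topic.
c := kafka.KafkaConsumer('test_topic', 'localhost');
OUTPUT(c.GetMessages(1000));
```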
If you have any questions or comments about the features mentioned here, email Lorraine Chapman or post a comment in the developer forum. If you encounter any problems while trying out these features, comment in the relevant JIRA issue using our Community Issue Tracker.
- Download the latest HPCC Systems release and ECL IDE/Client Tools.
- Read the supporting documentation.
- Take a test drive with the HPCC Systems 6.0.x VM.
- Tell us what you think. Post on our Developer Forum.
- Let us know if you encountered problems using our Community Issue Tracker.
- Read a blog about the multicore capabilities of the HPCC Systems 6.0.x series.
- Read the first blog in this series.