You may have already downloaded the HPCC Systems 6.0.0 Beta version and read my earlier blog about the features included. So what's been happening on the HPCC Systems Open Source Project since then?
We've been working on lots more cool new features, as well as usability and performance enhancements, and your first chance to get your hands on them is coming soon in April. This release will include everything you have already seen and more, giving you the opportunity to try out all this new stuff in advance of the gold release targeted for later in the year.
As part of HPCC Systems 6.0.0, our very own ECL IDE will go open source for the first time! So now you can contribute to the sources for that as well as the rest of the platform.
Read on to find out what else is coming in HPCC Systems 6.0.0 that will enhance your HPCC Systems experience. JIRA issue numbers are included for additional reference.
You can expect HPCC Systems to perform even better and faster as a result of the following:
Optimized merge sort for large numbers of cores
This is a change to the underlying implementation used for sorting, but it requires no changes to your ECL code. The goal of the change is for sorts to execute faster and make more efficient use of multi-core processors. In other words, sorts will take less time and place a lower load on the system. It is particularly relevant for systems with large numbers of parallel threads of execution (e.g., the Power8 architecture), but also shows significant improvements for Intel machines. Here are some numbers which show the improved timings for sorting the rows in memory:
|Intel Xeon / 16 cores|
|qsort||New merge sort|
|Power 8 / 160 execution threads|
|qsort||New merge sort|
Note: The unstable quick sort used in this example is quicker than the sort previously used in the HPCC Systems 5.x.x series.
The new merge sort makes a significant difference, especially when very large numbers of cores are available. The Power8 runs faster on the large sorts than the Xeon despite its lower clock rate.
Factors other than the time taken to sort in memory also affect sorting performance. We are introducing improvements in disk reading and parallel execution in HPCC Systems 6.0.0 (see below) and will continue to do so in future releases.
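Because the change is internal to the platform, an ordinary SORT like the one sketched here should benefit without modification. The record layout and inline data are made up purely for illustration.

```ecl
// Hypothetical record layout and inline data, for illustration only.
PersonRec := RECORD
    UNSIGNED4 id;
    STRING20  surname;
END;

people := DATASET([{3, 'Smith'}, {1, 'Jones'}, {2, 'Adams'}], PersonRec);

// A plain SORT with no algorithm hints: the platform decides which
// sort implementation to use, so existing code stays as it is.
sortedPeople := SORT(people, surname);

OUTPUT(sortedPeople);
```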
Parallel Activity Execution
There are many examples where ECL operations are CPU bound, but they could execute more quickly if they were executed on multiple CPU cores in parallel. Examples include complicated PROJECTs, parsing text and joining records. In HPCC Systems 6.0.0, the engines are being refactored to make it much easier to allow activities and sections of the graph process to be executed in parallel. This performance enhancement will be available for both Roxie and Thor.
Affinity support in Thor
If you are running more than one Thor slave process on a machine with multiple CPU sockets, you're going to like this improvement. Binding each process to a single socket improves overall performance by reducing inter-socket overheads (especially those related to the L3 cache). If the Thor nodes execute slave processes for different Thor instances, this change also allows you to isolate the processes and ensure that one process cannot dominate the others.
Don't forget to check back and read about other performance enhancements outlined in the previous HPCC Systems 6.0.0 blog, including details about the Virtual Slave Thor and DALI replacement for workunit store.
LZ4 compression for temporary files
HPCC Systems 6.0.0 Beta 2 contains a new file compression algorithm called LZ4, which is an open-source, extremely fast, lossless compression algorithm with a fixed, byte-oriented encoding. More info is available from http://www.lz4.org.
LZ4 is enabled by default in Thor for temporary files, such as spills, replacing the RowDiff, LZW or FLZ methods. It should offer improved performance, especially with decompression, compared to previous methods. The compression method for spills can be overridden in ECL with the option parameter, as follows:
Let's take a look at some of the additional work that has been done to make your use of HPCC Systems easier and more efficient:
Ability to merge multiple package files
Managing a packagemap on ROXIE has been improved, making the process both easier and more manageable for a distributed team. While a single package file represents a packagemap which must be maintained and coordinated centrally, a multi-file packagemap allows its content to be aggregated from many smaller files. This new approach provides a number of benefits:
- The smaller package files can be added or removed individually.
- They can be organized locally.
- They can be managed by individual ECL developers or teams so they affect only those queries in their area of responsibility.
This simplifies the job for anyone tasked with managing packagemaps as well as making their use on a ROXIE that is shared by multiple developers or teams much more manageable.
These improvements build on those already available in the HPCC Systems Beta version, including the TRACE activity, Init system improvements and security enhancements, which are discussed in my previous blog.
Finally, I want to mention some new features that have been developed since the first Beta version was released:
HPCC Visualization Framework
The ability to create visualizations from your data has existed in the HPCC Systems Platform since 5.x.x. From HPCC Systems 6.0.0, we have added a framework that makes it easier for you to place your visualization into a specific place on a web page. There is a framework equivalent for all the chart/graph types that are available in the platform, including bar, scatter, pie and histogram, and you can also use it for your own hand-coded visualizations.
The framework can also fetch and populate the data from an HPCC Systems Workunit or ROXIE query and what's more, it's really very easy to do this. The accompanying Dermatology page contains properties which you can use to try out different visualizations for your data.
We have some examples demonstrating this, which will be posted in a separate blog dedicated to the HPCC Visualization Framework. So watch this space!
RESTful ROXIE adds native ROXIE support for two additional REST access formats: HTTP-GET and Form-UrlEncoded.
Currently, you can use these formats to access ROXIE, but only by going through a WsECL service. WsECL provides multiple ways to access ROXIE queries, including SOAP, JSON, HTTP-GET and Form-UrlEncoding. It also provides access control and built-in load balancing.
There is always some overhead in adding a mid-tier component like WsECL and there are times when an application calls for squeezing every bit of performance out of the system. Currently, if a client is developed using SOAP or JSON request formats, the WsECL middle tier can be completely removed and the given requests sent directly to ROXIE.
With RESTful ROXIE support, these additional REST request types, HTTP-GET and Form-UrlEncoded, can also be sent to ROXIE directly.
A new ECL keyword CRITICAL allows you to identify sections of ECL that will only be executed by one query at a time. Here's a usage example:
// Create new file with keys not already in current file.
processInput(DATASET(rawRecord) inFile, STRING outFileName) := FUNCTION
    // Ids already assigned, read from the super file.
    existingIds := DATASET(idFileName, taggedRecord, THOR);
    // Incoming records whose key has no id yet.
    unmatchedKeys := JOIN(inFile, existingIds, LEFT.key = RIGHT.key, LEFT ONLY, LOOKUP);
    maxId := MAX(existingIds, id);
    newIdFileName := idFileName + WORKUNIT;
    // Number the new keys upwards from the largest id currently in use.
    newIds := PROJECT(unmatchedKeys, TRANSFORM(taggedRecord, SELF.id := maxId + COUNTER; SELF := LEFT));
    tagged := JOIN(inFile, existingIds + newIds, LEFT.key = RIGHT.key,
                   TRANSFORM(taggedRecord, SELF.id := RIGHT.id, SELF := LEFT), LOOKUP);
    updateIds := OUTPUT(newIds,,newIdFileName);
    extendSuper := Std.File.AddSuperFile(idFileName, newIdFileName);
    // Only one query at a time may run this sequence.
    result := SEQUENTIAL(InitializeSuperFile, updateIds, extendSuper) : CRITICAL('critical_test');
    RETURN result;
END;
- ProcessInput needs to update a super file with new records from an incoming file.
- Each of the records from the new incoming file must be assigned a unique 'id'. It does this by reading the super file and working out the largest 'id' number that's currently in use (maxId).
- It then creates a new file with the newly generated ids and adds that file to the super file.
This works fine if there is at most one query carrying out a task like this. If more than one query tries to use the same super file as the basis for new ids, then a duplicate 'id' number will be created:
- Query 1 reads the superfile and works out that the last used id is, for example, 123944.
- Query 2 executes at the same time, reads the same superfile and works out that the last used id is also 123944.
- Query 1 and Query 2 both generate new records with id 123944, which is a problem because each id must be unique. The new CRITICAL keyword stops both Query 1 and Query 2 executing the given section at the same time.
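The processInput example refers to several names that are not shown in this post (idFileName, rawRecord, taggedRecord and InitializeSuperFile). As a rough sketch only, they might be defined along the following lines. The file name, field types and initialization logic here are all assumptions made for illustration, not the definitions used in the original code.

```ecl
IMPORT Std;

// Hypothetical name of the super file holding every id assigned so far.
idFileName := '~example::critical::ids';

// Hypothetical layout of an incoming record.
rawRecord := RECORD
    UNSIGNED8 key;
END;

// The same record with its assigned id appended.
taggedRecord := RECORD(rawRecord)
    UNSIGNED8 id;
END;

// Assumed initialization step: create the super file on first use
// so that later steps have something to read from and add to.
InitializeSuperFile := IF(NOT Std.File.SuperFileExists(idFileName),
                          Std.File.CreateSuperFile(idFileName));
```

A real job may need more than this, for example seeding the super file with an initial sub file before it is first read.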
Security Manager plugin support
Using this new feature, software developers can create their own dynamic security manager plugin that conforms to the HPCC interface and make that plugin available to the HPCC Configuration Manager. Security Manager plugin developers must provide the Security Manager object as a dynamically loadable shared object, built for the target platform, along with any dependencies and a configuration file that identifies any configuration parameters specific to that plugin. An HPCC Systems administrator can select and configure a single security manager component to be deployed, loaded and invoked by the platform.
In HPCC Systems 6.0.0, we are providing the plugin framework. Any settings/information/values would be specific to the plugin. For example, an LDAP security plugin would require OU information and OpenLDAP libraries. An HTPASSWD plugin would require the HTPASSWD file and possibly an encryption library. A custom plugin provided by an ISV will be unique to that custom plugin.
Refresh Boolean option on persist
From HPCC Systems 6.0.0 Beta 2, there will be increased flexibility in the way the persist option works. HPCC Systems can read cached copies of data without requiring that these copies are rebuilt whenever they are out of date.
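As a minimal sketch of how this might look in ECL, the example below attaches a persist with the REFRESH option set to FALSE. The record layout, logical file name and persist name are hypothetical, and the exact behaviour of the option should be checked against the ECL Language Reference for your release.

```ecl
// Hypothetical layout and file names, for illustration only.
CustomerRec := RECORD
    UNSIGNED4 id;
    STRING30  name;
END;

customers := DATASET('~example::customers', CustomerRec, THOR);

// REFRESH(FALSE) asks the platform to reuse the existing persisted copy
// rather than rebuild it when it is out of date.
sortedCustomers := SORT(customers, name)
    : PERSIST('~example::persist::sorted_customers', REFRESH(FALSE));

OUTPUT(sortedCustomers);
```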
We have added a new ECL plugin to access Apache Kafka, a publish-subscribe messaging system. ECL string data can be both published to and consumed from Apache Kafka brokers.
More information is available in this readme: https://github.com/hpcc-systems/HPCC-Platform/blob/master/plugins/kafka/README.md
HPCC Systems 6.0.0 Beta 2 will be available on the downloads page in March. If you have any questions or comments about the features mentioned here or in the previous blog, email Lorraine Chapman or post a comment in the developer forum. If you encounter any problems while trying out these features, comment in the relevant JIRA issue using our Community Issue Tracker.