HPCC Systems Version 9.x Nifty Features and Highlights

The latest release from the HPCC Systems Platform team comes with many noteworthy improvements to both the containerized cloud native and bare metal versions, as well as an upgraded user interface in ECL Watch 9.0 and later. This blog highlights the newest features and significant changes that come with the 9.0.x, 9.2.x, and 9.4.x releases. 

Please note: It is strongly recommended to run on the latest version of the platform for optimal performance and functionality.  

If you have not explored the Cloud Native Version yet, please use the following resources to get started. 

Below you will find the most noteworthy changes and upgrades. For the full list of changes, see the HPCC Systems 9.x Release notes and visit the HPCC Systems Red Book which contains useful information to help you manage the transition between releases. 

What’s New?…Or…Why upgrade?

The question really is not about what is new with the HPCC Systems platform, but rather why you should upgrade. You may think that because your system runs properly there is no reason to upgrade, but you would be missing out on the advantages of running on the latest and greatest version HPCC Systems has to offer. 

This past October, during the 2023 HPCC Systems Community Summit Welcome and Plenary session, Gavin Halliday, SVP and Head of Platform Engineering at LexisNexis Risk Solutions, helped kick off the conference by introducing the community to all the new features and improvements that come with the latest build. 

Gavin emphasized the importance of making sure your system is always upgraded to the latest version; doing so is key to taking advantage of all the features the platform offers. 

It is worth upgrading… In the Cloud 

Optimizing HPCC Systems in the Cloud has been a prime focus for the platform team over the past year, so much so that several breakout sessions at this year’s HPCC Systems Community Summit were dedicated to running more efficiently in the cloud. Since the year began, even more internal systems have been moved to the cloud, and the platform team has overcome many challenges in refining the platform to run at optimal performance. Below are some highlights of those efforts: 

  • Network Keep Alive – There have been multiple improvements to the implementation of the index node cache to improve performance and reduce contention. To prevent idle connections from being closed, set the tcp_keepalive option
  • Memory Limits – The strict memory limits imposed by Kubernetes highlighted a problem that had been in the system for many years. If all the data for a sort cannot be held in memory, there is a stage in the sort process where it reads from a large number of compressed files at the same time, and that can cause excessive memory consumption. 
    On a bare-metal system, there is often enough spare memory to get away with it, but Kubernetes strictly enforces the limit and terminates the worker pod when it uses too much. All the user sees is jobs failing for no clear reason. 
    Once diagnosed, the problem was relatively simple to fix, but diagnosing it leads on to the next item
  • Error Reporting – The option to save and serialize query DLLs in Kubernetes has been added 
  • Logical files are now automatically compressed in cloud deployments (see the sketch following this list) 
  • Thor Graph and Subgraph Delays – Thor subgraph stats have been extended to include CPU and memory usage. To help determine the best instance types for Thor subgraphs to run on, various stats are gathered and reported. These include: 
    • System and user time 
    • Heap and row memory (peak and current) 
    • Number of context switches 

This also optionally adds similar stats to Roxie graphs. 
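
For reference, here is a minimal ECL sketch of the explicit per-file compression that was previously needed (and can still be requested); on cloud deployments the logical file is now compressed automatically without the COMPRESSED option. The dataset and logical file name below are purely illustrative.

    // Illustrative inline dataset
    PersonRec := RECORD
        STRING25 firstname;
        STRING25 lastname;
    END;

    people := DATASET([{'Jane', 'Doe'}, {'John', 'Smith'}], PersonRec);

    // Explicitly request a compressed logical file; on 9.x cloud
    // deployments this compression now happens automatically
    OUTPUT(people, , '~demo::people::compressed', COMPRESSED, OVERWRITE);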

It is worth upgrading… On Bare Metal 

Please note: Many of the cloud upgrades above also benefit bare metal, and all of the upgrades listed below are also supported in the Cloud Native Version.

If you are a bare-metal user and are curious whether there is a cloud solution for you, please see the Get Started in the Cloud Wiki Page.

  • Parquet format files – The Parquet plugin allows you to read and write Parquet files directly from your ECL code (see the sketch after this list)
  • Improved dynamic ESDL – Over the last few years, there have been many improvements to ESDL script, which allows you to define services in ESP. Over the last year, the improvements include support for masking, tracing, and error handling, plus detailed documentation to make sure you can get the most out of it. The aim is to provide you with all the functionality you need without having to code anything
  • Integrated support for NLP++ – In previous versions of the platform, if you wanted to develop and use your own NLP++ analyzers with HPCC Systems, you needed to deploy those analyzers in the HPCC Systems plugins directory. With this new feature, that manual process is no longer required. There is now support for embedding knowledge bases into ECL workunits via manifests so the NLP++ plugin engine can use them 
  • Open Telemetry support – Often services presented to a customer are implemented internally as multiple services or microservices. Very quickly you will discover systems where ESPs call Roxie, which calls out to other ESPs or third-party services. If there is a problem with the query, it can be hard to tie all these calls together to locate the problem. That is the problem OpenTelemetry helps to solve by passing trace and span IDs between the services. In 9.4 especially, the platform creates and preserves those trace and span IDs so the complete picture for a query can be reconstructed, and the plan is to provide more OpenTelemetry features in the future. Instrumentation has been added for hThor, Thor, and Roxie 
  • Spray direct from zip files – Everything needed for the contents of zip files to be sprayed directly has been there for years. All it took was the right question being asked for it to be tested and quickly implemented to solve some issues
  • HPCC Remote Trust via shared cert authority – Copying client certificates around to establish mTLS between clusters is complicated. This provides a much easier trust model that establishes zones of trust between environments using a shared certificate authority
  • Support has been added for bare-metal systems to talk to cloud systems by allowing the client certificates to be configured
  • New ECL Watch – The newest upgrade to ECL Watch brings significant changes, not only to the overall look, but also in the form of several new options
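
Here is a minimal sketch of what reading and writing Parquet from ECL can look like with the Parquet plugin. The record layout and file paths are hypothetical, and the ParquetIO.Read/ParquetIO.Write calls reflect our understanding of the plugin interface; check the plugin documentation for your version for the exact module path and signatures.

    IMPORT Parquet;

    // Hypothetical layout matching the columns in the Parquet file
    PersonRec := RECORD
        STRING25  firstname;
        STRING25  lastname;
        UNSIGNED4 age;
    END;

    // Read a Parquet file directly into an ECL dataset (path is illustrative)
    people := ParquetIO.Read(PersonRec, '/var/lib/HPCCSystems/mydropzone/people.parquet');

    // Ordinary ECL processing on the result
    adults := people(age >= 18);

    // Write the result back out as Parquet
    ParquetIO.Write(adults, '/var/lib/HPCCSystems/mydropzone/adults.parquet');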

It is worth upgrading… Performance investigation 

Great emphasis has been given to creating tools to investigate and remedy issues involving performance.  Below are a few examples of these changes in Roxie and Thor. 

Roxie 

The Rapid Online XML Indexing Engine, or more simply Roxie, is the data delivery engine that provides the user with their desired data after it has already been hammered into shape by Thor.

The latest version brings important enhancements that enable the Roxie user to obtain more information about query performance: 

  • More Detailed Summary Statistics – There is now an additional option for Roxie test pages to generate summary stats. These stats then appear on the Roxie query ECL Watch page and, at a minimum, make it easy to run a query and view the stats for that run   
  • Detailed query statistics – Please view this breakout session for more information about the new detailed query statistics
  • The ability to request a flame graph for a single query run has been added 
  • Roxie now provides stats for the socket connection, to help track down and reduce the large gaps between the query finishing and the complete response being displayed 

Now for Thor 

Not necessarily the Norse God of Thunder, but the idea remains the same: Thor is the cluster that hammers the data before Roxie delivers it. The following are the most significant additions to the upgraded version of Thor in terms of information about job performance. 

  • Tracking spill file usage – The Thor deployments on the cloud use servers with local NVMe drives to provide quick storage for spill files, but that storage is bound to a fixed size. Determining how much space a cluster needs to run its Thor jobs was very difficult before now. Thor tracks the peak spill usage throughout each graph, so you can optimize the number of Thor workers you place on a single server 
  • Warnings about pod skew or rows that are too large – The way Thor worker pods are distributed between servers is also important. If you have 19 pods on one server and one on another, the bandwidth to the remote storage from the first server is likely to become saturated, but skew in the pod distribution can be very hard to diagnose. Thor now provides that information and generates a warning to help you spot problems early
  • Stats for index operations and aborted or failed jobs – Other improvements came about as the performance of particular queries was investigated; extra information was often added to help diagnose the causes. For instance, new index caching stats have been added to Thor 
  • Improved the option to see flame graphs on Thor graphs 
  • The check_executes script is updated to capture the kernel ring buffer on abnormal process exits. This will capture information from the OOM event if the process was killed by the OOM killer
  • The pod-name detection code has been fixed so that the format of worker pod names no longer changes and the pods are evenly distributed across the nodes
  • Improved KeyedJoin index validation errors to better identify errors by including the filename 

Highlighting System Problems to the User

The way that problems are reported to the user is being improved. Whether it is poor configuration, Kubernetes not distributing pods evenly, disks becoming full, or other system errors, the goal is to make them as visible as possible and highlight them within ECL Watch, so they can be spotted and resolved more quickly. 

Cloud Logging Support 

Good logging is often essential for tracking down problems with cloud deployments. You cannot rely on logging to local disks. The platform team is continuing to improve the integration with Log Analytics and Grafana to simplify that task. 

It is worth upgrading… Performance Boost 

You should also upgrade if you want to boost the performance of your Thor jobs or Roxie queries. 

  • Thor Cloud Improvements – Cloud deployments have a clear connection between the time a job takes and the cost to the company. Plenty of improvements were made to Thor to reduce cloud costs, and they also boost performance on bare metal. 
  • Dali optimizations – The same goes for some significant changes to boost the performance of Dali in 9.4; for instance, the cost of Dali transactions has been reduced by batching writes and performing them asynchronously. 
  • Support for specifying remote storage instead of Dali – When publishing a package map or query in the cloud, you cannot directly access a remote Dali. This option allows you to use a configured remote storage specification instead. 
  • IBYTI (I Beat You To It) in bare-metal Roxie – In the newest version, the mechanism that bare-metal Roxies use to determine which agent should process a request has been improved. It has the effect of reducing the overhead of waiting for IBYTI notifications and increases the number of worker threads available to perform useful work. 
  • New Index Format – This contributes to smaller indexes and faster access (a minimal sketch follows the note below). More information on the new index format can be found by watching the breakout session from the 2023 HPCC Systems Community Day. To maintain backward compatibility of generated indexes with older platforms, the new format is only generated when you add one of the new options for the COMPRESSED() attribute to the INDEX definition or the BUILD action. 

Please note: Indexes using the new format can only be read by systems running 9.0.20 or greater.
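
As a rough sketch of opting in to the new format, the example below supplies a COMPRESSED() option on the index definition. The dataset, layout, and logical file names are illustrative, and the 'inplace' option name is an assumption; check the release notes for the exact option names supported on your version.

    // Illustrative base dataset with a virtual file-position field
    PersonRec := RECORD
        STRING25  lastname;
        STRING25  firstname;
        UNSIGNED8 recpos {VIRTUAL(fileposition)};
    END;

    people := DATASET('~demo::people', PersonRec, THOR);

    // The new index format is only produced when one of the new
    // COMPRESSED() options is supplied ('inplace' is assumed here)
    byName := INDEX(people, {lastname, firstname}, {recpos},
                    '~demo::people::byname', COMPRESSED('inplace'));

    // Remember: indexes built with the new format can only be read
    // by systems running 9.0.20 or later
    BUILD(byName, OVERWRITE);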

There are also some more specialized improvements, such as: 

  • SOAPCALL Missing Servers – Makes sure that unresponsive servers are spotted as quickly as possible.  
  • ONCE for Roxie Libraries – Ensuring a clean start-up for Roxie when ONCE is present (see the sketch after this list). 
  • Better access to Expert Options – There is now a consistent method for controlling all the performance options. 
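
For context, ONCE is the ECL workflow service that evaluates a definition when a Roxie query is loaded rather than on every request; the sketch below is a hypothetical example of the kind of construct affected (the logical file name and layout are illustrative).

    CodeRec := RECORD
        STRING4  code;
        STRING40 description;
    END;

    // Evaluated once, when the Roxie query is loaded, not per request
    codeLookup := DATASET('~demo::reference::codes', CodeRec, FLAT) : ONCE;

    // Per-request query parameter
    STRING4 searchCode := '' : STORED('searchCode');

    OUTPUT(codeLookup(code = searchCode));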

It is worth upgrading… Security 

This section is more about asking yourself the following questions about your security options.  

  • Landing Zone Security – Do you need to make sure your deployment is secure? Are you worried about exfiltration of data from your systems? 
  • Egress rules – Do you need to control which users can import and export data? Do you need to ensure that data can only be exported to a restricted set of landing zones, or are you happy for your data to be copied anywhere? Do you need to restrict the services and IP addresses that a query is allowed to call out to? 
  • OpenSSL Vulnerabilities – Do you need to make sure that vulnerabilities in libraries like OpenSSL have been patched? 
  • Remote Clusters – Do you need to be able to securely read data from another cluster? 
  • Vault namespace, certificate manager, mTLS – And finally, do you want to be able to manage all the certificates and keys needed to secure a system without having a nervous breakdown? 

If you answered yes to any of these questions, then take advantage of these security upgrades by making sure you are running on the latest version of the platform. 

It is worth upgrading… Productivity 

It is not just about security, but also about valuing your time so as to not waste it chasing down problems that have already been solved. It is always frustrating when someone reports a bug that was already reported and fixed many months ago; all they needed to do was upgrade to a more recent version. 

Even if you do not upgrade to the latest major or minor version, make sure you are on the latest point release of the version that is supported. 

  • Many bug fixes – Please see the release notes section below for more detailed information about recent bug fixes 
  • Workarounds for problems elsewhere – Running on the latest version ensures the most optimal experience when using HPCC Systems. For instance: 
    • Make sure the storage is mounted before the pods are started. 
    • You may find problems if you are connecting to old LDAP instances, because they may not support the ciphers used by modern versions of OpenSSL. If so, you will need the options in the new versions to control those ciphers so that you can connect securely. 
  • Other new features 
    • Detailed statistics for indexes – The file details for an index now include information about how much memory those indexes require. 
    • File copy when publishing Roxie packages – When you publish a Roxie package, you have the option to use DFU workunits to update all the files. That makes it easy to track the progress of the copy and to guarantee all the files are available before the Roxie goes live 
    • HTTPCALL form URL encoding – A user needed to consume the Amazon Cognito service from within their HPCC query, but the Cognito service only supports form URL encoding. HTTPCALL was extended to support that encoding. 
    • Automatic SOAPCALL secret credentials – You are now able to switch to using secrets for HTTPCALL/SOAPCALL out to external services without rewriting your ECL code (see the sketch after this list). 
    • Support HashiCorp Vault auth using client certificates – There are many ways of authenticating with HashiCorp Vault, and managing them across many clusters can be challenging. There is now the capability to automatically generate Vault client certificates using cert-manager and use them to authenticate to vaults. 
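
To illustrate the automatic secret credentials item above: the ECL for an external SOAPCALL stays exactly as it always was, and the credentials for the call are supplied from a platform-managed secret rather than being embedded in the code. The URL, service name, and record layouts below are hypothetical.

    InRec := RECORD
        STRING25 lastname{XPATH('LastName')} := 'SMITH';
    END;

    OutRec := RECORD
        STRING25 firstname{XPATH('FirstName')};
        STRING25 lastname{XPATH('LastName')};
    END;

    // Ordinary SOAPCALL to an external service; with this feature the
    // credentials come from a secret configured on the platform side,
    // so the ECL itself does not need to change
    results := SOAPCALL('https://services.example.com/ws/PersonSearch',
                        'PersonSearch',
                        InRec,
                        DATASET(OutRec));

    OUTPUT(results);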

For more information on version 9.x and a complete list of the new features and upgrades as well as bug fixes, please view the HPCC Systems 9.x Release notes and also visit the HPCC Systems Red Book, which contains tons of HPCC Systems knowledge.