Persisting data in an HPCC Systems Cloud native environment

If you have not yet experimented with our new cloud native platform, you might like to take a look at the previous blogs in our HPCC Systems Cloud series.

These blogs are a good starting point for understanding the new containerized version of the HPCC Systems platform, but neither describes how to persist your data.

In this blog, Gavin Halliday (Enterprise/Lead Architect, LexisNexis Risk Solutions) walks through how to persist your data. He also provides a temporary solution you can use to get your data into the system, until a permanent solution becomes available.

So how do you persist the data?

If you are using the standard Helm charts, your data and queries will disappear when a release is uninstalled.  If you are bringing up a system to test or experiment with, that may well be a feature, but if you want to do any real work it is not very useful.

Kubernetes uses persistent volume claims (pvcs) to provide access to data storage.  By default the hpcc Helm charts use pvcs that have the same lifetime as the cluster: a data volume is automatically created when the cluster is installed, and the volumes are freed up when the cluster is uninstalled.
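You can inspect the claims a running deployment has created using kubectl (the claim name below is purely illustrative):

```shell
# List the persistent volume claims in the current namespace,
# showing their status, capacity and backing storage class
kubectl get pvc

# Show the details of a single claim, e.g. its volume and access modes
kubectl describe pvc data-mycluster-hpcc-pvc
```

This is a useful sanity check at each stage: if a chart is expected to have created storage, the claims should show up here with a "Bound" status.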

If you want your data to persist for longer than the hpcc chart you need to use pvcs that have a longer lifetime (or to use a different mechanism for storing data e.g. S3 buckets/Azure blobs).  Storing the hpcc data in blob storage is currently under development, but you can use pvcs today.

Let’s walk through the most recent changes, included in the HPCC Systems 7.8.x series, that make it easy for you to persist the data.

Helm file changes

The hpcc Helm charts have been modified to make it easier to add support for persisting data.  (These changes took place after Richard and Jake’s blogs were written; those blogs have now been updated to reflect the new structure.)  The information about storage has been restructured in the following way:

  • global.dataStorage has moved to storage.dataStorage
  • global.dllServer has moved to storage.dllStorage
  • dali[n].storage has moved to storage.daliStorage

The different storage options are now specified consistently.  More importantly, all the values can be overridden with a --set or --values option on a helm install command.
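For example, the data storage settings could be overridden directly on the command line like this (the claim name "my-data-pvc" is a hypothetical pre-existing pvc, used purely for illustration):

```shell
# Point the hpcc chart's data storage at an existing persistent volume claim,
# overriding the default value in the chart's values.yaml
helm install mycluster hpcc/ --set storage.dataStorage.existingClaim=my-data-pvc
```

The same settings can equally be collected into a yaml file and supplied with --values, which is the approach used in the examples later in this blog.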

Example Helm Charts

The Helm charts have also moved.  The master copies of the Helm charts have moved from the dockerfiles directory to a new helm directory.  When a release is tagged as RC or Gold, the Helm files are published to the Helm chart repo (https://github.com/hpcc-systems/helm-chart) as tgz files.  When the most recent stable version is tagged gold, the Helm file sources are also published.

We have added some example charts to the new helm directory (and Helm chart repo) to simplify persisting data in different environments.  When these charts are installed they create persistent volume claims (pvcs) which can then be used by the hpcc charts when they are installed.  You can download the charts from the Helm chart repo with the following command:

 git clone git@github.com:hpcc-systems/helm-chart.git

and then change to the directory containing the sources for the hpcc Helm charts:

 cd helm-chart/helm

Persisting data for a cluster running on a local machine

Let’s walk through a couple of common ways to run Kubernetes on a local machine: Docker Desktop and minikube.

Docker desktop:

This works particularly well with macOS and Windows (especially with the new WSL 2 support in Windows 10, version 2004).

If you have a Windows machine and you want data persisted to the c:\hpccdata directory, the first step is to make sure the relevant directories exist:

mkdir c:\hpccdata
mkdir c:\hpccdata\dlls
mkdir c:\hpccdata\data
mkdir c:\hpccdata\dali

Next, install the Helm chart from the examples/local directory, which creates persistent volumes based on host directories.  The --set option is used to specify the base directory (the path /run/desktop/mnt/host/ provides access to the host file system for WSL 2).

helm install localfile examples/local/hpcc-localfile --set common.hostpath=/run/desktop/mnt/host/c/hpccdata

Finally, install the hpcc chart, supplying a yaml file with storage information that uses the pvcs created in the previous step.  The example directory contains a sample yaml file that can be used in this case:

helm install mycluster hpcc/ --set global.image.version=<version> -f examples/local/values-localfile.yaml

The values from examples/local/values-localfile.yaml override the corresponding entries in the original hpcc Helm chart.  This allows the same base configuration to be used in different environments, by only updating the storage information.  The yaml file itself is fairly simple and can easily be adapted to different release names:

storage:
  dllStorage:
    existingClaim: "dll-localfile-hpcc-localfile-pvc"
    forcePermissions: true

  daliStorage:
    existingClaim: "dali-localfile-hpcc-localfile-pvc"
    forcePermissions: true

  dataStorage:
    existingClaim: "data-localfile-hpcc-localfile-pvc"
    forcePermissions: true

With these charts installed, any data created while the system is running will be stored in the c:\hpccdata directory, and will persist after the hpcc and file charts are uninstalled.
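One way to check that the data really does survive is to uninstall both releases and then look at the host directories (a sketch, assuming the release names used above):

```shell
# Remove the hpcc release, then the chart that owns the pvcs
helm uninstall mycluster
helm uninstall localfile

# The host directories should still contain the dali metadata,
# compiled query dlls and data files
dir c:\hpccdata\dali
dir c:\hpccdata\dlls
dir c:\hpccdata\data
```

Reinstalling the two charts in the same order, with the same release names, picks the data straight back up.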

The README.md and NOTES.txt files within the examples/local directory provide more information on using and configuring the chart.

Minikube

If you are running on a Linux machine then you are likely to be using minikube.  (Installation instructions can be found here: https://kubernetes.io/docs/tasks/tools/install-minikube/.)  The process is very similar to Docker Desktop, but this time the path you want to store the data in is /home/<username>/hpccdata (~/hpccdata).  First make sure the directories exist:

mkdir ~/hpccdata
mkdir ~/hpccdata/dlls
mkdir ~/hpccdata/data
mkdir ~/hpccdata/dali

The next step is specific to minikube.  Minikube runs Kubernetes in a virtual machine, so the directories from the host need to be mounted into that virtual machine before containers running in the minikube environment can access them.  So you need to run the minikube mount command (10001 and 10000 are the gid and uid of the ‘hpcc’ user inside the containers):

minikube mount /home/<username>/hpccdata:/mnt/hpccdata --gid=10001 --uid=10000

The ~/hpccdata directory is now mounted as /mnt/hpccdata within the minikube VM.  The minikube mount command must stay running for the duration of the later stages.
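You can confirm the mount is visible inside the VM with minikube ssh, run from a second terminal (since the mount command keeps running in the first):

```shell
# List the mounted directory from inside the minikube VM;
# you should see the dlls, data and dali subdirectories created earlier
minikube ssh -- ls -l /mnt/hpccdata
```

If the subdirectories do not appear here, the hpcc pods will not be able to see them either, so this is worth checking before installing the charts.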

Next, install the chart that creates persistent volume claims (pvcs), using the mount path in the minikube VM:

helm install localfile examples/local/hpcc-localfile --set common.hostpath=/mnt/hpccdata

Finally install the hpcc chart, overriding the storage information to use the pvcs:

helm install mycluster hpcc/ --set global.image.version=<version> -f examples/local/values-localfile.yaml

Other environments

The examples currently include the following Helm charts (more are planned):

  • local – Docker Desktop for macOS/Windows, or minikube
  • azure – Persistent volume claims that use azurefile storage.
    Also contains an example values.yaml file for specifying non-persistent azurefile storage.
  • nfs   – For deploying to AWS or Google Cloud

Simply follow the same pattern to use each of them:

  1. Install a chart to create the persistent volume claims:
helm install <file-release-name> examples/<dir>/<chart> --set storage.x.y=values
  2. Install the chart for the main hpcc system, supplying a values file to override the storage definitions.  The names of the pvcs created when the storage chart is installed depend on the release name that is provided.  The values file must use those generated names:
helm install <hpcc-release-name> hpcc --values storage-override.yaml

The data persists for as long as the storage chart is installed.  You can uninstall the hpcc release, reconfigure it, or launch multiple releases, and the workunits and data are preserved.
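For example, a typical reconfigure-and-redeploy cycle might look like this (release names as in the earlier local examples):

```shell
# Uninstall the hpcc release; the storage chart and its pvcs stay installed
helm uninstall mycluster

# Reinstall with an updated configuration; the existing pvcs are reused,
# so all previous workunits and data are still available
helm install mycluster hpcc/ -f examples/local/values-localfile.yaml
```

The key point is that the storage chart release ("localfile" here) is never uninstalled between cycles, so the pvcs and the data behind them remain in place.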

For the local file examples, the data is also preserved after the pvcs are uninstalled.  For some of the other examples (e.g. Azure), the volumes use a retain policy, which means the pvs remain after the chart is uninstalled, with a “Released” status.
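You can see this behaviour in the status column of the persistent volumes (the reclaim policy and status columns are the ones to look at):

```shell
# After uninstalling the storage chart, volumes with a Retain reclaim
# policy are not deleted - they show a "Released" status instead
kubectl get pv
```

A "Released" volume still holds its data, but it needs manual intervention (or a new claim bound to it) before it can be reused.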

How do I get data into the system?

That still leaves the question of how you can import data into the hpcc system.  This is still very much a work in progress.  However, there is a short term solution if you are running a system on a local machine.

  1. Create a subdirectory within your data directory. For example:
 mkdir ~/hpccdata/data/import
  2. Copy the files you want to upload to the system into that directory:
 cp example.csv ~/hpccdata/data/import
  3. Use the ECL file::ip::filename syntax to read the files directly from the import directory. For example:
nameRec := RECORD
    STRING  firstname;
    STRING  lastname;
    STRING  dob;
END;

allPeople := DATASET('~file::localhost::var::lib::^H^P^C^C^Systems::hpcc-data::import::example.csv', nameRec, csv);

output(allPeople);

It’s not pretty, and the filename is ugly, but it works!  Needless to say, we are working on a longer term solution, including an easy way to read files from blob/bucket storage.

Expect more changes over the next few weeks….

Additional Notes

  • The newer versions of minikube have a --driver=none mode which aims to avoid the virtual machine.  That may avoid the need to mount the host volume, but it has not been recently tested with the hpcc charts.
  • The step of having to create the nested subdirectories may be removed in a future version.
  • Yes, that filename syntax really is horrible.
  • Development of our Cloud native platform is ongoing. It is not recommended to use it in production environments but we do want your feedback. Please take it for a test drive and use our Community Issue Tracker to let us know about any issues you find.