Persisting data in an HPCC Systems Cloud native environment
If you have not yet experimented with our new cloud native platform, you might like to take a look at the previous blogs in our HPCC Systems Cloud series:
- Richard Chapman (VP & Head of Research & Development, LexisNexis Risk Solutions) has written a blog about HPCC Systems and the Path to Cloud, which gives some background information to this ongoing development project and provides details of how to set up a default cluster using a Helm chart.
- Jake Smith (Enterprise/Lead Architect, LexisNexis Risk Solutions) has written a blog about Setting up a default HPCC Systems cluster on Microsoft Azure Cloud Using HPCC Systems 7.8.x and Kubernetes, which provides detailed instructions and helpful hints in a tutorial style.
These blogs are a good starting point for understanding the new containerized version of the HPCC Systems platform, but neither describes how to persist the data.
In this blog, Gavin Halliday (Enterprise/Lead Architect, LexisNexis Risk Solutions) walks through how to persist your data. He also provides a temporary solution you can use to get your data into the system until a permanent solution becomes available.
So how do you persist the data?
If you are using the standard Helm charts, your data and queries will disappear when a release is uninstalled. If you are bringing up a system to test or experiment with, that may well be considered a feature, but if you want to do any real work it is not very useful.
Kubernetes uses persistent volume claims (pvcs) to provide access to data storage. By default the hpcc Helm charts use pvcs that have the same lifetime as the cluster: data volumes are automatically created when the cluster is installed, and those volumes are freed when the cluster is uninstalled.
If you want your data to persist for longer than the hpcc chart is installed, you need to use pvcs that have a longer lifetime (or use a different mechanism for storing data, e.g. S3 buckets or Azure blobs). Storing the hpcc data in blob storage is currently under development, but you can use pvcs today.
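For reference, a pvc is a small Kubernetes resource in its own right. A minimal sketch looks like the following (the name, size and access mode here are illustrative, not values taken from the hpcc charts):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-pvc            # illustrative name, not one the hpcc charts create
spec:
  accessModes:
    - ReadWriteMany            # shared access, since multiple pods read and write the same data
  resources:
    requests:
      storage: 10Gi            # illustrative size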
Let’s walk through the most recent changes included in the HPCC Systems 7.8.x series that make it easy for you to persist the data.
Helm file changes
The hpcc Helm charts have been modified to make it easier to add support for persisting data. (These changes took place after Richard and Jake’s blogs were written; both blogs have now been updated to reflect the new structure.) The information about storage has been restructured in the following way:
- global.dataStorage has moved to storage.dataStorage
- global.dllServer has moved to storage.dllStorage
- dali[n].storage has moved to storage.daliStorage
The different storage options are now specified consistently. More importantly, all the values can be overridden with a --set or --values option on a helm install command.
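For example, if you already had a claim named my-data-pvc (a hypothetical name), you could point the data storage at it directly from the command line:

helm install mycluster hpcc/ --set storage.dataStorage.existingClaim=my-data-pvc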
Example Helm Charts
The Helm charts have also moved: the master copies now live in a new helm directory rather than the dockerfiles directory. When a release is tagged as RC or Gold, the Helm files are published to the Helm chart repo (https://github.com/hpcc-systems/helm-chart) as tgz files. When the most recent stable version is tagged Gold, the Helm file sources are also published.
We have added some example charts to the new helm directory (and Helm chart repo) to simplify persisting data in different environments. When these charts are installed they create persistent volume claims (pvcs) which can then be used by the hpcc charts when they are installed. You can download the charts from the Helm chart repo with the following command:
git clone git@github.com:hpcc-systems/helm-chart.git
and then change to the directory containing the sources for the hpcc Helm charts:
cd helm-chart/helm
Persisting data for a cluster running on a local machine
Let’s walk through a couple of common ways to run Kubernetes on a local machine: Docker Desktop and minikube.
Docker Desktop:
This works particularly well with OSX and Windows (especially with the new WSL 2 support in Windows 10 version 2004).
If you have a Windows machine and you want data persisted to the c:\hpccdata directory, the first step is to make sure the relevant directories exist:
mkdir c:\hpccdata
mkdir c:\hpccdata\dlls
mkdir c:\hpccdata\data
mkdir c:\hpccdata\dali
Next, install the Helm chart from the examples/local directory, which creates persistent volumes based on host directories. The --set option is used to specify the base directory (the path /run/desktop/mnt/host/ provides access to the host file system for WSL 2):
helm install localfile examples/local/hpcc-localfile --set common.hostpath=/run/desktop/mnt/host/c/hpccdata
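Before installing the hpcc chart itself, you can check that the claims were created and bound (standard kubectl, nothing hpcc-specific):

kubectl get pvc

You should see the three claims referenced by the values file in the next step: dll-localfile-hpcc-localfile-pvc, dali-localfile-hpcc-localfile-pvc and data-localfile-hpcc-localfile-pvc.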
Finally, install the hpcc chart, supplying a yaml file with storage information that uses the pvcs created in the previous step. The examples/local directory contains a sample yaml file that can be used in this case:
helm install mycluster hpcc/ --set global.image.version=<version> -f examples/local/values-localfile.yaml
The values from examples/local/values-localfile.yaml override the corresponding entries in the original hpcc Helm chart. This allows the same base configuration to be used in different environments, updating only the storage information. The yaml file itself is fairly simple and can easily be adapted to different release names:
storage:
  dllStorage:
    existingClaim: "dll-localfile-hpcc-localfile-pvc"
    forcePermissions: true
  daliStorage:
    existingClaim: "dali-localfile-hpcc-localfile-pvc"
    forcePermissions: true
  dataStorage:
    existingClaim: "data-localfile-hpcc-localfile-pvc"
    forcePermissions: true
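The claim names appear to follow the pattern <prefix>-<file-release-name>-hpcc-localfile-pvc. So if, for example, you installed the storage chart under the release name myfile instead of localfile (a hypothetical choice), the dataStorage entry would presumably become:

storage:
  dataStorage:
    existingClaim: "data-myfile-hpcc-localfile-pvc"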
With these charts installed, any data created while the system is running will be stored in the c:\hpccdata directory and will persist after the hpcc and file charts are uninstalled.
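If you want to confirm that everything came up, kubectl will list the hpcc pods:

kubectl get pods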
The README.md and NOTES.txt files within the examples/local directory provide more information on using and configuring the chart.
Minikube
If you are running on a Linux machine then you are likely to be using minikube (installation instructions can be found here: https://kubernetes.io/docs/tasks/tools/install-minikube/). The process is very similar to Docker Desktop, but this time the path you want to store the data in is /home/<username>/hpccdata (~/hpccdata). First, make sure the directories exist:
mkdir ~/hpccdata
mkdir ~/hpccdata/dlls
mkdir ~/hpccdata/data
mkdir ~/hpccdata/dali
The next step is specific to minikube. Minikube runs Kubernetes in a virtual machine, so the directories from the host need to be mounted into that virtual machine before containers running in the minikube environment can access them. To do that, run the minikube mount command (10001 and 10000 are the gid and uid of the ‘hpcc’ user inside the containers):
minikube mount /home/<username>/hpccdata:/mnt/hpccdata --gid=10001 --uid=10000
The ~/hpccdata directory is now mounted as /mnt/hpccdata within the minikube VM. The minikube mount command must stay running for the duration of the later stages.
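You can verify the mount before going any further (minikube ssh runs a command inside the VM):

minikube ssh "ls /mnt/hpccdata"

You should see the dali, data and dlls directories created earlier.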
Next, install the chart that creates persistent volume claims (pvcs), using the mount path in the minikube VM:
helm install localfile examples/local/hpcc-localfile --set common.hostpath=/mnt/hpccdata
Finally install the hpcc chart, overriding the storage information to use the pvcs:
helm install mycluster hpcc/ --set global.image.version=<version> -f examples/local/values-localfile.yaml
Other environments
The examples currently include the following Helm charts (more are planned):
- local – Docker Desktop for osx/windows or minikube
- azure – Persistent volume claims that use the azurefile storage. Also contains an example values.yaml file for specifying non-persistent azurefile storage.
- nfs – For deploying to AWS or Google Cloud
Simply follow the same pattern to use each of them:
- Install a chart to create the persistent volume claims:
helm install <file-release-name> examples/<dir>/<chart> --set storage.x.y=values
- Install the chart for the main hpcc system, supplying a values file to override the storage definitions. The names of the pvcs created when the storage chart is installed depend on the release name that is provided. The values file must use those generated names:
helm install <hpcc-release-name> hpcc --values storage-override.yaml
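For example, with the azure charts the two steps might look something like this (the chart and values file names here are assumptions based on the local example; check the README in examples/azure for the exact names):

helm install azstorage examples/azure/hpcc-azurefile
helm install mycluster hpcc/ --values examples/azure/values-azurefile.yaml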
The data persists for as long as the storage chart is installed. You can uninstall the hpcc release, reconfigure it, or launch multiple releases, and the workunits and data are preserved.
For the local file examples the data is also preserved after the pvcs are uninstalled. For some of the other examples (e.g. Azure) the pvcs use a retain policy, which means the persistent volumes (pvs) remain after the chart is uninstalled, with a status of “Released”.
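You can see this for yourself: after uninstalling the charts, list the persistent volumes and check the STATUS column:

kubectl get pv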
How do I get data into the system?
That still leaves the question of how you can import data into the hpcc system. This is still very much a work in progress. However, there is a short term solution if you are running a system on a local machine.
- Create a subdirectory within your data directory. For example:
mkdir ~/hpccdata/data/import
- Copy the files you want to upload to the system into that directory:
cp example.csv ~/hpccdata/data/import
- Use the ECL file::ip::filename syntax to read the files directly from the import directory. For example:
nameRec := RECORD
    STRING firstname;
    STRING lastname;
    STRING dob;
END;
allPeople := DATASET('~file::localhost::var::lib::^H^P^C^C^Systems::hpcc-data::import::example.csv', nameRec, CSV);
OUTPUT(allPeople);
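If the file has a header row or a non-default delimiter, the standard ECL CSV options can describe that too. For example (HEADING and SEPARATOR are standard CSV options; this assumes the same example.csv as above):

allPeople := DATASET('~file::localhost::var::lib::^H^P^C^C^Systems::hpcc-data::import::example.csv', nameRec, CSV(HEADING(1), SEPARATOR(',')));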
It’s not pretty, the filename is ugly, but it works! Needless to say, we are working on a longer term solution, including an easy way to read files from blob/bucket storage.
Expect more changes over the next few weeks….
Additional Notes
- The newer versions of minikube have a --driver=none mode which aims to avoid the virtual machine. That may avoid the need to mount the host volume, but it has not been recently tested with the hpcc charts.
- The step of having to create the nested subdirectories may be removed in a future version.
- Yes, that filename syntax really is horrible.
- Development of our Cloud native platform is ongoing. It is not recommended to use it in production environments but we do want your feedback. Please take it for a test drive and use our Community Issue Tracker to let us know about any issues you find.