File Storage on the HPCC Systems Cloud Native Platform
The method of defining storage on our Cloud Native platform has been rationalized and simplified in HPCC Systems 8.2.0. This blog takes you through the new way storage is defined in the values.yaml file, providing some examples of how it can be used.
For those migrating values.yaml files from earlier versions, we recommend first understanding the new structure and then reading the migration hints at the end.
The Storage Section of the values.yaml File
This section configures the locations that HPCC Systems uses to store all categories of data. Most of that configuration is provided within the list of storage planes.
Each plane has three required fields: name, category, and prefix.
As shown in the following simplified list example:
```
storage:
  planes:
  - name: dali
    category: dali
    prefix: "/var/lib/HPCCSystems/dalistorage"
  - name: dll
    category: dll
    prefix: "/var/lib/HPCCSystems/queries"
  - name: primarydata
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
```
The Name Property
The name property is used to identify the storage plane in the helm charts. It is also visible to the user, since it is used to identify a storage location within ECL Watch or ECL code. The name must be unique and must not include upper-case characters. It loosely corresponds to a cluster in the bare-metal version of the platform.
The Category Property
The category property indicates the kind of data being stored in that location. Different planes are used for the different categories, both to isolate the different types of data from each other and because the categories often require different performance characteristics. A named plane may only store one category of data. The following categories are currently supported (with some notes about their performance characteristics):
- data
Where are the data files generated by HPCC Systems stored? For Thor, storage costs are likely to be significant; sequential access speed is important, but random access much less so. For ROXIE, the speed of random access is likely to be important.
- lz
A landing zone where external users can read and write files. HPCC Systems can import files from, or export files to, a landing zone. Performance is typically less of an issue; it could be blob/S3 bucket storage, accessed either directly or via an NFS mount.
- dali
The location of the dali metadata store, which needs to support fast random access.
- dll
Where are the compiled ECL queries stored? The storage needs to allow shared objects to be loaded directly from it efficiently.
- sasha
The location where archived workunits, etc. are stored. It is typically less speed critical, favouring lower storage costs.
- spill (optional)
Where are spill files from Thor written? Local NVMe disks are potentially a good choice.
- temp (optional)
Where are temporary files written?
Currently temp and spill are not completely implemented, but will be in future point releases. It is likely that other categories will be added in the future (for example, a location to store inter-subgraph spills).
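For instance, a plane for the sasha category follows the same pattern as the planes shown earlier; the path used here is only illustrative:

```
storage:
  planes:
  - name: sasha
    category: sasha
    prefix: "/var/lib/HPCCSystems/sasha"    # illustrative path for archived workunits etc.
```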
The Prefix Property
The most common case is where prefix defines the path within the container where the storage is mounted. In the example above, the planes are all sub-directories of /var/lib/HPCCSystems.
HPCC Systems also allows some file systems to be accessed through a URL syntax. For example, the following landing zone uses Azure blob storage:
```
storage:
  planes:
  - name: azureblobs
    prefix: "azure://ghallidayblobs@data"
    category: lz
```
How is storage associated with a storage plane?
So far we have seen the properties that describe how the HPCC Systems application views the storage, but how does Kubernetes associate those definitions with physical storage?
Ephemeral storage: (storageClass, storageSize)
Ephemeral storage is allocated when the HPCC Systems cluster is installed and deleted when the chart is uninstalled. It is useful for providing a clean system for testing, or a demonstration system to experiment with. It is not suitable for production systems, and for this reason the helm chart generates a warning if it is used.
- storageClass
Which storage provisioner should be used to allocate the storage? A blank storage class indicates the default provisioner should be used.
- storageSize
How much storage is required for this plane? The following example shows an ephemeral data plane:
```
planes:
- name: data
  storageClass: ""
  storageSize: 1Gi
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  category: data
```
And to add an ephemeral landing zone (which you can use to upload files to via ECL Watch) you could use:
```
planes:
- name: mylandingzone
  storageClass: ""
  storageSize: 1Gi
  prefix: "/var/lib/HPCCSystems/mylandingzone"
  category: lz
```
Persistent storage (pvc)
For persistent storage, the HPCC Systems cluster uses persistent volume claims (PVCs) that have already been created and installed by, for example, another Helm chart. Using a PVC allows the data stored on those volumes to outlive the HPCC Systems cluster that uses them. The helm/examples directory contains charts that simplify defining persistent storage for a local machine, Azure, AWS, etc.
- pvc
The pvc property names a Persistent Volume Claim created by another chart.
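For example, a persistent data plane might reference a claim installed beforehand; the claim name used here is hypothetical:

```
storage:
  planes:
  - name: data
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
    pvc: data-storage-hpcc-pvc    # hypothetical PVC created by a separate chart
```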
Default storage planes
The values file can contain more than one storage plane definition for each category. The first storage plane in the list for each category is used as the default location to store that category of data.
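For instance, with two data planes defined, components store data on the first unless told otherwise; the plane names here are illustrative:

```
storage:
  planes:
  - name: primarydata     # first plane with category data, so the default
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
  - name: premium-data    # only used where a component selects it explicitly
    category: data
    prefix: "/var/lib/HPCCSystems/premium-data"
```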
The default can be overridden on each component by specifying a property with the name "<category>Plane". This example overrides the default data plane for an hthor instance, and the plane used to store the dali data:
```
eclagent:
- name: hthor
  prefix: hthor
  dataPlane: premium-data          # override the default data plane
dali:
- name: mydali
  daliPlane: primary-dali-plane    # override the plane to store the dali data
```
Other storage.planes options
- forcePermissions: <boolean>
In some situations the default permissions for the mounted volumes do not allow the hpcc user to write to the storage. Setting this option ensures the ownership of the volume is changed before the main process is started.
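A sketch of a plane with the option enabled (the plane details are illustrative):

```
planes:
- name: data
  category: data
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  forcePermissions: true    # change ownership of the volume before the main process starts
```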
- subPath: <string>
This property provides an optional sub-directory within <prefix> to use as the root directory. Most of the time the different categories of data will be stored in different locations and this option is not needed. However, if there is a requirement to store two categories of data in the same location, then it is legal to have two storage planes use the same prefix/path and different categories as long as the rest of the plane definitions are identical (except for the name and the subPath). The subPath property allows the data to reside in separate directories so they cannot clash.
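For example, the dali and dll planes below share a prefix but keep their data in separate sub-directories; the layout is illustrative:

```
planes:
- name: dalistore
  category: dali
  prefix: "/var/lib/HPCCSystems/shared"
  subPath: dali    # data stored in <prefix>/dali
- name: dllstore
  category: dll
  prefix: "/var/lib/HPCCSystems/shared"
  subPath: dll     # data stored in <prefix>/dll
```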
- secret: <string>
This provides the name of any secret that is required to access the plane's storage. It is currently unused, but may be required once inter-cluster remote file access is finished.
- defaultSprayParts: <number>
Earlier we commented that storage planes are similar to clusters in our bare metal platform. One key difference is that bare metal clusters are associated with a fixed size Thor, whereas a storage plane is not. This property allows you to define the number of parts that a file is split into when it is imported/sprayed. The default is currently 1, but that will soon change to the size of the largest Thor cluster.
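For example, to split sprayed files into 8 parts regardless of the Thor size (the plane details are illustrative):

```
planes:
- name: data
  category: data
  prefix: "/var/lib/HPCCSystems/hpcc-data"
  defaultSprayParts: 8    # sprayed/imported files are split into 8 parts
```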
- cost
This property allows you to specify the costs associated with the storage, so that the platform can calculate an estimate of the cost of each file. Currently only the cost at rest is supported; transactional costs will be added later. For example:
```
cost:
  storageAtRest: 0.113    # storage at rest cost, per GiB/month
```
Bare Metal Storage
There are two aspects to using bare metal storage in the Kubernetes system. The first is the hostGroups entry in the storage section which provides named lists of hosts. The hostGroups entries can take one of two forms.
This is the most common form, and directly associates a list of host names with a name:
```
storage:
  hostGroups:
  - name: "The name of the host group"
    hosts: [ "a list of host names" ]
```
The second form allows one host group to be derived from another:
```
storage:
  hostGroups:
  - name: "The name of the host group"
    hostGroup: "Name of the hostgroup to create a subset of"
    count: "Number of hosts in the subset"
    offset: "The first host to include in the subset"
    delta: "Cycle offset to apply to the hosts"
```
Typical examples with bare-metal clusters are smaller subsets of the hosts, or the same hosts but with different parts stored on different nodes, for example:
```
storage:
  hostGroups:
  - name: groupABCDE        # Explicit list of hosts
    hosts: [A, B, C, D, E]
  - name: groupCDE          # Subset of the group - the last 3 hosts
    hostGroup: groupABCDE
    count: 3
    offset: 2
  - name: groupDEC          # Same set of hosts, but a different part->host mapping
    hostGroup: groupCDE
    delta: 1
```
The second aspect is to add a property to the storage plane definition to indicate which hosts are associated with it. There are two options:
- hostGroup: <name>
The name of the host group for bare metal. For historical reasons the name of the hostgroup must match the name of the storage plane.
- hosts: <list-of-names>
An inline list of hosts. Primarily useful for defining one-off external landing zones, for example:
```
storage:
  planes:
  - name: demoOne
    category: data
    prefix: "/home/gavin/temp"
    hostGroup: groupABCD    # The name of the hostGroup
  - name: myDropZone
    category: lz
    prefix: "/home/gavin/mydropzone"
    hosts: [ 'mylandingzone.com' ]    # Inline reference to an external host
```
Migrating from earlier versions
If you are using the default values files or the example storage helm charts, then most of the changes will be hidden under the covers. If you are using a custom values.yaml, the following steps will migrate it:
- Implicit planes now need to be explicit
For each category of storage that was previously implicit (e.g. using daliStorage), define a storage plane with that category, using the same properties.
- Change the way default planes are defined
- If you only have a single plane for each category this will involve deleting the dataStorage/daliStorage/dllStorage sections. If you have multiple planes and the default is not the first plane then reorder the planes, or explicitly define the default in the component.
- Rename the storagePlane property
Any references to storagePlane that define the default data plane for an engine component should now use dataPlane for consistency.
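A sketch of the rename, with the plane name illustrative:

```
eclagent:
- name: hthor
  # storagePlane: primarydata    # property name before 8.2.0
  dataPlane: primarydata         # property name from 8.2.0 onwards
```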
- Simplify Sasha storage
Previously within a sasha service there was a storage section. If this was used to create an ephemeral plane, then an explicit plane should be defined instead. The name of the Sasha plane is now given via sasha.<service>.plane, rather than sasha.<service>.storage.plane.
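A sketch of the change, assuming a Sasha service named wu-archiver:

```
sasha:
  wu-archiver:
    # storage:
    #   plane: sasha    # before 8.2.0
    plane: sasha        # from 8.2.0 onwards
```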
Use these resources to find out more about our Cloud Native Platform:
- For more information about the differences between previous versions and HPCC Systems 8.2.0, see this changes document located in the HPCC Systems Github Repository and read this blog providing more details about Features New and Improved in HPCC Systems 8.2.0.
- HPCC Systems Cloud Native Wiki Page – Provides access to blogs, how to videos and links to various GitHub repository resources
- HPCC Systems 8.0.0 – Cloud Native Platform Highlights
- HPCC Systems Helm GitHub Repository – For deploying HPCC Systems under Kubernetes
- Supporting Documentation – Containerized HPCC Systems Platform