File Storage on the HPCC Systems Cloud Native Platform

The method of defining storage on our Cloud Native platform has been rationalized and simplified in HPCC Systems 8.2.0.  This blog takes you through the new way storage is defined in the values.yaml file, providing some examples of how it can be used.

For those migrating values.yaml files from earlier versions, we recommend first understanding the new structure and then reading the migration hints at the end.

The Storage Section of the values.yaml File

This section configures the locations that HPCC Systems uses to store all categories of data.  Most of that configuration is provided within the list of storage planes.

Each plane has 3 required fields:

  • name
  • category
  • prefix

As shown in the following simplified example:

```
storage:
  planes:
  - name: dali
    category: dali
    prefix: "/var/lib/HPCCSystems/dalistorage"
  - name: dll
    category: dll
    prefix: "/var/lib/HPCCSystems/queries"
  - name: primarydata
    category: data
    prefix: "/var/lib/HPCCSystems/hpcc-data"
``` 

The Name Property

The name property is used to identify the storage plane in the helm charts. It is also visible to the user, since it is used to identify a storage location within ECL Watch or ECL code. The name must be unique and must not include upper-case characters. It loosely corresponds to a cluster in the bare-metal version of the platform.

Category

The category property indicates the kind of data stored in that location.  Different planes are used for different categories, both to isolate the different types of data from each other and because they often require different performance characteristics.  A plane may only store one category of data.  The following categories are currently supported (with some notes about their performance characteristics):

  • data
    Where data files generated by HPCC Systems are stored.  For Thor, storage costs are likely to be significant; sequential access speed is important, but random access much less so.  For ROXIE, the speed of random access is likely to be important.
  • lz
    A landing zone where external users can read and write files.  HPCC Systems can import files from or export files to a landing zone.  Performance is typically less of an issue; it could be blob/S3 bucket storage, accessed either directly or via an NFS mount.
  • dali
    The location of the Dali metadata store, which needs to support fast random access.
  • dll
    Where the compiled ECL queries are stored.  The storage needs to allow shared objects to be loaded directly from it efficiently.
  • sasha
    The location where archived workunits, etc. are stored (an example plane definition is shown below).  It is typically less speed critical, so lower-cost storage can be used.
  • spill (optional)
    Where spill files from Thor are written.  Local NVMe disks are potentially a good choice.
  • temp (optional)
    Where temporary files are written.

Currently temp and spill are not completely implemented, but will be in future point releases.  It is likely that other categories will be added in the future (for example, a location to store inter-subgraph spills).
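
For example, a sasha plane can be defined in the same way as the planes shown earlier.  This is only a sketch: the plane name and prefix below are illustrative, not required defaults:

```
storage:
  planes:
  - name: sasha
    category: sasha
    prefix: "/var/lib/HPCCSystems/sashastorage"   # illustrative path for archived workunits etc.
```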

Prefix

The most common case is where prefix defines the path within the container where the storage is mounted.  In the example above, they are all sub-directories of /var/lib/HPCCSystems.

HPCC Systems also allows some file systems to be accessed through a URL syntax.  For example, the following landing zone uses Azure blob storage:

```
storage:
  planes:
  - name: azureblobs
    prefix: "azure://ghallidayblobs@data"
    category: lz
```

How is storage associated with a storage plane?

So far we have seen the properties that describe how the HPCC Systems application views the storage, but how does Kubernetes associate those definitions with physical storage?

Ephemeral storage (storageClass, storageSize)

Ephemeral storage is allocated when the HPCC Systems cluster is installed and deleted when the chart is uninstalled.  It is useful for providing a clean system for testing, and for demonstration systems that allow you to experiment with the platform.  It is not so useful for production systems, and for this reason the helm chart generates a warning if it is used.

  • storageClass:
    Which storage provisioner should be used to allocate the storage?  A blank storage class indicates the default provisioner should be used.
  • storageSize:
    How much storage is required for this plane?  This example shows an ephemeral data plane:
```
  planes:
  - name: data
    storageClass: ""
    storageSize: 1Gi
    prefix: "/var/lib/HPCCSystems/hpcc-data"
    category: data
```

To add an ephemeral landing zone (to which you can upload files via ECL Watch), you could use:

```
  planes:
  - name: mylandingzone
    storageClass: ""
    storageSize: 1Gi
    prefix: "/var/lib/HPCCSystems/mylandingzone"
    category: lz
```

Persistent storage (pvc)

For persistent storage, the HPCC Systems cluster uses persistent volume claims (PVCs) that have already been created and installed by, for example, another helm chart.  Using a PVC allows the data stored on those volumes to outlive the HPCC Systems cluster that uses them.  The helm/examples directory contains charts to simplify defining persistent storage for a local machine, Azure, AWS, etc.

  • pvc
    The pvc property names a Persistent Volume Claim created by another chart, as in the example below.
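
For example, a data plane backed by a pre-created claim might look like the following sketch (the claim name is hypothetical and must match whatever claim your storage chart creates):

```
  planes:
  - name: data
    pvc: data-storage-hpcc-pvc              # hypothetical claim created by a separate chart
    prefix: "/var/lib/HPCCSystems/hpcc-data"
    category: data
```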

Default storage planes

The values file can contain more than one storage plane definition for each category.  The first storage plane in the list for each category is used as the default location to store that category of data.

The default can be overridden on each component by specifying a property with the name “<category>Plane”.  This example overrides the default data plane for an hthor instance and the default dali plane for a dali component:

```
eclagent:
- name: hthor
  prefix: hthor
  dataPlane: premium-data               # override the default data plane
dali:
- name: mydali
  daliPlane: primary-dali-plane         # override the plane to store the dali data
```

Other storage.planes options

  • forcePermissions: <boolean>  
    In some situations the default permissions for the mounted volumes do not allow the hpcc user to write to the storage.  Setting this option ensures the ownership of the volume is changed before the main process is started.
  • subPath: <string>
    Provides an optional sub-directory within <prefix> to use as the root directory.  Most of the time the different categories of data are stored in different locations and this option is not needed.  However, if two categories of data must share the same location, it is legal for two storage planes to use the same prefix/path with different categories, as long as the rest of the plane definitions are identical (except for the name and the subPath).  The subPath property allows each plane's data to reside in a separate directory so they cannot clash (see the sketch after this list).
  • secret: <string>
    Provides the name of any secret that is required to access the plane’s storage.  It is currently unused, but may be required once inter-cluster remote file access is finished.
  • defaultSprayParts: <number>  
    Earlier we commented that storage planes are similar to clusters in our bare metal platform.  One key difference is that bare metal clusters are associated with a fixed size Thor, whereas a storage plane is not.  This property allows you to define the number of parts that a file is split into when it is imported/sprayed.  The default is currently 1, but that will soon change to the size of the largest Thor cluster.
  • cost:  
    This property allows you to specify the costs associated with the storage, so that the platform can calculate an estimate of the costs associated with each file.  Currently only the cost at rest is supported; transactional costs will be added later.  For example:
```
      cost:
        storageAtRest: 0.113              # Storage at rest cost: cost per GiB/month
```
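
To illustrate subPath, the following sketch (with hypothetical plane names and paths) stores two categories of data under the same prefix, separated into different sub-directories:

```
  planes:
  - name: shareddata
    category: data
    prefix: "/var/lib/HPCCSystems/shared"
    subPath: "data"                     # data files live under .../shared/data
  - name: shareddll
    category: dll
    prefix: "/var/lib/HPCCSystems/shared"
    subPath: "queries"                  # compiled queries live under .../shared/queries
```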

Bare Metal Storage

There are two aspects to using bare metal storage in the Kubernetes system. The first is the hostGroups entry in the storage section which provides named lists of hosts. The hostGroups entries can take one of two forms.

This is the most common form, and directly associates a list of host names with a name:

```
storage:
  hostGroups:
  - name: "The name of the host group process"
    hosts: [ "a list of host names" ]
```

The second form allows one host group to be derived from another:

```
storage:
  hostGroups:
  - name: "The name of the host group process"
    hostGroup: "Name of the hostgroup to create a subset of"
    count: "Number of hosts in the subset"
    offset: "the first host to include in the subset"
    delta:  "Cycle offset to apply to the hosts"
```

Typical examples with bare-metal clusters are smaller subsets of the hosts, or the same hosts but with file parts mapped to different nodes, for example:

```
storage:
  hostGroups:
  - name: groupABCDE              # Explicit list of hosts
    hosts: [A, B, C, D, E]
  - name: groupCDE                # Subset of the group: the last 3 hosts
    hostGroup: groupABCDE
    count: 3
    offset: 2
  - name: groupDEC                # Same set of hosts, but different part->host mapping
    hostGroup: groupCDE
    delta: 1
```

The second aspect is to add a property to the storage plane definition to indicate which hosts are associated with it.  There are two options:

  • hostGroup: <name>
    The name of the host group for bare metal.  For historical reasons the name of the hostgroup must match the name of the storage plane.
  • hosts: <list-of-names>
    An inline list of hosts.  Primarily useful for defining one-off external landing zones, for example:
```
storage:
  planes:
  - name: demoOne
    category: data
    prefix: "/home/gavin/temp"
    hostGroup: groupABCD             # The name of the hostGroup
  - name: myDropZone
    category: lz
    prefix: "/home/gavin/mydropzone"
    hosts: [ 'mylandingzone.com' ]  # Inline reference to an external host.
```

Migrating from earlier versions

If you are using the default values files or the example storage helm charts, most of the changes will be hidden under the covers.  If you are using a custom values.yaml, the following steps will migrate it:

  • Implicit planes now need to be explicit
    For each category of storage that was previously implicit (e.g. configured via daliStorage), define a storage plane with that category, using the same properties (see the sketch after this list).
  • Change the way default planes are defined
    If you only have a single plane for each category, this involves deleting the dataStorage/daliStorage/dllStorage sections.  If you have multiple planes and the default is not the first plane, then reorder the planes, or explicitly define the default in the component.
  • Rename the storagePlane property
    Any references to storagePlane that define the default data plane for an engine component should now use dataPlane for consistency.
  • Simplify Sasha storage  
    Previously within a sasha service there was a storage section.  If this was used to create an ephemeral plane, then an explicit plane should be defined instead.  The name of the Sasha plane is now given via sasha.<service>.plane, rather than sasha.<service>.storage.plane.
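
As a sketch of the first step, assuming a pre-8.2 values file that used an implicit daliStorage section (the exact property names in the "before" fragment may differ in your file), the implicit section is replaced by an explicit plane:

```
# Before (earlier versions) - implicit dali storage (assumed layout)
storage:
  daliStorage:
    storageClass: ""
    storageSize: 1Gi

# After (8.2.0 and later) - an explicit plane with category dali
storage:
  planes:
  - name: dali
    category: dali
    storageClass: ""
    storageSize: 1Gi
    prefix: "/var/lib/HPCCSystems/dalistorage"
```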

More Information

Use these resources to find out more about our Cloud Native Platform: