Exporting Data from an HPCC Systems bare metal platform to our Cloud Native platform

The most recent versions of the HPCC Systems platform allow you to run a system on Kubernetes in the cloud, and version 8.0.0 will be suitable for production testing when it becomes available in the next few months. But to test whether a system is ready for production you often need some real-life data. How do you get data from an existing bare metal system onto the cloud so you can begin some representative testing?

This blog covers the steps needed to export files to a Kubernetes system using Azure blob storage. (Details for other cloud providers will follow later.)

Note: This process does not require access from the cloud system to the bare metal system.

The process makes use of a few changes that have recently been added to the HPCC Systems platform to make this possible. You will need the most recent master or 7.12.x builds of the platform on both the bare metal and cloud systems to follow these steps successfully.

In brief the steps are:

  • Export the raw data files from the bare metal system to blob storage
  • Export the information about the files from the bare metal system
  • Import the metadata into the cloud system.

Configuring the bare metal and cloud systems for file transfer

Follow these steps:

1. Add a dropzone for the Azure blob storage

A dropzone should be added to the environment.xml file for the bare metal system:

<DropZone build="_" 
          directory="azure://mystorageaccount@data" 
          name="myazure" 
          ECLWatchVisible="true" 
          umask="022"> 
          <ServerList name="ServerList" server="127.0.0.2"/> 
</DropZone>

This uses a fake localhost IP address (127.0.0.2) so that the dropzone can be identified when exporting the data.
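
Before going any further you may want to confirm that the prefix azure://mystorageaccount@data points at a real storage account and container. The following is only a sketch of such a check; it assumes the Azure CLI is installed and logged in, and uses the example names mystorageaccount and data from above.

# A sketch: verify that the storage account and container referenced by the
# dropzone directory exist (assumes the Azure CLI is installed and logged in)
az storage account show --name mystorageaccount --query name -o tsv
az storage container exists --account-name mystorageaccount --name data --auth-mode login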

2. Add a secret for the blob storage

Save the Azure storage account access key into the file:

/opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key

This will then be used when the Azure storage account mystorageaccount is accessed.
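
As a concrete illustration, the key file could be created along the following lines. This is only a sketch: it assumes the Azure CLI is available on the bare metal node (otherwise paste the key from the Azure portal into the file by hand), and <resource-group> is a placeholder for your own resource group.

# A sketch: create the per-account secret file read by the bare metal system
mkdir -p /opt/HPCCSystems/secrets/storage/azure-mystorageaccount
az storage account keys list --resource-group <resource-group> \
    --account-name mystorageaccount --query '[0].value' -o tsv \
    > /opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key
chmod 600 /opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key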

3. Define a storage plane on the cloud system

The following section should be added to the helm values file:

storage:
  planes: 
  ... 
  - name: azureblobs
    prefix: azure://mystorageaccount@data 
    secret: azure-mystorageaccount

4. Publish a secret

Register a file containing the key used to access the storage account as a secret with Kubernetes:

kubectl apply -f secrets/myazurestorage.yaml

With the following definition of secrets/myazurestorage.yaml:

apiVersion: v1 
kind: Secret 
metadata: 
  name: myazurestorage 
type: Opaque 
stringData: 
  key: <base64-encoded-access-key>
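
Note that stringData takes the key verbatim; an Azure storage access key is already a base64 string, so no extra encoding is needed. As an alternative to writing the yaml by hand, the same secret can be generated directly from the key file created earlier. This is just a sketch, and assumes that file is accessible from wherever you run kubectl:

# A sketch: create the Kubernetes secret straight from the existing key file
kubectl create secret generic myazurestorage \
    --from-file=key=/opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key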

And add the following definition to the helm values file to ensure the Kubernetes secret is associated with the appropriate logical secret name within the hpcc system:

secrets: 
  storage: 
    azure-mystorageaccount: myazurestorage
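
These values only take effect once the deployment is updated. As a sketch, assuming the chart was installed from the hpcc/hpcc repository under the release name mycluster (both names are placeholders), that would look something like:

# A sketch: roll the updated values (storage plane and secret mapping) out to
# the cluster; mycluster, hpcc/hpcc and values.yaml are placeholders
helm upgrade mycluster hpcc/hpcc -f values.yaml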

More information about configuring storage and persisting data is available in the following blogs:

Transferring the Data

1. Get a list of files to export

One possibility, especially for a small system, is to use a dfuplus command to get a list of files and superfiles (with an optional name pattern). For example:

dfuplus action=list server=<src-esp> [name=<filename-mask>]
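
If you want to feed that list into a script (such as the batch file at the end of this blog), it can help to capture it into a plain text file first so it can be reviewed and edited. A minimal sketch, where the name mask is just an example:

# A sketch: save the list of matching logical files for later review
dfuplus action=list server=<src-esp> name='regress::*' > filelist.txt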

2. Export the meta definition

For each of the files and superfiles we need to export the metadata to a local file.

dfuplus action=savexml server=<src-esp> srcname=<logical-filename> dstxml=<metafile-name>

The same command works for both files and superfiles, although you will want to import them differently.
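
The exported XML is also how you can tell the two apart later on: a superfile's metadata contains a SuperFile element, which is exactly what the example batch file at the end of this blog checks for. A minimal sketch:

# A sketch: export the metadata for one logical file, then check whether it
# describes a superfile (the batch script below uses the same grep test)
dfuplus action=savexml server=<src-esp> srcname=<logical-filename> dstxml=meta.xml
if grep -q SuperFile meta.xml; then
    echo "superfile"
else
    echo "plain logical file"
fi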

3. Export the data from the bare metal system to the Azure blob storage

The despray command can be used to copy a logical file to an external location:

dfuplus action=despray server=<src-esp> srcname=<logical-filename> dstip=127.0.0.2 wrap=1 transferBufferSize=4194304

This command line makes use of some recent changes:

    • The wrap=1 option ensures that the file parts are preserved as they are copied.
    • The destination filename is now optional; if it is omitted it is derived from the source filename. Exporting a::b::c.xml will write to the file a/b/c.xml._<n>_of_<N>.
    • The transferBufferSize is specified because it defaults to 64K in old environment files, which significantly reduces the throughput for large files.
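
Once the desprays have completed you may want to confirm that the parts have actually arrived in blob storage. The following is only a sketch using the Azure CLI with the example names from earlier; the prefix depends on the logical filename you exported.

# A sketch: list the exported file parts in the 'data' container
az storage blob list --account-name mystorageaccount --container-name data \
    --prefix a/b/ --output table \
    --account-key "$(cat /opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key)"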

 

4. Import each of the file definitions

Register the metadata for each of the files with the cloud system (which will now need to be running).

dfuplus action=add server=<cloud-esp> dstname=<logical-filename> srcxml=<metafile-name> dstcluster=azureblobs

There is a new dfuplus option which allows you to specify where the physical files are found. This should be set to the name of the blob storage plane, which, using this example, would be azureblobs. If the physical files do not exist in the correct places then this step will fail.
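
A quick way to sanity-check each import is to ask the cloud system to list the file again; something along these lines (a sketch, reusing the example filename from the despray step):

# A sketch: confirm the newly added logical file is visible on the cloud system
dfuplus action=list server=<cloud-esp> name=a::b::c.xml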

5. Import each of the super file definitions.

Finally, once all the files have been imported, the superfiles can be added:

dfuplus action=add server=<cloud-esp> dstname=<logical-filename> srcxml=<metafile-name>

The syntax is the same as importing the definition for a logical file, but there is no need to override the cluster.

Example Batch File

This sample batch file illustrates how to process a list of files and superfiles and perform all of the dfu commands required in steps 2-5 above:

#!/bin/bash

FilesToSpray=(
regress::local::dg_fetchindex1 
... 
) 

srcserver=localhost 
tgtserver=192.168.49.2:31056 
newplane=azureblobs

# The following is useful for checking the IPs have been configured correctly
echo "Source contains `dfuplus server=${srcserver} action=list "*" | wc -w` files (${srcserver})" 
echo "Target contains `dfuplus server=${tgtserver} action=list "*" | wc -w` files (${tgtserver})" 
echo "Copying `echo ${FilesToSpray[@]} | wc -w` files from ${srcserver}" 
echo "Press <newline> to continue" 
read 

# Iterate through the files 
for file in "${FilesToSpray[@]}"; do 
    #Export the meta data to a file 
    dfuplus action=savexml server=$srcserver srcname=$file dstxml=export.$file.xml 
    if ! grep -q SuperFile export.$file.xml; then 
        #A logical file => export it 
        echo dfuplus action=despray server=$srcserver srcname=$file dstip=127.0.0.2 wrap=1 transferBufferSize=4194304 
        dfuplus action=despray server=$srcserver srcname=$file dstip=127.0.0.2 wrap=1 transferBufferSize=4194304 
    fi 
done 

#Add the remote information for the raw files 
for file in "${FilesToSpray[@]}"; do 
    if ! grep -q SuperFile export.$file.xml; then 
        echo dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml dstcluster=$newplane
        dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml dstcluster=$newplane 
    fi 
done 

#Now add the superfile information 
for file in "${FilesToSpray[@]}"; do 
    if grep -q SuperFile export.$file.xml; then 
        echo super: dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml
        dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml 
    fi 
done
