Exporting Data from an HPCC Systems bare metal platform to our Cloud Native platform
The most recent versions of the HPCC Systems platform allow you to run a system on Kubernetes in the cloud, and version 8.0.0 will be suitable for production testing when it becomes available in the next few months. But to test whether a system is suitable for production you often need some real-life data. How do you get data from an existing bare metal system onto the cloud so you can begin some truly representative testing?
This blog covers the steps needed to export files to a Kubernetes system using Azure blob storage. (Details for other cloud providers will follow later.)
Note: This process does not require access from the cloud system to the bare metal system.
The process makes use of a few changes that have been recently added to the HPCC Systems platform to make this possible. You will need the most recent master or 7.12.x builds of the platform on both the bare metal and cloud systems to successfully follow these steps.
In brief the steps are:
- Export the raw data files from the bare metal system to blob storage
- Export the information about the files from the bare metal system
- Import the metadata into the cloud system.
Configuring the Bare Metal and Cloud Systems for File Transfer
Follow these steps:
1. Add a drop zone for the Azure blob storage
A dropzone should be added in the environment.xml file for the bare metal system:
<DropZone build="_" directory="azure://mystorageaccount@data" name="myazure" ECLWatchVisible="true" umask="022">
  <ServerList name="ServerList" server="127.0.0.2"/>
</DropZone>
This uses a fake localhost IP address (127.0.0.2) so that the dropzone can be identified when exporting the data.
2. Add a secret for the blob storage
Save the Azure storage account access key into the file:
/opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key
This will then be used when the Azure storage account mystorageaccount is accessed.
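For example, here is a minimal sketch of creating that file from the command line. It assumes you have permission to write under /opt/HPCCSystems and that the placeholder is replaced with your actual access key:
# Create the directory for the storage secret
sudo mkdir -p /opt/HPCCSystems/secrets/storage/azure-mystorageaccount
# Write the storage account access key (placeholder shown) into a file named "key"
echo -n '<your-storage-account-access-key>' | sudo tee /opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key > /dev/null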
3. Define a storage plane on the cloud system
The following section should be added to the helm values file:
storage:
  planes:
  ...
  - name: azureblobs
    prefix: azure://mystorageaccount@data
    secret: azure-mystorageaccount
4. Publish a secret
Register a file containing the key used to access the storage account as a secret with Kubernetes:
kubectl apply -f secrets/myazurestorage.yaml
With the following definition of secrets/myazurestorage.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: myazurestorage
type: Opaque
stringData:
  key: <base64-encoded-access-key>
Then add the following definition to the helm values file to ensure the Kubernetes secret is associated with the appropriate logical secret name within the HPCC system:
secrets:
  storage:
    azure-mystorageaccount: myazurestorage
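Alternatively, here is a sketch of creating the same Kubernetes secret directly from the key file saved in step 2, without writing a yaml file; it assumes that file is accessible from wherever you run kubectl:
kubectl create secret generic myazurestorage --from-file=key=/opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key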
More information about configuring storage and persisting data is available in the following blogs:
- Persisting Data in an HPCC Systems Cloud Native Environment
- Configuring Storage in the Cloud Native HPCC Systems Platform
- HPCC Systems Cloud Native Platform – Importing Data
Transferring the Data
1. Get a list of files to export
One possibility, especially for a small system, is to use a dfuplus command to get a list of files and superfiles (with an optional name pattern). For example:
dfuplus action=list server=<src-esp> [name=<filename-mask>]
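As a sketch, the output of that command can be captured straight into a bash array for use in the later steps. This assumes the source ESP is on localhost (as in the batch file at the end of this blog), that the mask is the hypothetical 'regress::*', and that the output contains only filenames; filter out any extra lines if your output differs:
# Capture the matching logical filenames into a bash array, one name per line
mapfile -t FilesToSpray < <(dfuplus action=list server=localhost name='regress::*')
echo "Found ${#FilesToSpray[@]} files to export"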
2. Export the meta definition
For each of the files and superfiles we need to export the metadata to a local file.
dfuplus action=savexml server=<src-esp> srcname=<logical-filename> dstxml=<metafile-name>
The same command works for both files and superfiles, although you will want to import them differently.
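The saved XML makes it easy to tell the two apart; for example, the batch file at the end of this blog simply checks for a SuperFile tag (the metafile name here is hypothetical):
# A superfile's metadata contains a SuperFile element; a plain logical file's does not
if grep -q SuperFile export.somefile.xml; then
  echo "superfile"
else
  echo "logical file"
fi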
3. Export the data from the bare metal system to the Azure blob storage
The despray command can be used to copy a logical file to an external location:
dfuplus action=despray server=<src-esp> srcname=<logical-filename> dstip=127.0.0.2 wrap=1 transferBufferSize=4194304
This command line makes use of some recent changes:
- The wrap=1 option ensures that the file parts are preserved as they are copied.
- The destination filename is now optional; if it is omitted it is derived from the source filename. Exporting:
a::b::c.xml
will write to the file:
a/b/c.xml._<n>_of_<N>
- The transferBufferSize is specified because it defaults to 64K in old environment files, which significantly reduces the throughput for large files.
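After the despray completes you can check that the parts have landed in blob storage. The sketch below assumes the Azure CLI is installed, that "data" is the container referenced by azure://mystorageaccount@data, and uses the a/b/ prefix from the example above:
# List the blobs written by the despray, authenticating with the saved access key
az storage blob list --account-name mystorageaccount --container-name data \
    --prefix a/b/ \
    --account-key "$(cat /opt/HPCCSystems/secrets/storage/azure-mystorageaccount/key)" \
    --output table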
4. Import each of the file definitions
Register the metadata for each of the files with the cloud system (which will now need to be running).
dfuplus action=add server=<cloud-esp> dstname=<logical-filename> srcxml=<metafile-name> dstcluster=azureblobs
There is a new dfuplus option, dstcluster, which allows you to specify where the physical files are found. This should be set to the name of the blob storage plane, which, using this example, would be azureblobs. If the physical files do not exist in the correct places then this will fail.
5. Import each of the super file definitions.
Finally, once all the files have been imported, the superfiles can be added:
dfuplus action=add server=<cloud-esp> dstname=<logical-filename> srcxml=<metafile-name>
The syntax is the same as importing the definition for a logical file, but there is no need to override the cluster.
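Once everything has been registered you can sanity-check the result with the same list command used earlier, this time pointed at the cloud ESP. The address shown is the example target ESP used in the batch file below; substitute your own:
# Confirm the imported files and superfiles are now visible on the cloud system
dfuplus action=list server=192.168.49.2:31056 name='*'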
Example Batch File
This sample batch file illustrates how to process a list of files and superfiles and perform all of the dfu commands required in steps 2-5 above:
#!/bin/bash
FilesToSpray=(
regress::local::dg_fetchindex1
...
)
srcserver=localhost
tgtserver=192.168.49.2:31056
newplane=azureblobs

# The following is useful for checking the IPs have been configured correctly
echo "Source contains `dfuplus server=${srcserver} action=list "*" | wc -w` files (${srcserver})"
echo "Target contains `dfuplus server=${tgtserver} action=list "*" | wc -w` files (${tgtserver})"
echo "Copying `echo ${FilesToSpray[@]} | wc -w` files from ${srcserver}"
echo "Press <newline> to continue"
read

# Iterate through the files
for file in "${FilesToSpray[@]}"; do
  #Export the metadata to a file
  dfuplus action=savexml server=$srcserver srcname=$file dstxml=export.$file.xml
  if ! grep -q SuperFile export.$file.xml; then
    #A logical file => export it
    echo dfuplus action=despray server=$srcserver srcname=$file dstip=127.0.0.2 wrap=1 transferBufferSize=4194304
    dfuplus action=despray server=$srcserver srcname=$file dstip=127.0.0.2 wrap=1 transferBufferSize=4194304
  fi
done

#Add the remote information for the raw files
for file in "${FilesToSpray[@]}"; do
  if ! grep -q SuperFile export.$file.xml; then
    echo dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml dstcluster=$newplane
    dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml dstcluster=$newplane
  fi
done

#Now add the superfile information
for file in "${FilesToSpray[@]}"; do
  if grep -q SuperFile export.$file.xml; then
    echo super: dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml
    dfuplus action=add server=$tgtserver dstname=$file srcxml=export.$file.xml
  fi
done