Managing Source Code

The Git Repository

Git is free, open source software for distributed version control. Git tracks changes for any set of files. With Git, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities.

Refer to https://git-scm.com/ for more information.

HPCC Systems has support for the ECLCC Server to compile ECL code directly from Git repositories. The repositories (and optional branches/users) are configured using environment variables on the server. You can submit a query from a repository branch, and the ECLCC Server will pull the source code from a Git repository and compile it. This allows you to deploy a specific version of a query without needing to perform any work on the client.

The Git Improvements

Starting with version 8.4, the platform code for Git support has improved significantly. Some of these improvements have been backported to older supported releases, such as 7.12. However, you still need to update to a recent point release to get them. Later releases, such as 8.6, include all of these improvements.

Speed Improvements

The platform code has been upgraded to significantly improve speed. Compiling from Git repositories is now faster and adds no overhead compared with compiling from checked-out sources.

Git Resources and Manifests

The HPCC Systems platform now supports Git manifests and resources when compiling.

Git-lfs Support

Git-lfs is an extension to Git that improves support for large files and is supported by both GitHub and GitLab. This extension is particularly useful for large resources, for example, if you have Java packages included as part of the manifest.

Multiple Repository Support

The HPCC Systems platform code includes support for using multiple Git repositories. Each Git repository is treated as a separate, independent package. Dependencies between the repositories are specified in a package file, which is checked into the repository and versioned along with the ECL code. The package file indicates what the dependencies are and which versions should be used.

This approach resolves problems that arise when merging changes from multiple sources into a single repository: incompatible changes, conflicting dependencies, and clashes between modules with the same name. It also ensures that the dependencies between repositories are versioned.

Using Git with HPCC

The --main syntax has been extended to allow compiling directly from the repository.

Consider the following command:

ecl run thor --main demo.main@https://github.com/gituser/gch-demo-d#version1 --server=... 

This command submits a query to Thor via ESP. It retrieves ECL code from the 'version1' branch of the https://github.com/gituser/gch-demo-d repository, compiles the code in the demo/main.ecl file, and then runs the query on Thor. The checkout is done on the remote ECLCC Server rather than on the client machine.
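To make the shape of the --main argument concrete, here is a minimal Python sketch of how such a value could be split into its attribute and repository reference. It is an illustration only, not the parsing code used by the ecl client or the ECLCC Server; the function names split_main_argument and attribute_to_path are hypothetical.

```python
def split_main_argument(main_arg):
    """Split '<attribute>@<repository reference>' into its two parts."""
    attribute, separator, repo_ref = main_arg.partition("@")
    if not separator:
        # No '@' present, so no repository reference was supplied.
        return attribute, None
    return attribute, repo_ref


def attribute_to_path(attribute):
    """Map a dotted ECL attribute such as 'demo.main' to 'demo/main.ecl'."""
    return attribute.replace(".", "/") + ".ecl"


attribute, repo_ref = split_main_argument(
    "demo.main@https://github.com/gituser/gch-demo-d#version1"
)
print(attribute_to_path(attribute))  # demo/main.ecl
print(repo_ref)  # https://github.com/gituser/gch-demo-d#version1
```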

Repository Reference Syntax

The syntax for the reference to the repository is as follows:

<protocol:>//<urn>/<user>/<repository>#version

The protocol and urn can be omitted, in which case a default is used, as in the following example:

ecl run thor --main demo.main@gituser/gch-ecldemo-d#version1 --server=...

This command also submits a query to Thor. It retrieves ECL code from the 'version1' branch of the gch-ecldemo-d repository, compiles the code in the demo/main.ecl file, and then runs the query on Thor.
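The reference syntax can be sketched in a few lines of Python. This is an illustrative approximation, not the platform's implementation: the function name parse_repo_reference is hypothetical, and the default prefix used here is the "https://github.com/" default that this document gives for the --defaultgitprefix option.

```python
DEFAULT_PREFIX = "https://github.com/"  # default prefix named in this document


def parse_repo_reference(reference, default_prefix=DEFAULT_PREFIX):
    """Return (url, version) for '[<protocol:>//<urn>/]<user>/<repository>#version'."""
    location, _, version = reference.partition("#")
    if "//" not in location:
        # Protocol and urn were omitted, so prepend the default prefix.
        location = default_prefix + location
    return location, version or None


print(parse_repo_reference("gituser/gch-ecldemo-d#version1"))
# ('https://github.com/gituser/gch-ecldemo-d', 'version1')
print(parse_repo_reference("https://github.com/gituser/gch-demo-d#version1"))
# ('https://github.com/gituser/gch-demo-d', 'version1')
```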

Version-Text

The version text that follows the hash (#) in the repository reference can take any of the following forms:

  • The name of a branch

  • The name of a tag

    Note: Currently only lightweight tags are supported. Annotated tags are not yet supported.

  • The secure hash algorithm (SHA) of a commit

To illustrate, consider the following commands:

ecl run thor --main demo.main@gituser/gch-ecldemo-d#version1 --server=...

This command will retrieve the demo.main ECL code from the 'version1' branch of the gch-ecldemo-d repository.

ecl run thor --main demo.main@gituser/gch-ecldemo-d#3c23ca0 --server=...

This command will retrieve the demo.main ECL code from the commit with the SHA of '3c23ca0'.

You can also specify the name of a tag utilizing this same syntax.

Checking ECL Syntax

You can use the --syntax option to check the syntax of your code.

The following command checks the syntax of the code in the commit with the SHA of '3c23ca0' of the gch-ecldemo-d repository.

ecl run thor --main demo.main@gituser/gch-ecldemo-d#3c23ca0 --syntax

The following command checks the syntax of the code in the 'version1' branch of the gch-ecldemo-d repository.

ecl run thor --main demo.main@gituser/gch-ecldemo-d#version1 --syntax

Because the code in a branch can be updated at any time, it is a good idea to always check the syntax.

The Package JSON

Consider this package.json file:

{
 "name": "demoRepoC",
 "version": "1.0.0", 
 "dependencies": { 
      "demoRepoD": "gituser/gch-ecldemo-d#version1" 
 } 
} 

The package file gives a name to the package and defines the dependencies. The dependencies property is a list of key-value pairs. The key (demoRepoD) provides the name of the ECL module that is used to access the external repository. The value is a repository reference which uses the same format as the previous examples using the --main syntax.
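As a rough illustration of how the dependencies property is structured, the following Python sketch reads the package file shown above and splits each repository reference at the hash. It only demonstrates the data layout; it is not how the compiler processes package files, and the function name list_dependencies is hypothetical.

```python
import json

PACKAGE_JSON = """
{
  "name": "demoRepoC",
  "version": "1.0.0",
  "dependencies": {
    "demoRepoD": "gituser/gch-ecldemo-d#version1"
  }
}
"""


def list_dependencies(package_text):
    """Map each ECL module name to its (repository, version) pair."""
    package = json.loads(package_text)
    result = {}
    for module_name, reference in package.get("dependencies", {}).items():
        repository, _, version = reference.partition("#")
        result[module_name] = (repository, version)
    return result


print(list_dependencies(PACKAGE_JSON))
# {'demoRepoD': ('gituser/gch-ecldemo-d', 'version1')}
```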

Use the External Repository in your ECL Code

To use the external repository in your ECL code you need to add an import definition.

IMPORT layout;
IMPORT demoRepoD AS demoD;
 
EXPORT personAsText(layout.person input) :=
    input.name + ': ' +
    demoD.format.maskPassword(input.password);

In the above example, the name demoRepoD in the second IMPORT matches the key in the package.json file. This code uses the attribute format.maskPassword from the 'version1' branch of the gituser/gch-ecldemo-d repository.

Each package is processed independently of any others. The only connection is through explicit imports of the external packages. This is why packages can have modules or attributes with the same name and they will not clash.

Multiple Repository Examples

The following ECL code uses two versions of the same repository, imported under different module names.

IMPORT layout;
IMPORT demoRepoD_V1 AS demo1;
IMPORT demoRepoD_V2 AS demo2;
 
EXPORT personAsText(layout.person input) :=
    'Was: ' + demo1.format.maskPassword(input.password) +
    ' Now: ' + demo2.format.maskPassword(input.password);

Note that the demoRepoD_V1 and demoRepoD_V2 packages are processed independently.

Likewise, consider the following package.json file, which defines both dependencies:

{
  "name": "demoRepoC",
  "version": "1.0.0",
  "dependencies": {
    "demoRepoD_V1": "gituser/gch-ecldemo-d#version1",
    "demoRepoD_V2": "gituser/gch-ecldemo-d#version2"
  }
}

Note that the dependencies reference the 'version1' and 'version2' branches of the gch-ecldemo-d repository.
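A package file like this one can be checked before it is committed. The Python sketch below is illustrative only: it parses the file strictly (json.loads raises an error on malformed JSON, such as a missing comma between entries) and groups the dependencies by repository, so that several versions of the same repository are easy to spot. The function name versions_by_repository is hypothetical.

```python
import json
from collections import defaultdict

PACKAGE_JSON = """
{
  "name": "demoRepoC",
  "version": "1.0.0",
  "dependencies": {
    "demoRepoD_V1": "gituser/gch-ecldemo-d#version1",
    "demoRepoD_V2": "gituser/gch-ecldemo-d#version2"
  }
}
"""


def versions_by_repository(package_text):
    """Group (module name, version) pairs under the repository they come from."""
    package = json.loads(package_text)  # raises json.JSONDecodeError if malformed
    grouped = defaultdict(list)
    for module_name, reference in package.get("dependencies", {}).items():
        repository, _, version = reference.partition("#")
        grouped[repository].append((module_name, version))
    return dict(grouped)


print(versions_by_repository(PACKAGE_JSON))
# {'gituser/gch-ecldemo-d': [('demoRepoD_V1', 'version1'), ('demoRepoD_V2', 'version2')]}
```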

Command Line Options

Command line options have been added to the ECL and ECLCC commands to leverage these improvements in working with Git repositories.

Local Development Options

The -R option has been added to the eclcc and ecl commands. Set the -R option to instruct the compiler to use source from a local directory instead of from an external repository.

Syntax:

-R<repo>[#version]=path 

For example:

ecl run examples/main.ecl -Rgituser/gch-ecldemo-d=/home/myuser/source/demod 

This command uses the ECL code for demoRepoD from /home/myuser/source/demod rather than from https://github.com/gituser/gch-ecldemo-d#version1.
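The structure of the -R option can be sketched in Python as follows. This is an illustration of the documented syntax only, not the compiler's own option handling; the function name parse_repo_override is hypothetical.

```python
def parse_repo_override(option):
    """Parse '-R<repo>[#version]=path' into (repo, version, path)."""
    if not option.startswith("-R"):
        raise ValueError("not a -R option: " + option)
    target, separator, path = option[2:].partition("=")
    if not separator or not path:
        raise ValueError("missing '=path' in: " + option)
    repo, _, version = target.partition("#")
    # version is None when the optional '#version' part is left out.
    return repo, version or None, path


print(parse_repo_override("-Rgituser/gch-ecldemo-d=/home/myuser/source/demod"))
# ('gituser/gch-ecldemo-d', None, '/home/myuser/source/demod')
```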

The Verbose Option

The -v option has been improved to provide more verbose output including the details of the Git requests.

You can use the -v option for debugging, for instance, if repositories are not resolving. Issue the command with the -v option, as follows, to analyse the details of the Git requests.

ecl run examples/main.ecl -v -Rgituser/gch-ecldemo-d=/home/myuser/source/demod 

ECL and ECLCC Git Options

These command line options have been added to the ECL and ECLCC commands.

--defaultgitprefix This command line option changes the default prefix that is added to relative package references. The default can also be configured using the environment variable ECLCC_DEFAULT_GITPREFIX. Otherwise, it defaults to "https://github.com/".

--fetchrepos This option determines whether external repositories that have not been cloned locally are fetched. It defaults to true in 8.6.x. It may be useful to set this option to false if all external repositories are mapped to local directories, to verify that they are being redirected correctly.

--updaterepos Updates external repositories that have previously been fetched locally. This option defaults to true. It is useful to set this option to false if you are working in a situation with no access to the external repositories, or to avoid the overhead of checking for changes if you know there aren't any.

ECLCC_ECLREPO_PATH The directory that the external repositories are cloned to. On a client machine this defaults to: <home>/.HPCCSystems/repos (or %APPDATA%\HPCCSystems\repos on Windows). You can delete the contents of this directory to force a clean download of all repositories.
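The defaults described above can be summarised in a short Python sketch. It is an approximation for illustration: the function names are hypothetical, and the precedence shown for the prefix (command line first, then the environment variable, then the built-in default) is an assumption rather than something this document states explicitly.

```python
import os

DEFAULT_GIT_PREFIX = "https://github.com/"


def resolve_git_prefix(cli_prefix=None, environ=None):
    """Pick the prefix that is added to relative package references."""
    environ = os.environ if environ is None else environ
    # Assumed precedence: command line, then environment, then built-in default.
    return cli_prefix or environ.get("ECLCC_DEFAULT_GITPREFIX") or DEFAULT_GIT_PREFIX


def default_repo_directory(home, appdata=None, on_windows=False):
    """Build the default clone directory described for client machines."""
    if on_windows:
        return appdata + "\\HPCCSystems\\repos"
    return home + "/.HPCCSystems/repos"


print(resolve_git_prefix(environ={}))  # https://github.com/
print(resolve_git_prefix("https://gitlab.com/", environ={}))  # https://gitlab.com/
print(default_repo_directory("/home/myuser"))  # /home/myuser/.HPCCSystems/repos
```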

Helm Chart Configuration Options

The following Helm chart values are now supported for configuring the use of Git in HPCC Systems cloud deployments.

eclccserver.gitUsername - Provides the Git user name

secrets.git - Define the secrets.git to allow repositories to be shared between queries, to be able to cache and share the cloned packages between instances.

eclccserver.gitPlane - This option defines the storage plane that external packages are checked out and cloned to.

For example:

eclccserver:
- name: myeclccserver
  #...
  gitPlane: git/sample/storage

If the gitPlane option is not supplied, the default is the first storage plane with a category of git. If there is no such plane, ECLCC Server uses the first storage plane with a category of dll.
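The fallback order can be expressed as a small Python sketch. It is illustrative only: the function name select_git_plane is hypothetical, and the representation of storage planes as dictionaries with name and category keys is an assumption made for the example.

```python
def select_git_plane(eclccserver, planes):
    """Return the plane used for cloned packages, following the documented fallback."""
    explicit = eclccserver.get("gitPlane")
    if explicit:
        return explicit
    # Otherwise use the first plane of category 'git', then the first of category 'dll'.
    for category in ("git", "dll"):
        for plane in planes:
            if plane.get("category") == category:
                return plane["name"]
    return None


planes = [
    {"name": "dll", "category": "dll"},
    {"name": "data", "category": "data"},
]
print(select_git_plane({"name": "myeclccserver"}, planes))  # dll
print(select_git_plane({"name": "myeclccserver", "gitPlane": "git/sample/storage"}, planes))
# git/sample/storage
```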

Security and Authentication

If external repositories are public, such as bundles, then there are no further requirements. Private repositories have the additional complication of requiring authentication information - either on the client or on the ECLCC Server depending on where the source is gathered. Git provides various methods for providing these credentials.

Client Machine Authentication

These are the recommended approaches for configuring the credentials on a local system that is interacting with a remote GitHub.

  • GitHub authentication: Download the GitHub command line toolkit. You can then use it to authenticate all Git access with the following command:

    gh auth login 

    This is probably your best option if you are using GitHub. More details can be found at:

    https://cli.github.com/manual/gh_auth_login

  • ssh key: In this scenario, the ssh key associated with a local developer's machine is registered with the GitHub account. This is used when the GitHub reference is of the form ssh://github.com.

    The ssh key can be protected with a passphrase, and there are various options to avoid having to enter the passphrase each time. For more information see:

    https://docs.github.com/en/authentication/connecting-to-github-with-ssh/about-ssh

  • Personal access token: These are similar to a password, but with additional restrictions on their lifetime and the resources that can be accessed. See the GitHub documentation for details on how to create them. They can then be used with the various Git credential caching options.

    An example can be found here:

    https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage

Generally, for authentication it is preferable to use the https:// protocol instead of the ssh:// protocol for links in package-lock.json files. If ssh:// is used, any machine that processes the dependency must have access to a registered ssh key, which can cause avoidable issues.

All of these authentication options are likely to involve some user interaction, such as passphrases for ssh keys, web interaction with GitHub authentication, and initial entry for cached access tokens. This is problematic for the ECLCC Server, which cannot support user interaction, and it is also preferable not to pass credentials around. The solution, therefore, is to use a personal access token securely stored as a secret. This token can be associated with a special service account, which then securely initiates these transactions. The secret avoids the need to pass credentials and allows the keys to be rotated.

Kubernetes Secrets

This section describes secrets support in the Kubernetes (and bare metal) versions of the HPCC Systems platform.

To add secrets support:

  1. Add the gitUsername property to the eclccserver component of your customization yaml file:

       eclccserver:
       - name: myeclccserver
         gitUsername: gituser
    

    Note: the eclccserver.gitUsername value should match your git user name.

  2. Add a secret to the customization yaml file, with a key that matches the gitUsername:

    secrets:
      git:
        gituser: my-git-secret
    
  3. Add the secret to Kubernetes containing the personal access token:

    apiVersion: v1
    kind: Secret
    metadata:
      name: my-git-secret
    type: Opaque
    stringData:
      password: ghp_eZLHeuoHxxxxxxxxxxxxxxxxxxxxol3986sS=
    

    Note: password contains the personal access token.

  4. Apply the secret to your Kubernetes cluster using the kubectl command:

    kubectl apply -f ~/dev/hpcc/helm/secrets/my-git-secret

    When a query is submitted to the ECLCC Server, any git repositories are then accessed using this configured user name and password.

  5. Optionally, store the secret in a vault. As an alternative, the PAT (personal access token) can be stored inside a vault.
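The relationship between the values in steps 1 to 3 can be checked with a short Python sketch. It models the customization values as plain dictionaries and confirms that each gitUsername has a matching key under secrets.git. This is an illustrative consistency check, not part of the platform; the function name git_secret_names is hypothetical.

```python
values = {
    "eclccserver": [{"name": "myeclccserver", "gitUsername": "gituser"}],
    "secrets": {"git": {"gituser": "my-git-secret"}},
}


def git_secret_names(values):
    """Map each eclccserver name to the Kubernetes secret its gitUsername selects."""
    git_secrets = values.get("secrets", {}).get("git", {})
    result = {}
    for server in values.get("eclccserver", []):
        username = server.get("gitUsername")
        if username is None:
            continue  # this server has no Git user name configured
        if username not in git_secrets:
            raise KeyError("no secrets.git entry for gitUsername " + repr(username))
        result[server["name"]] = git_secrets[username]
    return result


print(git_secret_names(values))  # {'myeclccserver': 'my-git-secret'}
```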

Bare Metal Credentials

This section describes credentials for bare metal systems, which require similar configuration steps.

  1. Add the gitUsername property to the EclCCServerProcess entry in the environment.xml file.

    <EclCCServerProcess daliServers="mydali"
                          ...
                          gitUsername="gituser"
    
  2. Push out the environment.xml to all nodes.

  3. Store the credentials either as secrets or in a vault.

    As secrets:

    Store the access token in:

    /opt/HPCCSystems/secrets/git/<user-name>/password

    For example:

    cat /opt/HPCCSystems/secrets/git/gituser/password
    ghp_eZLHeuoHxxxxxxxxxxxxxxxxxxxxol3986sS=
    

    Or for a vault:

    You can also store the credentials inside a vault. A vault can now be defined within the Software section of the environment. For example:

    <Environment>
     <Software>
       ...
       <vaults>
        <git name='my-storage-vault' url="http://127.0.0.1:8200/v1/secret/data/git/${secret}" 
    kind="kv-v2" client-secret="myVaultSecret"/>
        ...
       </vaults>
       ...
    

    Note that the above entries have exactly the same content as the corresponding entries in the Kubernetes values.yaml file.
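For the file-based secret described in step 3, the path layout can be sketched in Python as follows. This only illustrates the /opt/HPCCSystems/secrets/git/<user-name>/password convention given above; the function names are hypothetical, and this is not the code the platform uses to read secrets.

```python
from pathlib import Path

SECRETS_ROOT = Path("/opt/HPCCSystems/secrets")


def git_token_path(username, root=SECRETS_ROOT):
    """Build <root>/git/<user-name>/password for the given Git user name."""
    return Path(root) / "git" / username / "password"


def read_git_token(username, root=SECRETS_ROOT):
    """Read the access token, dropping any trailing newline left by an editor."""
    return git_token_path(username, root).read_text().strip()


print(git_token_path("gituser").as_posix())
# /opt/HPCCSystems/secrets/git/gituser/password
```

The user name passed in is expected to match the gitUsername value configured in step 1.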