File layout resolution at compile time

by Richard Chapman

When reading a disk file in ECL, the layout of the file is specified in the ECL code. This allows the code to be compiled to access the data very efficiently, but can cause issues if the file on disk is actually using a different layout. In particular, it can present a challenge to the version control process, if you have ECL queries that are being changed to add functionality, but which need to be applied without modification to datafiles whose layout is changing on a different timeline.

We have had a partial solution to this dilemma available in Roxie for index files for a while, with the ability to apply runtime translation from the fields in the physical index file to the fields specified in the index. However, it carried significant potential overhead and was not available for flat files or on Thor. Until now…

A new feature, coming soon in the HPCC Systems 6.4.0 release, allows file resolution to be performed at compile time, which provides the following advantages: 

  1. Code changes can be insulated from file layout changes – you only need to declare the fields you actually want to use from a datafile.
  2. File layout mismatches can be picked up earlier.
  3. The compiler can use information about file sizes to guide code optimization decisions. 

There are two language constructs associated with this feature. Firstly, a DATASET declaration can be given a LOOKUP attribute to indicate that the filename should be looked up at compile time: 

myrecord := RECORD
  STRING field1;
  STRING field2;
END;

// Without LOOKUP: this will fail at runtime if the file layout does not match myrecord
f := DATASET('myfilename', myrecord, FLAT);

// With LOOKUP: this will automatically project from the actual to the requested layout
f := DATASET('myfilename', myrecord, FLAT, LOOKUP);

If we assume that the actual layout of the file on disk is:

myactualrecord := RECORD
  STRING field1;
  STRING field3;
END;

Then the effect of the LOOKUP attribute will be as if you had typed: 

actualfile := DATASET('myfilename', myactualrecord, FLAT);
f := PROJECT(actualfile, TRANSFORM(myrecord, SELF := LEFT; SELF := []));

Fields that are present in both record structures are assigned across; fields that are present only in the disk version are dropped; and fields that are present only in the ECL version receive their default value (a warning is issued in this last case).

The LOOKUP attribute can be given a parameter (true or false) to allow easier control of where and when you want this translation to be done. Any boolean expression that can be evaluated at compile time can be supplied. There is also a workunit option (translateDFSlayouts) that can be used to turn translation on by default for all files. In that case, you may want to use LOOKUP(false) to override the default on some specific datasets.
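
As a minimal sketch of how these controls might be combined: the use of #OPTION to set the workunit option is an assumption (it is the usual ECL mechanism for workunit options), and the names translateThis, fTranslated, fUntranslated and 'myotherfilename' are hypothetical, chosen only for illustration.

```ecl
// Assumption: translateDFSlayouts is set like any other workunit option, via #OPTION.
#OPTION('translateDFSlayouts', TRUE);

myrecord := RECORD
  STRING field1;
  STRING field2;
END;

// Any boolean expression that can be evaluated at compile time can be passed to LOOKUP.
BOOLEAN translateThis := TRUE;  // hypothetical compile-time constant
fTranslated := DATASET('myfilename', myrecord, FLAT, LOOKUP(translateThis));

// Opt one specific dataset out of the workunit-wide default.
fUntranslated := DATASET('myotherfilename', myrecord, FLAT, LOOKUP(FALSE));
```

With this arrangement, the default applies everywhere except where a dataset states its own preference explicitly.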

This feature takes care of the reading side of things and should make life a lot easier for scenarios where file layouts and ECL code have been tricky to keep in sync. 

But what about those fields that were present in the original and were dropped? What if I want to write out a file that matches the layout of an existing file, but I don’t know exactly what that layout is? 

That’s where the second new language construct comes in.

The LOOKUP attribute can also be used in the RECORDOF function, taking a filename rather than a dataset. The result will be expanded at compile time to the record layout stored in the named file’s metadata. There are several forms of this construct:

RECORDOF('myfile', LOOKUP);
RECORDOF('myfile', defaultstructure, LOOKUP);
RECORDOF('myfile', defaultstructure, LOOKUP, OPT);

You can also specify a DATASET as the first parameter instead of a filename (a syntactic convenience) and the filename specified on the dataset will be used for the lookup.

The defaultstructure is useful for situations where the file layout information may not be available (e.g. when syntax-checking locally, or creating an archive), or when the file being looked up may not exist (this is where OPT should be added). The compiler will check that the actual record structure retrieved from the distributed file system lookup contains all the fields specified in the default structure, and that their types are compatible.
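
For instance, the DATASET form and the OPT form might be used along these lines. This is a sketch only: the attribute names (idLayout, stub, fullLayout, optLayout, maybeFile) and the file name 'mymaybemissingfile' are hypothetical, and it assumes that any file that does exist carries layout metadata.

```ecl
// Minimal layout: the only field this code relies on.
idLayout := { STRING id };

// DATASET form: the filename declared on the dataset is used for the lookup.
stub := DATASET('myinputfile', idLayout, FLAT);
fullLayout := RECORDOF(stub, LOOKUP);

// Filename form with a default structure; OPT is for a file that may not exist.
optLayout := RECORDOF('mymaybemissingfile', idLayout, LOOKUP, OPT);
maybeFile := DATASET('mymaybemissingfile', optLayout, FLAT, OPT);
```

Keeping the default structure down to the fields the code actually uses keeps the compile-time compatibility check as loose as possible.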

For example, to read a file whose structure is unknown other than that it contains an ID field, and create an output file containing all records that matched a supplied value, you could write:

myfile := DATASET('myinputfile', RECORDOF('myinputfile', { STRING id }, LOOKUP), FLAT);
filtered := myfile(id = '123');
OUTPUT(filtered,,'myfilteredfile');

Some nuts and bolts

  • The new syntax has been designed so that it is not necessary to perform file resolution in order to be able to syntax-check or create archives, which is important for local-repository mode to work.
  • There are some new parameters to eclcc that can be used if you want to use this functionality for local compiles:
    Parameter        Description
    -dfs=ip          Use specified Dali IP for filename resolution
    -scope=prefix    Use specified scope prefix in filename resolution
    -user=id         Use specified username in filename resolution
    -password=xxx    Use specified password in filename resolution (blank to prompt)

    All of this should be taken care of automatically when using eclccserver or eclserver.

  • Foreign file resolution should work the same way – just use the standard filename syntax for foreign filename resolution.
  • The LOOKUP attribute can be used on INDEX declarations as well as on DATASET declarations. When using the RECORDOF form and supplying a default layout, you may need to use the => form of the record layout syntax to specify both keyed and payload fields in the same record.
  • Files that have been sprayed rather than created via prior ECL jobs may not have record information available in the distributed file system.