Introduction… optimizing field usage

This blog is going to be a bit of an eclectic mix. It will cover the background behind some of the features being added to the platform, what to watch out for when using ECL, and maybe a bit more detail about what goes on under the covers.

The first topic I want to look at is the way fields are optimized, and the motivation behind a new optimization in 4.2….

One of the optimizations that the ECL code generator (eclcc) performs is to reduce the number of fields being processed at each stage to a minimum. Say you have a dataset that contains names, ages and extra details:

nameRecord := RECORD
STRING name;
UNSIGNED1 age;
STRING address;
STRING details;
END;
EXPORT namesTable := DATASET('names', nameRecord, FLAT);

If you want to find the name of the oldest person, you could do it as follows:

sortedNames := SORT(namesTable, -age);
oldest := CHOOSEN(sortedNames, 1);
OUTPUT(oldest, { name });

If the platform executed this exactly as written, it would read the address and details fields from disk, retain them in memory for the sort, and throw them away at the end. Instead, eclcc tracks which fields are required at each stage of the query and uses that information to minimize the record size at each stage. It translates the previous code into the equivalent of the following query:

// the following projection is combined with the disk read
projectedNames := TABLE(namesTable, { name, age });
sortedNames := SORT(projectedNames, -age);
oldest := CHOOSEN(sortedNames, 1);
OUTPUT(oldest, { name });

(Actually it does better than that, but that’s a different topic… try it and see.)

The field projection is performed as part of the activity that reads the data from the disk – which avoids copying and retaining a large amount of data.
For this simple case the ECL programmer could do the work by hand, but why waste the effort when it will be done for you? The real power comes when you start building up a complex library of definitions. Often the fields can only be removed because of the particular combination of definitions that a query uses.
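
As a small illustration of that last point, here is a sketch of two hypothetical queries built on the namesTable definition above, run separately. The name pensioners and the age threshold are made up for the example. Neither definition says which fields to keep; that depends on what each query ends up using:

// Hypothetical shared definition: a filter on age.
pensioners := namesTable(age >= 65);

// Query 1: only name and age are used, so address and details need never be read.
OUTPUT(pensioners, { name });

// Query 2: address is used as well, so only details can be dropped.
OUTPUT(pensioners, { name, address });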

Now let us modify our example a bit. Instead of storing the data in one file, the data is now stored in two flat files which are joined to get the information:

rawNameRecord := RECORD
UNSIGNED id;
STRING name;
UNSIGNED1 age;
END;

extraRecord := RECORD
UNSIGNED id;
STRING address;
STRING details;
END;

rawNamesTable := DATASET('names', rawNameRecord, FLAT);
extraTable := DATASET('extra', extraRecord, FLAT);
EXPORT namesTable := JOIN(rawNamesTable, extraTable, LEFT.id = RIGHT.id, LEFT OUTER);

A LEFT OUTER join is used because the extra information is optional, and only present in the extra table if it is provided. This exported definition of namesTable contains the same information (with an extra id field), but the way it is created is different.

After eclcc has optimized the field usage, it becomes something equivalent to:

projectedExtraTable := TABLE(extraTable, { id });
EXPORT namesTable := JOIN(rawNamesTable, projectedExtraTable, LEFT.id = RIGHT.id, LEFT OUTER);

That is quite good, but in this case we don’t actually need to do the join (since we’re not using any of the information that the join fills in). The problem is that in general a JOIN generates an output record for each right record that matches a record from the left. So the JOIN cannot be removed automatically because that may change the number of times each record from the left is duplicated. To remove it the code generator needs a hint that there can be at most one match.
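
To make the duplication concrete, here is a minimal sketch using inline datasets. The record layouts, names and values are invented for the illustration and are not part of the original example:

// Illustrative data only: id 1 has two rows on the right, id 2 has none.
personRec := { UNSIGNED id, STRING name };
extraRec := { UNSIGNED id, STRING address };

people := DATASET([{1, 'Ann'}, {2, 'Bob'}], personRec);
extras := DATASET([{1, 'Home'}, {1, 'Work'}], extraRec);

joined := JOIN(people, extras, LEFT.id = RIGHT.id, LEFT OUTER);

OUTPUT(COUNT(people)); // 2
OUTPUT(COUNT(joined)); // 3: Ann matches twice, Bob is kept once by LEFT OUTER

Even a query that only uses name would see Ann twice, so dropping the join here would change the result.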

And so finally to the new optimization in 4.2…. Adding ATMOST(1) onto the JOIN definition gives it the hint that it needs.

EXPORT namesTable := JOIN(rawNamesTable, extraTable, LEFT.id = RIGHT.id, LEFT OUTER, ATMOST(1));

With this in place eclcc knows it can remove the entire join, and reduces it to the equivalent of

EXPORT namesTable := TABLE(rawNamesTable, { name, age });

Its effect can be even more drastic on complex graphs (as an exercise, imagine what COUNT(sortedNames) can be reduced to), and I have seen a significant effect on some queries.

This is actually a common situation when you are combining data from multiple sources. If each left record matches either zero or one right records, include ATMOST(1) on your join. As well as documenting the intention for other ECL programmers, you may find whole chunks of your queries disappearing because they’re not needed, especially once you upgrade to version 4.2.
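
As a rough sketch of that advice, the following uses invented inline data in which each id appears at most once on the right. The names and values are assumptions made for the illustration:

// Illustrative data only: at most one extras row per id.
personRec := { UNSIGNED id, STRING name };
extraRec := { UNSIGNED id, STRING address };

people := DATASET([{1, 'Ann'}, {2, 'Bob'}], personRec);
extras := DATASET([{1, '1 High Street'}], extraRec);

withExtras := JOIN(people, extras, LEFT.id = RIGHT.id, LEFT OUTER, ATMOST(1));

OUTPUT(COUNT(withExtras)); // 2, the same as COUNT(people)
OUTPUT(withExtras, { name }); // uses nothing from extras, so the join is a candidate for removal

Because every row of people produces exactly one output row, a query that uses only fields from the left side gives the same result with or without the join.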

(See issue HPCC-10149 if you want more details of the optimization.)