Skip to main content

See the Introduction for this article here.

Previous chapter: Step 2: The Distributed Flag, and Execution Tests.

This step is based on the following commit:

Git commit: Optimise NORMALIZE to DATASET
https://github.com/hpcc-systems/HPCC-Platform/pull/2314/files

Recapitulating, we have created a new syntax that can run distributed in any number of nodes of the cluster. This is good, but not many people will be using it straight away, and there's a lot of old code that will still be performing sub-optimally on similar tasks that we can help speed up. Such is the case of NORMALIZE.

So, we start by coding in ECL all the things that we'd like to see transformed into the new syntax and all the things that we know we can't (for any reason) and put it into a test case ('ecl/regress/normalize-dataset-opt.ecl').

As you can see, we actually created two files, one for the compiler regression [1] and one for the engine regression suite[2], which should cover the basics on both fronts. The results key file was created after all changes were done and verified to be correct, and executed once to create the output file.

Where to Change?

As mentioned before, there are two main files that handle optimisation in the ECL compiler: 'hqlfold.cpp', which mainly works with simple in-place transformation to fold complex expressions into simpler ones, and 'hqlopt.cpp', which does more complex graph analysis to modify code based on ECL semantics. Our change is simple, and should go into 'hqlfold.cpp'.

The entry point for the folding is 'foldHqlExpression()', which creates an 'CExprFolderTransformer' and call 'transformRoot', which will walk the graph and find all possible ways of folding the child expressions. There are numerous possible paths for the code, depending on which type of expression it is and in which order and format they appear, but ultimately, all of them will call one common method, 'CExprFolderTransformer::doFoldTransformed'.

This method contains a big switch statement with all possible types of nodes with an implementation for each one of them. If we add our node here, wherever they appear in the ECL graph, the code will get here and hopefully folded.

The Change

Since we're looking to fold 'no_normalize' into 'no_dataset_from_transform', we have to introduce a case for no_normalize. In this case, we need to think of all possibilities, which to avoid, and how to fold the rest.

One thing we want to avoid is folding NORMALIZE that uses datasets with more than one row as inputs. That's because our new syntax doesn't account for that, so we'd have to come up with an implementation semantically equivalent, and that would complicate things (and invalidate our previous changes).

Another change we don't want to think of (at least for now), is the kind of NORMALIZE where the counter of the transform refers to the dataset's LEFT. That would make it harder to predict (at compile time) what kind of dependencies are there between the source and the sink.

if (!hasSingleRow(ds) || exprReferencesDataset(count, left))
break;

Now, if the transform in the NORMALIZE has no mention of a dataset (LEFT), our task is ludicrously simple:

From:
NORMALIZE(,10,transform(COUNTER))
To:
DATASET(10,transform(COUNTER))

We achieve that by removing the first child of the NORMALIZE and transforming the node into a DATASET:

First, we put the children into an array, starting from the second one (1):

HqlExprArray args;
unwindChildren(args, expr, 1); // (count, trans)

Then, we return a dataset of type 'no_dataset_from_transform':

return createDataset(no_dataset_from_transform, args);

Complications

If the transform does reference the dataset, we need to be careful. Since we're sure the dataset is not bigger than one line, and we're sure that the dataset is not too complex (the early aborts above), we can replace the LEFT references in the transform by the fields in the dataset. So, 'LEFT.foo' becomes 'ds[1].foo', or the value from the first row.

Remember that LEFT references are like COUNTER references, in so much as they point to a source. COUNTERs point to number sources while LEFT (and RIGHT) point to a data source in a remote dataset or transform.

So all we need to do is to update the reference to point directly to the source instead of its pointer. It's much like collapsing a pointer-to-pointer to a direct pointer to the object.

This will tell you if the transform references the dataset's selector:

IHqlExpression * transform = expr->queryChild(2);
OwnedHqlExpr left = createSelector(no_left, ds, querySelSeq(expr));
if (exprReferencesDataset(transform, left))

Now you need some ECL knowledge to know what to expect. (Remember, you can always ask on the list or rely on the code review process). Some examples of datasets that will trigger this transformation are:

DATASET(ROW(transform))
DATASET([transform()])
DATASET([value],{ myfield })

The first one is a 'no_datasetfromrow' and the two last are 'no_inlinetable'.

A dataset from row, as expected, has the row directly linked to the dataset as the first child. So, simply adding that row as the source for the new transform selector is enough:

case no_datasetfromrow:
IHqlExpression * row = ds->queryChild(0);
if (row->getOperator() == no_createrow)
newRow.set(row);
break;
}

The second syntax accepts a list of transforms, so we need to be careful. And once we have the list, we need to create a row from which to pick the selector:

case no_inlinetable:
{
IHqlExpression * transformList = ds->queryChild(0);
newRow.setown(createRow(no_createrow, LINK(transformList->queryChild(0))));
break;
}

The LINK here is a Link-Counted smart pointer. You need to be careful with those but we won't cover its usage in this tutorial.

Finally, we may have (or not) a new row in hand. If we do, we change the selector from the original transform for this new row and discard the original dataset and transform:

if (!newRow)
break;
OwnedHqlExpr replacementRow = createRow(no_newrow, LINK(newRow));
newTransform.setown(replaceSelector(transform, left, replacementRow));

In essence, we have created a new transform that does not reference the original dataset, but its own selector is the selector of the dataset itself. Now, we just need to replace the old transform by this one when creating the new dataset syntax.

if (newTransform)
args.replace(*newTransform.getClear(), 1);

Note that there are lots of 'break's in the code and we check if we actually created new constructs all the time. These checks will abort on the slight chance, and make sure we don't optimise something that we weren't expecting.

In the end, the new expression will only be returned if we're sure it's correct. If not, it'll just return the same expression, so the caller can compare if it's the same (ie. not optimised) and continue.

Side Effects

When changing the compiler, and trying to run the regression suite, you'll see that sometimes the compiler fails on some files, which is easy to fix, but sometimes it'll produce slightly different results, which can be very hard to fix. It's often the case that one small change triggers another bigger, that itself triggers another and so on.

Luckily, in our case, there wasn't a big change, but we found that some PROJECTs weren't cleaning unused fields in a DATASET when that node type was 'no_dataset_from_transform'.

A quick look on the compiler files, and some questions on the list, gets us to 'hqliproj.cpp', which does the project transformations. The core function is 'ImplicitProjectTransformer::createTransformed'.

If you remember that our dataset is a source (not a transformation), we need to look for nodes similar to ours. Most of the 'activityKind's in the main switch are related to transformations, or sinks or records. There is one, however that is called 'CreateRecordSourceActivity', deals with 'no_inlinetable' and has the following comment:

//Always reduce things that create a new record so they only project the fields they need to

Bingo! But looking at the code, you see that the inline table has a list of transforms, while our dataset has only a transform. So we need to do the same thing with no_dataset_from_transform, but getting a transform directly instead.

Also, we need to make sure nodes of type 'no_dataset_from_transform' get marked correctly as 'CreateRecordSourceActivity', and for that, we find where it is set for 'no_inlinetable', which is 'ImplicitProjectTransformer::getProjectExprKind'.

Final testing

With these changes, the un-optimised PROJECTs end up again optimised, and all looks good. Run the compiler test in Thor or Roxie and see the output. If it makes sense, and you got all the corner cases you could think of, record the output as a key file and add it to the regression test (both ECL and XML files), and it's time to submit a pull request.

References

See reference [1] and [2] here.

Follow-up

The next step on the tutorial is Step 4: Inlining.