Mon Mar 19, 2018 12:19 pm
Login Register Lost Password? Contact Us

An introduction to optimizations

A place for developers to ask questions about contributing to the open source code base

Wed Apr 25, 2012 4:24 pm Change Time Zone

The following is copied from a hpcc-dev post and provides some background on optimizations. It was written with one particular optimization in mind, but should serve as a useful example for discussing some of the issues involved in adding new optimizations to the code generator.

An introduction to optimization

With the introduction of the new operator
DATASET(<count>,transform(COUNTER)) there is an opportunity to use it to optimize some existing code. In particular a construct that has previously been used for generating test data is

simpleRow := DATASET([0], { unsigned value }); values := NORMALIZE(simpleRow, <count>, transform(COUNTER));

This can now be replaced with

values := DATASET(<count>, transform(COUNTER));

Different optimizers
There are two main expression optimizers in the system hqlfold and hqlopt.
Which should the code for the new optimization go in?

hqlfold is used for scalar optimizations. It is also used for optimizations which can be guaranteed to reduce the amount of work. (See example below)

hqlopt is used for dataset optimizations. One key difference is that it keeps track of the number of times an expression is used to stop common expressions becoming duplicated.

For example with the following ECL:

ds1 := DATASET('x', rec, thor);
ds2 := SORT(ds1, fieldx);
r1 := ds2(fieldy == 1);

In general it is more efficient to filter a dataset before it is sorted, so the system contains an optimization to ensure that filters are moved before sorts. However if it is done on this example you would end up with

ds1 := DATASET('x', rec, thor);
ds2 := SORT(ds1, fieldx);
r1 := ds1(fieldy == 1);
r1b := SORT(r1, fieldx);

which has duplicated the sort activity. To avoid this hqlopt keeps track of the number of times each dataset expression is referenced, and only moves filters over sort activities if the sort isn't shared.

[Occasionally some of the complex transforms in hqlfold can cause scalar expressions to become duplicated, but generally hqlfold is relatively conservative, and the examples are few and far between. (Ideally there would be a scalar optimizer which also kept track of the number of times an expression was used to allow some more exotic scalar optimizations.)]

An example of a dataset optimization that IS done in hqlfold is removing a sort of a sort:

ds1 := DATASET('x', rec, thor);
ds2 := SORT(ds1, fieldx);
ds3 := SORT(ds2, fieldy);

Optimizing to...

ds1 := DATASET('x', rec, thor);
ds3 := SORT(ds1, fieldy);

If ds2 is used by another definition then the new graph will have two independent sorts - which could be done in parallel. If it isn't then you have removed a sort from your execution graph. Since neither outcome makes it worse the optimization can be made without checking if the datasets are shared.

Which optimizer?
So for the DATASET() optimization where should it go?

simpleRow := DATASET([0], { unsigned value }); values := NORMALIZE(simpleRow, <count>, transform(COUNTER));
values := DATASET(<count>, transform(COUNTER));

This optimization merges two nodes, so if there is also code that uses simpleRow, you will potentially duplicate its evaluation - once in simpleRow, and once as part of the new dataset operation. That would argue for the code going inside hqlopt rather than hqlfold. However, the cost of evaluating that datarow is unlikely to be very high, and often the transform for the normalize will not make any reference to LEFT - so there will be no actual duplication of simpleRow.

Strictly speaking the optimization should go in hqlfold for cases where no values from the input dataset are used in the transform, or only constant values from the row. The general case could be implemented in hqlopt. In practice duplicating the row is highly unlikely to create extra work, so in this case good enough (and simplest) to always implement in hqlfold. This has the advantage that multiple NORMALIZES of the same input row will be converted (hqlopt would prevent this because the row would be shared).

(Note expression commoning up means it is realtively likely the input row is commoned up across independent attributes.)

The optimization

The code should probably be place inside IHqlExpression * CExprFolderTransformer::doFoldTransformed(IHqlExpression * unfolded, IHqlExpression * original)

Adding a case statement for the no_normalize. There are several different forms of row that we should match:

a) DATASET(ROW(transform));
b) DATASET([transform()]);
c) DATASET([value],{ myfield });

These have the following graph representations:

a) no_datasetfromrow(no_createrow(no_transform(...)))

b) no_inlinetable(no_transformlist(no_transform(...)))

c) Already converted into form (b) when the expression tree is normalized.

The optimization need to check the child graph is of one of those forms, and extract the transform. It then needs to substitute the values from that transform for any references to LEFT.
That is achieved by creating a new graph node
and then calling
replaceSelector(expression, left-expression, new-row-expression) on the NORMALIZE's transform.
(no_newrow indicates that you are replacing a selector (i.e. a cursor onto an active dataset) with an explicit row).

replaceSelector() returns a new transform which can be used as the parameter to the new DATASET(count, transform) operation. The code also needs to inherit the attribute that uniquely identifies the counter from the NORMALIZE.

The other situation that can be optimized (d) is where the input dataset has a single row, and the transform doesn't use any values from LEFT. (E.g., DATASET(GLOBAL(row))).
Using hasSingleRow() and exprReferencesDataset(transform, left) you can determine if that condition is met.

You need to add regression tests for cases a,b,c,d ensuring you have cases which do and don't get transformed.
Community Advisory Board Member
Community Advisory Board Member
Posts: 178
Joined: Wed May 18, 2011 9:48 am

Wed Apr 25, 2012 6:38 pm Change Time Zone

Hi Gavin,

Is the COUNTER operator coming in a later version? I just tried it with the latest release:

Code: Select all

When I syntax check I get:

Error: Unknown identifier "COUNTER" (1, 37), 2167,

Community Advisory Board Member
Community Advisory Board Member
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Thu Apr 26, 2012 8:45 am Change Time Zone

This isn't the correct forum to ask that question - this forum should be restricted to questions about modifying the system source.

P.S. 3.8 is the first version that should include it.
Community Advisory Board Member
Community Advisory Board Member
Posts: 178
Joined: Wed May 18, 2011 9:48 am

Return to Contributors

Who is online

Users browsing this forum: No registered users and 1 guest