Did you really mean SEQUENTIAL?

ECL has two commonly known keywords for grouping together actions – PARALLEL and SEQUENTIAL.  Surely it is fairly obvious which one to use?  Use PARALLEL if you do not mind in which order operations are executed and use SEQUENTIAL if you do.  Unfortunately many people do not realise that SEQUENTIAL can be very bad for your query’s health.

ECL code is generally assumed not to contain any side effects.  For instance, if you count the number of rows in a file, we assume it will always produce the same result wherever in the query it is counted.  However, some ECL code can break this assumption – for example, if you count the rows in a superfile, update the superfile, and then count the rows again.  SEQUENTIAL was originally introduced to help solve this problem.  E.g.

SEQUENTIAL(
    process a superfile,
    spray a new file,
    update the superfile to include it,
    process the superfile
)

To ensure that the second stage of processing the superfile does not use any results from before the superfile was updated, the SEQUENTIAL statement stops anything being shared between the different actions.  No datasets, no computed values, nothing.  For example, for the query

SEQUENTIAL(
    OUTPUT(count(myDataset)),
    OUTPUT(CHOOSEN(myDataset, 10))
);

The query will evaluate myDataset twice.  The effect of not sharing any code can be draconian if the only requirement is for the final actions to be performed in the correct order.

So, in addition to those two commonly known keywords, there is a third, less well-known keyword – ORDERED.  This ensures the listed actions are *executed* in the correct order, but still allows their inputs to be commoned up and evaluated in any order.  So the ECL query

ORDERED(
    OUTPUT(count(myDataset)),
    OUTPUT(CHOOSEN(myDataset, 10))
);

will only evaluate myDataset once.
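
As a fuller sketch of the same pattern, here is a version that could be submitted as it stands.  The record layout, the inline data and the names Rec, sortedRows, RowCount and FirstRows are all invented for illustration; SORT simply stands in for any expensive shared computation.

```ecl
// Hypothetical record layout and inline data, for illustration only.
Rec := RECORD
    UNSIGNED4 id;
    STRING20  name;
END;

myDataset := DATASET([{1, 'alpha'}, {2, 'beta'}, {3, 'gamma'}], Rec);

// Stand-in for an expensive input that both actions depend on.
sortedRows := SORT(myDataset, name);

// The two outputs are performed in the listed order, but because
// ORDERED allows inputs to be shared, sortedRows can be commoned up
// between them rather than evaluated once per action.
ORDERED(
    OUTPUT(COUNT(sortedRows), NAMED('RowCount')),
    OUTPUT(CHOOSEN(sortedRows, 10), NAMED('FirstRows'))
);
```

Swapping ORDERED for SEQUENTIAL in this sketch would give the same results, but the sort would no longer be shared between the two outputs.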

Why should you care?  It matters because the difference between SEQUENTIAL and ORDERED can be dramatic, especially where the actions share large quantities of code.  For one query I saw recently, a global replace of SEQUENTIAL with ORDERED reduced the size of the generated query from 556MB to 76MB and, perhaps more importantly, the compile time dropped from 13 minutes to just under 2!

Even when SEQUENTIAL is being used correctly there are still a couple more items to look out for.  It is worth minimizing the use of SEQUENTIALs whenever possible.  For example, another situation I have seen is

SEQUENTIAL(doA, doB, doC, doD, doE);

where the only real requirement is that doA must be done first, and doE last.  Unfortunately the SEQUENTIAL is also preventing doB, doC and doD from sharing anything.  The following ECL code is likely to be much more efficient:

SEQUENTIAL(doA, PARALLEL(doB, doC, doD), doE);

And, if nothing here depends on SEQUENTIAL's no-sharing guarantee, even better as

ORDERED(doA, PARALLEL(doB, doC, doD), doE);

Finally, if you are repeating a similar process multiple times, it is often much better to structure the ECL so that similar work is done together.  For example, process all the files, spray all the files, update all the superfiles, and then process the updated superfiles, rather than process, spray, update, process for each superfile in turn.
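
The batched structure might look something like the sketch below.  Every action here is a hypothetical placeholder (a plain OUTPUT of a string) standing in for the real processing, spray and superfile-update steps, and only two superfiles, A and B, are shown.

```ecl
// Placeholder actions; in a real job each would be a processing
// step, a spray, or a superfile update.
processOldA := OUTPUT('process superfile A');
processOldB := OUTPUT('process superfile B');
sprayA      := OUTPUT('spray new file for A');
sprayB      := OUTPUT('spray new file for B');
updateA     := OUTPUT('add new file to superfile A');
updateB     := OUTPUT('add new file to superfile B');
processNewA := OUTPUT('process updated superfile A');
processNewB := OUTPUT('process updated superfile B');

// One SEQUENTIAL with four stages, instead of one four-step
// SEQUENTIAL per superfile.  Within each stage the similar work is
// grouped under PARALLEL, so it is free to share code and results.
SEQUENTIAL(
    PARALLEL(processOldA, processOldB),
    PARALLEL(sprayA, sprayB),
    PARALLEL(updateA, updateB),
    PARALLEL(processNewA, processNewB)
);
```

SEQUENTIAL is kept at the outer level here because the final stage must not reuse anything computed before the superfiles were updated, which is the case SEQUENTIAL was introduced for.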

In summary, use PARALLEL whenever possible, use ORDERED when you need actions to occur in a particular order, and only use SEQUENTIAL if it is really required – or of course if you desperately need the time to enjoy a relaxed cup of coffee while your query compiles.

Footnotes:

The only exception to SEQUENTIAL's no-sharing rule is workflow actions (e.g. INDEPENDENT), which are shared and are evaluated before the first sequential action that requires them.

A second alternative to SEQUENTIAL is to use WHEN to trigger an action when another action succeeds.  E.g.

WHEN(processDataset, email('it worked!'), SUCCESS);