Why SKIPing can be bad for your query
The most common way of filtering out records is to apply a filter to a dataset. E.g.,
Centenarians := myDataset(age >= 100)
However, sometimes the easiest place to decide whether or not a record is required is within the transform that creates it. ECL has the SKIP keyword to indicate that a record shouldn’t be generated from a transform (1). SKIP has two different syntaxes:
1. SKIP as an attribute on a transform
outRecord process(inRecord l) := TRANSFORM,SKIP(l.age >= 100) …
Here the transform doesn’t generate a record if the skip condition is true.
2. SKIP as a side-effect.
outRecord process(inRecord l) := TRANSFORM
    somethingComplex := f(l);
    SELF.id := IF(somethingComplex, l.id, SKIP);
    SELF.next := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, SKIP);
    …
Here the transform doesn’t generate a record if the SKIP expression is evaluated. This is often used when the expression being tested is complex, and the output record is being rejected as an exception.
However, this form also comes with a cost. Because the assignment to the field next has the side-effect of potentially skipping, the code generator cannot optimize the field away, even if it isn’t subsequently used.
Both forms of SKIP can also prevent the transform from being merged with subsequent activities (2).
So what are the alternatives?
For the situations where nothing else will do (e.g., SKIP on a JOIN transform), it is much better to use the first form than the second, because it doesn’t prevent output fields being optimized away. But what if the filter condition is something complex and you don’t want to duplicate the ECL? You can continue to share common definitions by defining a function that returns a transform. E.g.,
process(inRecord l) := FUNCTION
    somethingComplex := f(l);
    outRecord t := TRANSFORM, SKIP(NOT somethingComplex)
        SELF.id := l.id;
        …
    END;
    RETURN t;
END;
You can also implement something similar with a complex test condition – if you can use an exception value for the case that will be skipped:
process(inRecord l) := FUNCTION
    nextValue := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, 0); // 0 isn’t otherwise a valid value.
    outRecord t := TRANSFORM, SKIP(nextValue = 0)
        SELF.next := nextValue;
        …
    END;
    RETURN t;
END;
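To show how a function that returns a transform might be called, here is a minimal self-contained sketch. The record layouts, the inline sample data, and the isCentenarian definition are hypothetical stand-ins for the elided parts of the examples above; only the FUNCTION/TRANSFORM/SKIP shape comes from the text.

```ecl
// Hypothetical layouts and sample data, for illustration only.
inRecord := RECORD
    UNSIGNED4 id;
    UNSIGNED1 age;
END;

outRecord := RECORD
    UNSIGNED4 id;
END;

myDataset := DATASET([{1, 42}, {2, 101}, {3, 99}], inRecord);

// The shared definition lives in the function; the transform it returns
// uses the attribute form of SKIP.
process(inRecord l) := FUNCTION
    isCentenarian := l.age >= 100;   // stand-in for a complex test
    outRecord t := TRANSFORM, SKIP(NOT isCentenarian)
        SELF.id := l.id;
    END;
    RETURN t;
END;

// The function is called wherever a transform is expected.
OUTPUT(PROJECT(myDataset, process(LEFT)));
```

With this sample data, only the record whose age is at least 100 should survive the PROJECT.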
For a PROJECT you are probably better off removing the SKIP from the transform, and filtering the resulting dataset instead. That will allow the filter condition to potentially be migrated through the query (skips are not currently migrated outside a transform).
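As a rough sketch of that approach, the transform below contains no SKIP at all, and the unwanted records are removed by an ordinary filter on the projected dataset. The record layouts and inline sample data are assumptions made up for this example.

```ecl
// Hypothetical layouts and sample data, for illustration only.
inRecord := RECORD
    UNSIGNED4 id;
    UNSIGNED1 age;
END;

outRecord := RECORD
    UNSIGNED4 id;
    UNSIGNED1 age;
END;

myDataset := DATASET([{1, 42}, {2, 101}, {3, 99}], inRecord);

// A plain transform with no SKIP.
outRecord process(inRecord l) := TRANSFORM
    SELF.id := l.id;
    SELF.age := l.age;
END;

projected := PROJECT(myDataset, process(LEFT));

// Filter the result instead of skipping inside the transform.
nonCentenarians := projected(age < 100);

OUTPUT(nonCentenarians);
```

This is equivalent in intent to TRANSFORM, SKIP(l.age >= 100), but the condition is now a normal dataset filter rather than part of the transform.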
In summary: if you need to use SKIP, try to use the first form. If you can replace SKIP with a filter, you are likely to end up with even better code.
(1) In some activities like ITERATE, SKIP can have the subtly different meaning of “don’t process this record”.
(2) A couple of reasons why: If the activities are different then SKIP may mean something different, and when SKIP is used as a side-effect, combining transforms may cause the SKIP to be lost.