Compiler Tutorial, Step 1: The Parser, The Expression Tree and the Activity

See the Introduction for this article here.

This part of the tutorial refers to the commit bellow:

Git commit: DATASET (N, transform(COUNTER))

This step has many sub-steps, but since we must keep consistency, we can’t just add features in one place (say, the parser) without adding it to the rest. So, this pull request changes all places needed to add a new syntax, although a syntax that doesn’t add new activities (so, no changes in the engines).

Feel free to have a peak at the code now and realise that there are not many changes to do. The problem, however, is to know where and how to change. At hindsight, it always looks simple.

The Parser

Finding the Place

The parser file, as any Yacc file, is very hard to read, understand the connections between tokens and even harder to make changes without breaking it. Anyone that has even tried to change a Yacc file will tell you the same, so don’t worry if the first 10 attempts to change it fails miserably.

The first step is to add it to the Parser, so new ECL can be transformed into an expression tree inside the compiler. To do so, we must open the ‘hqlgram.y’ file and look for similar syntax.

Since ECL syntax, like in most languages, uses parenthesis to aggregate arguments, you have to look for “DATASET ‘(‘” pattern to see where the other DATASET constructs are declared. Remember, we’re trying to find similar syntax to be able to reuse or copy code from other constructs to our own, so the closer we get to what we want, the better.

The first matches (from the top of the file) won’t give you the right place. Since DATASET() has many uses, we need to skim over the false-positives. Ex:

DATASET '(' recordDef ')' only declares a potential dataset from a record
DATASET '(' dataSet ')' same, but using another dataset's record
VIRTUAL DATASET '(' recordDef ')' nothing in there...
DATASET '(' recordDef childDatasetOptions ')' only adding a field...

Then, around line (8170), you get to:

DATASET '(' thorFilenameOrList ',' dsRecordDef ...

That's big, ugly and doing a lot of things, lets see what it is.

thorFilenameOrList is an expression, that's promising...

Initially, an 'expression' was used, but 'thorFilenameOrList' (maybe not the best name) can cope other features (for example, initialization with lists of rows, file names and expressions), that removes the ambiguity of the parser.

So that's a good place to start looking at the code inside and see what it does. If you look at each implementation, you'll see how they manipulate data and how they convert the ECL code into the abstract syntax tree (AST). Example:

origin = $3.getExpr();
attrs = createComma(createAttribute(_origin_Atom, origin), $8.getExpr());

The third token gets into the 'origin' expression, 'attrs' is a "comma" (ie. a list of expressions) containing an "origin" Atom referencing 'origin' and the 8th token in the list.

Atoms are identifiers, or known properties. That code is saying that it knows a thing called "origin" and the associated expression is 'origin'. This is used by the tree builder and optimiser later on to identify features, filter and common up expressions based on those attributes.

Adding a new DATASET

Now it's time to think what we need to do. We need to create a dataset, from a transform, using a counter. Since you've got a counter to handle, we need to think that are the consequences of having one. Go to the language reference doc and look for other constructs that have counters, and you'll find many, for example, NORMALIZE. Searching for "NORMALIZE '('" in the parser file, you'll find a curious token:

beginCounterScope / endCounterScope

Scopes of counters are necessary when you have nested transforms and each one has its own counter. You can't mix them, or you'll get wrong results (ie. an internal counter being updated by an external increment). You'll also see that the "counter" object is bound to the end of scope token.

A counter is not a number, but a reference to a number source. In our case, the "count" is the number source, which can be anything from a constant to an expression to a reference to an external object. Following the NORMALIZE case, you get to something like this:

DATASET '(' thorFilenameOrList ',' beginCounterScope transform endCounterScope

That represents a DATASET(expr, TRANSFORM(COUNTER)). Which can be implemented as:

IHqlExpression * counter = $7.getExpr(); // endCounterScope
if (counter)
counter = createAttribute(_countProject_Atom, counter);

As you have seen in the NORMALIZE case, you must name it a "counter", by creating an attribute with a count Atom to it. The name "countProject" is maybe not the best, though.


With this you create a dataset, naming the operation "no_dataset_from_transform" (that needs to be added to the list of operations, you'll see later), with the counter source expression ($3) and a "comma" with the transform ($6) using a counter. The comma is necessary because the createDataset() accepts a list of operands.

If you follow the NORMALIZE and other DATASET examples, you'll see that you also need to normalise the numeric input, to make sure it's a valid integer:

parser->normalizeExpression($3, type_int, false);

and update the parser's context position to the current expression:


However, just adding that code created ambiguities on the other DATASET constructs. This is because of the counter object conflicting with other declarations. One way to fix it (yacc style) was to add the counter object to all conflicting constructs and introduce an error message. So, we're moving the fact that counters are not meaningful on those other constructs, from a syntax error to a semantic error. Check the pull request on 'hqlgram.y' for more information on that.


Now that we have a node in our AST being built by the parser, we need to support it. The first thing you do is to the node operator's enum, in 'hqlexpr.hpp'. Find the first 'unusedN' item and replace it with your own:


Unused items were once real nodes that got removed from the language. To keep backward compatibility with old compiled code, we keep the same order and id of most of our enums. Adding a new node at the end would also work, but would leave a huge trail of unused ones in the middle.

You can now try to compile your code and execute the compiler regression[1] to make sure you haven't introduced any new bug. If all is green, create an example of what you expect to see in ECL:

r t(unsigned value) := transform
SELF.i := value;
ds := DATASET(10, t(COUNTER));

This should create a dataset with items from 1 to 10. If you run this example through your code, you'll see that it'll fail in multiple places. That's because the compiler has no knowledge of what to do with your node. One way to do that is to keep running that code through your compiler and fixing the failures (ie. adding no_dataset_from_transform cases to switches with appropriate code).

What is appropriate code? Well, that varies. If you see in the pull request, there are many places where the new node was added, and each had a different case. Examples:

@ bool definesColumnList(IHqlExpression * dataset)
case no_dataset_from_transform: return true;
@ const char *getOpString(node_operator op)
case no_dataset_from_transform: return "DATASET";
@ childDatasetType getChildDatasetType(IHqlExpression * expr)
case no_dataset_from_transform: return childdataset_none;

How do you know these things? Normally, it's either obvious or very hard to answer. Putting the wrong value, in this case, won't do much for the other constructs (since we're restricting to no_dataset_from_transform), but it may give you the wrong sensation of success. It might work for the cases you have predicted, but it might fail for others, or simply be wrong.

Our suggestion is to use a mix of trying for yourself and asking on the list, but rest assured that, if you ever get right results with bad code, we should be able to spot it on pull request reviews. Adding comments to the code, to make that task easier is a work in progress.

One quick example of ECL's AST folding (more to come on next step), is to fold null datasets. In the pull request above, check the file 'hqlfold.cpp'. It's checking for a zero sized count (via 'isZero', which checks if the expression evaluates to zero, not if the value is a constant zero), and replaces it with a null expression (via 'replaceWithNull').

The Activity

The AST will be translated into activities. These activities are coded in C++ by the compiler's back-end, collected into a driver that will execute the graphs in order and compiled again to a shared object. Each shared object is a workunit, that are executed in the HPCC engines (Roxie, Thor).

Each activity has a helper, and that's the class you'll have to implement from the AST node, so the activities in the engines can use it to execute your code. You need to find out what existing activity (if there is one) maps to the same functionality as your node. In our case, we're very lucky that there is one activity perfect for this job, but the question is, how to find it?

All activities implement the IHThorActivity interface, so you can consult the 'ecl/hthor/hthor.hpp' file and list all base classes of that interface (aka. pure virtual struct). I assume your IDE does that for you, if not, 'hthor.cpp' and 'hthor.ipp' will give you plenty of material to read.

If you look thoroughly enough, or ask an experienced developer, you'll find out that 'CHThorTempTableActivity' is the activity you're looking for. The reason being that this activity builds a new dataset from scratch (ie. it's a source) and it does so in a simple way. This activity is generated by the compiler and also the syntax:

DATASET(my-set, { single-fielded-record }).

If there is no activity that could possibly be mapped to this new AST node, you will have to create one and make all engines implement it. Choosing which one to derive from follows the same logic described above, though.

Now that we know what activity we're aiming towards, we'll try to export our AST node to it, so it'll automatically be executed by all engines. To do so, you must add a new 'BuildActivity' in the ECL-to-CPP translator. The translator is implemented by the class 'HqlCppTranslator' in 'ecl/hqlcpp/hqlhtcpp.cpp'.

You need to add a hook in 'buildActivity' to call your function when the case is a no_dataset_from_transform. Following the convention in the rest of the class, we called it 'doBuildActivityCountTransform'. Your new function will probably accept the context and the expression node.

Building an Exporter

An exporter function will generate a helper class, that derives from your base object implemented from the interface named above. Our activity's helper is an 'IHThorTempTableArg' which is partially implemented by 'CThorTempTableArg'.

Your class will have to base 'CThorTempTableArg' to re-use its generic methods and to have the link-counted logic that all classes do. The main methods that 'CThorTempTableArg' haven't implemented from 'IThorTempTableArg' are:

size32_t getRow(ARowBuilder & rowBuilder, unsigned row);
unsigned numRows();

'numRows' will return the number of rows, so the activity can stop at the right time or allocate the right amount of memory beforehand. 'getRow' will return the next row in line, and in our case, update the internal counter.

So, first we need to create a "TempTable" activity instance:

Owned instance =
new ActivityInstance(*this, ctx, TAKtemptable, expr,"TempTable");

and override the functions we need. See that we're passing the context (where in the resulting C++ file we're writing to) to the activity, so our instance has its own context. From now on, we have to re-use the context of the activity to write any code in, otherwise it'll be written on the outside of the class and we would get a compilation error.

A simple function is to check if the activity is constant. Normally it is, for temp tables, so the default behaviour is true. But in our case, the transform might not be constant (depend on external factors which we can't predict), so we need to change it accordingly:

// bool isConstant() - default is true
if (!isConstantTransform(transform))
doBuildBoolFunction(instance->startctx, "isConstant", false);

See that we're using a 'doBuildBoolFunction' function, which is a wrapper to create a new function that returns a boolean from the result of the expression, in our case, "false".

This means, we only override the function on those activities that really need and that we know at compile time. If we didn't know it, we would have to pass an expression to be evaluated, and override that function every time. Of course, compile-time evaluation is always better, so in this case, we use it.

Overriding 'numRows' is also simple, just by passing the count (whether an expression or a constant), and the wrapper will take care of the rest:

// unsigned numRows() - count is guaranteed by lexer
doBuildUnsignedFunction(instance->startctx, "numRows", count);

However, 'getRow' is somewhat more complicated. We have to build an expression to account for the counter's increment, but also make sure we keep it attached to the counter object (that, if you remember, is just a reference to a number), so the transform can use it. We have to do the function building ourselves.

Start a new context (inside the activity's context):

BuildCtx funcctx(instance->startctx);

Add a function declaration (this could be more automated):

funcctx.addQuotedCompound("virtual size32_t getRow(ARowBuilder & crSelf, unsigned row)");
ensureRowAllocated(funcctx, "crSelf");

And bind the cursor's selector to the counter object ('row' is the current id):

BoundRow * selfCursor = bindSelf(funcctx, instance->dataset, "crSelf");
IHqlExpression * self = selfCursor->querySelector();
associateCounter(funcctx, counter, "row");

And finally, build a transform's body using the bound counter (self):

buildTransformBody(funcctx, transform, NULL, NULL, instance->dataset, self);

With this, your class will be exported as overriding a CHThorTempTableActivity's helper (CThorTempTableArg), so whenever that node of the graph is executed in any engine, the workunit (shared object calling graphs to execute) will pass it to the engine, which will use your methods to build a dataset.

The Test

Now that we have the basic functionality working, we need to make sure we will be able to handle all new cases we're considering (and hopefully make sure we fail where we should fail). To do that, add a test case with the possible syntax you'll expect to work, and one with the things you expect will fail (note that the pull request mentioned doesn't have this!).

If all of them compile, you're on the right track. But you have to investigate the generated C++ code to make sure it's doing what you expect it will. Since we are only re-using an activity, the activity you generate has to be similar to other implementations of the same activity. To check that, check the other files that implement the 'CHThorInlineTableActivity' and compare.

You need to be careful with the exact code. We want 'numRows' to reflect the uncertainty passed via ECL, on stored variables, for example. We want the 'getRow' to make sure it's updating the counter every time it returns a row and that it does so at the right time. See:

Git commit: Dataset count range starts at 1

The pull request above fixes a bug where the original code assumed the rows started at zero, when in ECL they actually start at one.

If you are lucky enough to not have to add activities with your changes, you can also add tests to the regression suite[2] to make sure the output of your new class is in sync to what you expected in the first place, possibly using the same (or a similar) test as you used in the compiler regression.

Once you're happy with the output, no other test fails and the failures on your new construct are being caught on your negative test, you're ready to submit this pull request.


[1] The compiler regression suite is a diff-based comparison, where you run it with a clean top-of-tree version (of the branch you're targeting to) and run again with your changes and diff the results (logs, XML of intermediate code and resulting C++ files).

A tutorial on how to run the regressions and how to interpret the differences (which can be daunting, sometimes) is on the way of being created. Please, refer to 'ecl/regress' directory and the regression scripts '' or 'regress.bat' in it for more information in the meantime.

[2] The regression suite is a set of tests that compiler and execute code on all three Engines (Thor, Roxie and HThor). Please refer to 'testing/ecl' directory for more information.

A tutorial on how to run the regression suite (not the same as the compiler regressions above) should take a while, since the underlying technology is changing. Ask on the mailing list for more info.


The next step on the tutorial is Step 2: The Distributed Flag, and Execution Tests.