Compiler Tutorial, Introduction

This tutorial will walk you through adding a new feature to the compiler, making sure it executes correctly in the engines, and performing some basic optimisations such as replacing and inlining expressions.

When adding features to the compiler, there are two main places where you have to add code: the compiler itself (the parser, the expression builder and the exporter) and the engines (Roxie, Thor and HThor), including the common graph node representation.

You need to make sure all possible variations of your new construct work, not only by themselves but in conjunction with other features of ECL, by creating exhaustive tests in both the compiler and regression suites.

Finally, we’ll see how to add flags, optimise other queries into your new construct, and allow it to be exported inline.

The aim of this text is to become a PDF document guiding people who are changing the ECL compiler, but I have decided to post it in full on the blog, both as a request for comments and to provide early access to it.

The Feature

This walk-through is based on the implementation of:

DATASET(count, TRANSFORM(..., COUNTER, ...))

This DATASET syntax executes the TRANSFORM 'count' times, passing the current iteration number to it via COUNTER, which can be used wherever numeric values are expected, to build incremental datasets. This feature is useful for creating test tables, where the data is used to test other features, or accessory tables which, when joined with other tables, can help you organise them.
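
For example, a minimal sketch of the new syntax (the record and field names are my own, purely for illustration):

r := {UNSIGNED4 id};                                  // a record with one numeric field
ds := DATASET(10, TRANSFORM(r, SELF.id := COUNTER));  // COUNTER takes the values 1..10
OUTPUT(ds);                                           // produces rows with id = 1, 2, ..., 10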

There was another syntax used to achieve the same functionality, applied to a dataset containing only one ROW:

NORMALIZE(dataset, count, TRANSFORM(..., COUNTER, ...))

The intention of this syntax is not clear, and it sometimes required the creation of a dummy dataset, which made the code less readable. We also wanted to make the operation distributed across the nodes, and doing that on a syntax that is already well known and complex (like NORMALIZE, with its many other uses) was harder than doing it on a new syntax.
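
For comparison, here is a sketch of how the same dataset would have been built with the old approach (again, the names are illustrative only):

r := {UNSIGNED4 id};
dummy := DATASET([{0}], r);                                   // single-row dummy dataset, only needed to drive NORMALIZE
ds := NORMALIZE(dummy, 10, TRANSFORM(r, SELF.id := COUNTER)); // run the TRANSFORM 10 times over that single row
OUTPUT(ds);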

So it was clearer (and easier) to add a new, simple (and meaningful) syntax, get NORMALIZE to optimise to it under certain conditions, and distribute the DATASET.

We'll follow the commits in GitHub as a real-world annotated walk-through of how to implement new features in the compiler and new activities in the engines, and of how to test them. It might not be the optimal path, but it is a real one, and it will help you understand the kind of problems we try to solve and how we do it in the wild.

Each step will be referenced by its pull request in GitHub, so you can refer to them as a complement to this tutorial.

The Files

All compiler files are within the directory 'ecl/hql' in the source tree, including the parser, tree builder, optimisers and exporters. You'll add your new feature in those files, and you'll need some tests under 'ecl/regress' to make sure the compilation part of the process is sane.
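
A compile-time regression test is just an ECL query exercising the new construct, ideally in combination with other features; for instance, a hypothetical sketch (the field names and the particular mix of operations are mine, not taken from the actual suite):

r := {UNSIGNED4 id, STRING10 name};
ds := DATASET(100, TRANSFORM(r, SELF.id := COUNTER, SELF.name := 'row' + (STRING)COUNTER));
s := SORT(ds, -id);       // combine the new construct with other operations
OUTPUT(CHOOSEN(s, 5));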

We use bison to generate our parser from Yacc files. The main file holding the whole grammar is 'hqlgram.y'. This file contains all the definitions, the reserved keywords and the general structure of the language. 'hqlexpr.cpp' is the core of the tree builder, while 'hqlopt.cpp' and 'hqlfold.cpp' are the main optimisers, the former for general optimisations and the latter mostly for folding expressions.

Roxie's activity files are under 'roxie/ccd', Thor's under 'thorlcr/activities' and HThor's under 'ecl/hthor'. Those files only need to be changed if you're adding not just a new syntax (i.e. a different way of performing the same activity) but a new activity, or at least changing the way the activity is executed.

The contents of this tutorial are expanded into the next four posts:

Step 1: The Parser, The Expression Tree and the Activity.
Step 2: The Distributed Flag, and Execution Tests.
Step 3: The Optimisation, and More Tests.
Step 4: Inlining and Conclusion.