The ECL Intermediate Representation

The ECL compiler

When ECL code is compiled, the internal representation is a graph of expressions in which each ECL operation is linked to the operations it depends on. The compiler then walks this expression graph, looking for patterns, dead code, common sub-expressions and so on, until (ideally) optimal code is emitted at the end.
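To make that concrete, here is a minimal sketch, in Python rather than the actual compiler code, of one way common sub-expressions fall out of graph construction: identical nodes are interned ("hash-consing"), so the same operation on the same inputs becomes a single shared node. All names here are illustrative.

```python
# Illustrative sketch only: a tiny expression graph where identical
# sub-expressions are interned ("hash-consing"), so a later pass can
# treat common expressions as a single shared node.

class Node:
    def __init__(self, op, *children):
        self.op = op
        self.children = children

class Graph:
    def __init__(self):
        self._interned = {}

    def node(self, op, *children):
        # A node's identity is its operator plus the identity of its
        # children. If an identical node already exists, reuse it.
        key = (op, tuple(id(c) for c in children))
        if key not in self._interned:
            self._interned[key] = Node(op, *children)
        return self._interned[key]

g = Graph()
ds = g.node("dataset")
a = g.node("count", ds)
b = g.node("count", ds)   # same operation on the same input
assert a is b             # shared, not duplicated
```

Because sharing is established at construction time, later passes can compare nodes by pointer identity instead of re-deriving structural equality.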

Most C/C++ compilers produce object code that can be executed directly on a target machine. Others, like Java compilers, produce bytecode that is interpreted by a virtual machine running on the target machine.

However, like most domain-specific languages (DSLs), ECL is translated to a lower-level language. The ECL compiler produces C++ code, which is compiled into dynamic objects that are, in turn, loaded onto the machines of the HPCC cluster.

The expression graph

The internal representation of ECL code is what gives it most of its power. The compiler can understand the relationships between data operations, clean up much of the code, delay execution of the rest, combine instructions, distribute operations, and so on. The resulting code is highly optimised ECL, still in graph form, that gets lowered to C++ classes (a full post about that is on the way).

This internal representation is what most compiler engineers would expect: each node has children (its dependencies) and annotations. That may look simple, but ECL has features that can make the expression graph hard to understand, such as omitted arguments (either implied, or skipped via extra commas).

So, the motivations for creating a complete and unique textual representation of this internal representation (the ECL IR) are:

  • to help new people get acquainted with the structure
  • to document what the constructs are and what their children and annotations are
  • to document how to manipulate them and find the information you want
  • to make sure the representation is consistent and unique

The ECL IR

The ECL IR is a graph, so it needs to be represented as one. Since there is no way to create a textual representation of a graph that looks nice enough (unlike trees), the most obvious representation is a list of nodes, with the links expressed via node IDs, generally written as %N, where N is the numeric ID.

For example, examine the code below:

foo := PROJECT(dataset, transf(LEFT, COUNTER));

This code projects a given dataset using a transformation called transf. As expected, the dataset and the transformation are the arguments to that PROJECT, so we assign the result of the operation to a new node. It's also important to represent the type: while most types are obvious and redundant, some are not, and omitting them would make the resulting text non-unique.
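For readers unfamiliar with ECL, a rough Python analogue of that PROJECT can help: a per-row transform that also receives a 1-based counter. This is purely illustrative; the function and row names below are made up, not part of ECL or HPCC.

```python
# Hypothetical Python analogue of PROJECT(dataset, transf(LEFT, COUNTER)):
# apply a transform to every row (LEFT), passing a 1-based COUNTER.

def project(dataset, transf):
    return [transf(left, counter)
            for counter, left in enumerate(dataset, start=1)]

rows = [{"name": "a"}, {"name": "b"}]
foo = project(rows, lambda left, counter: {**left, "seq": counter})
# foo == [{'name': 'a', 'seq': 1}, {'name': 'b', 'seq': 2}]
```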

An initial idea on how to represent that is:

%10 = table project (?, ?, ...);

Now we need to define the arguments. You could expand them inline, but then, if you shared the arguments with another project, you'd have to write them out twice, and it would be impossible to know whether they were the same node or just two identical nodes. So we create a new node for each argument, and reference it:

%8 = table dataset(?, ?, ...);
%9 = transform transform(?, ?, ...);
%10 = table project (%8, %9);

Finally, this node has a name (foo) and a position in the original ECL code, so those annotations can also be expressed in this IR format:

%8 = table dataset(?, ?, ...);
%9 = transform transform(?, ?, ...);
%10 = table project (%8, %9); [symbol foo, location "foo.ecl":12]

Now %10 is the node that represents foo in the source code, and it's pretty obvious what you're pointing at if you follow the listing upwards.
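A short sketch of how such a listing could be produced from a graph. This is a hypothetical walk, not the compiler's actual dumper: shared nodes are numbered once, leaves print placeholder arguments, and annotations are appended after the statement.

```python
# Hypothetical dumper (not the compiler's actual code) that walks an
# expression DAG and prints the %N listing shown above. Shared nodes are
# numbered only once; annotations go after the statement.

def dump(root):
    ids, lines = {}, []

    def visit(node):
        if id(node) in ids:                 # shared node: reuse its %N
            return f'%{ids[id(node)]}'
        args = [visit(c) for c in node["children"]]
        n = len(ids) + 8                    # start at %8 to match the example
        ids[id(node)] = n
        text = f'%{n} = {node["type"]} {node["op"]}({", ".join(args) or "?, ?, ..."});'
        if "annot" in node:
            text += f' [{node["annot"]}]'
        lines.append(text)
        return f'%{n}'

    visit(root)
    return "\n".join(lines)

ds = {"type": "table", "op": "dataset", "children": []}
tr = {"type": "transform", "op": "transform", "children": []}
proj = {"type": "table", "op": "project", "children": [ds, tr],
        "annot": 'symbol foo, location "foo.ecl":12'}
print(dump(proj))
```

Because the walk visits children before parents, every %N reference points upwards to a line that has already been printed, exactly as in the listing above.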

This textual format is loosely based on LLVM's IR, and its use is somewhat similar: to uniquely represent the graph, to enable dumping (and restoring) state, to investigate and debug errors in the graph, and to educate and document the internal (pointer-based) format.

The future

This textual representation is by no means complete yet. There is an ongoing debate about how it should be implemented, represented and provided to the compiler in a way that is both useful and consistent. And although our goal is to make it a unique representation of the internal format, it'll take a long time to actually get there.

To make sure we reach maturity (and to monitor how far we still need to go), we'll create a test infrastructure that will first read ECL, then dump the IR and read it back again. The two graphs should be identical. If we manage to do that for all the ECL tests in our regression suite (a few thousand), we can claim the IR is reasonably complete and start using it internally, in optimisation logs (verbose mode) and compiler tests.
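The round-trip idea can be sketched in a few lines of illustrative Python (again made-up names, not the HPCC test harness, and with annotations left out for brevity): parse the textual IR into a node table, serialise it back, and check that nothing was lost.

```python
# Sketch of a round-trip check: parse the %N listing into a node table,
# re-serialise it, and verify the text survives unchanged.

import re

IR = """%8 = table dataset(?, ?, ...);
%9 = transform transform(?, ?, ...);
%10 = table project(%8, %9);"""

def parse(text):
    nodes, order = {}, []
    for line in text.splitlines():
        m = re.match(r'%(\d+) = (\w+) (\w+)\((.*)\);', line)
        num, typ, op, args = m.groups()
        # Arguments starting with % are links to earlier nodes.
        children = [int(a[1:]) for a in args.split(", ") if a.startswith("%")]
        nodes[int(num)] = (typ, op, children)
        order.append(int(num))
    return nodes, order

def serialize(nodes, order):
    lines = []
    for n in order:
        typ, op, children = nodes[n]
        args = ", ".join(f"%{c}" for c in children) or "?, ?, ..."
        lines.append(f"%{n} = {typ} {op}({args});")
    return "\n".join(lines)

nodes, order = parse(IR)
assert serialize(nodes, order) == IR   # round-trip is lossless
```

The same principle scales to the real test: ECL in, IR out, IR back in, and the two graphs compared for identity.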

The compiler tests are an interesting side effect, since today we mostly compare generated ECL and C++, which is painful when you have to do it for 4k+ pieces of code, some of it non-trivial. But if the IR is a direct representation of the tree, a quick comparison of it could show much more (and strip away much of the noise) than investigating optimised ECL or generated C++, and could also lead the way to automated testing using pattern-matching techniques.