Projecting fields into EMBEDs

One downside of using embedded database calls such as MySQL or Cassandra in your ECL code was that specifying the fields to be returned (or passed in, when inserting rows) was a little clunky and potentially inefficient. Projecting fields into EMBEDs makes this process much easier and more efficient in HPCC Systems 6.2.0.

Let’s take a step back and review the approach ECL Developers may have been using to date and then take a look at how to use this new feature.

You might have ECL code that looks something like this:

MySQLRow := RECORD
  UNSIGNED id;
  STRING firstname;
  STRING lastname;
END;

DATASET(MySQLRow) MySQLNames(UNSIGNED id) := EMBED(MySQL)
  SELECT * FROM MyTable WHERE id=?;
ENDEMBED;

Then in your ECL code you can use MySQLNames(id=‘value’) just as you might use a filter expression on a native ECL dataset. While this feels nice, clean and simple, there are some gotchas to be aware of.

The first gotcha is that you are assuming that the table in MySQL was defined with exactly the three fields specified, in the order specified. If someone adds a field to the MySQL table then the code stops working. This can be easily addressed in the following way:

DATASET(MySQLRow) MySQLNames(UNSIGNED id) := EMBED(MySQL)
  SELECT id, firstname, lastname FROM MyTable WHERE id=?;
ENDEMBED;

Which brings me to the next gotcha. It does get a bit tiresome if there are more than a handful of fields involved, because the information is replicated in the RECORD definition and the SELECT statement . It’s also potentially error prone. If you were to accidentally specify the fields in a different order in the two locations, you would most likely get everything doing a great of job of looking like it is working but with a subtle bug.

There’s also an efficiency issue here. Supposing you have defined MySqlNames as above and then used it in a context where only some of the fields are needed. For example,

INTEGER inID : STORED(‘inID’);
OUTPUT(MySqlNames(id=inID), { lastname } );

This code will work, but will fetch fields from the MySQL database (id and firstname, in this case) that are then discarded.

In HPCC Systems 6.2.0, there is a solution to both of these gotchas, which allows EMBED code like this to take advantage of the optimizations that the eclcc compiler applies to propogate information about which fields are needed as far up the execution graph as possible. In this case, the compiler knows that only the field lastname is needed, but (until now) it had no way to pass this information back to the MySQL library. This is where the new PROJECTED option helps, which is best described using an example:

DATASET(MySQLRow) MySQLNames(UNSIGNED id) := EMBED(MySQL : PROJECTED(‘**’))
  SELECT ** FROM MyTable WHERE id=?;
ENDEMBED;

The parameter to PROJECTED (‘**’ in this example but you can use any string you like) will be substituted (at compile time, in most cases) for a list of the names of all the fields in the output record. That solves the first issue I mentioned of the tedium (and scope for error) of having to supply the field names twice. But because what is actually substituted is the list of field names after the compiler’s optimization phase has eliminated any fields that are not used by subsequent ECL code, it also solves the second issue of fetching more data than is required simply to make the ECL code more easily maintained. Now you can have your cake and eat it too.

This PROJECTED option is supported by all the embedded plugins (the substitution is done by the compiler before the plugin actually gets to see it). You could use it in embedded Python too, if you can think of a sensible way to use it!

As an aside, I mentioned above that the substitution of the PROJECTED string is normally done at compile time. There is one case where it is not, which is when another new HPCC Systems 6.2.0 feature is in use. I hesitate slightly to mention this new feature as it can be abused (Bobby Tables, I’m looking at you… see https://xkcd.com/327/) but in 6.2.0 and later the code being executed via EMBED does not need to be constant, for example:

SqlTableName := ‘MyTable’ : STORED(‘SQLTableName’);
MySQLQuery := ‘SELECT ** FROM ‘ + SqlTableName + ‘WHERE id=?’;
DATASET(MySQLRow) MySQLNames(UNSIGNED id) := EMBED(MySQL, MySQLQuery : PROJECTED(‘**’));

In such cases, the substitution of PROJECTED is delayed until runtime.

Notes:

  1. HPCC Systems 6.2.0 is available for download as a release candidate now.
  2. View the JIRA ticket and GitHub pull request for this feature.
  3. Read more about using your favourite programming language or datastore with HPCC Systems.
  4. Find out more about the ECL Language.
  5. Information for developers about the uses and abuses of the EMBED feature.
  6. Embedding C++? This blog about best practices is a must read.