Adventures in GraphLand III (The Underground Network)

If you have digested my blog on The Wayward Chicken then as you gaze out upon GraphLand you should no longer be thinking of a land of clearings connected via a maze of pathways. Rather we have a modern industrial landscape; every node is a factory with a rail link to the neighboring factories. Raw goods (data) regularly travel towards each node to be processed into a more refined good which may in turn become the raw material for a neighboring factory. Conceptually (though not factually), the process happens for every node synchronously until everything is complete. Only then are orders serviced, and that process usually just amounts to assembling the parts already made (quite possibly all within a single factory).

Our scene readily maps onto pure graph theory which would like us to believe the world consists of nodes and edges (balls and sticks) and that everything can be mapped onto this model. In truth it can; but it requires a shift of mind that I hope will repulse you as much as it does me.

In our opening data model we had people and relationships between people, and all our data fitted nicely into the model. How do I add the concept of ‘Vehicles’ to GraphLand? To stick with my stick-and-ball model I now have to give every node an extra property, the ‘Type’, which would be ‘Person’ or ‘Vehicle’. I then have to add vehicle properties to my node (which will be null for people), and adjust all of my processing logic to ensure I am filtering (at process time) down to the types of entity I am interested in.

In fairness there are benefits to this ‘model’; it is a data model which by definition can handle anything. The cost is that the logic around how to process the entity types has to be pushed down into the queries and, worse, all of the resolution and checking happens at run time. To return to our analogy: every factory could process different things in different ways and at different times, and we need to enter each factory to find out what it does. This could actually be a motivator behind the Wayward Chicken programming method.

The KEL model, and thus GraphLand, behaves very differently. In our first blog I introduced Entities and Associations and then used the terms interchangeably with nodes and edges. This is not strictly correct. Within KEL you can define multiple entity types and multiple types of associations, inter- or intra-entity. You can think of this as having multiple layers of factories burrowing underground with various pipes and tubes carrying goods from layer to layer. Within a single layer everything is homogeneous and synchronous, and the transport between layers is also homogeneous and synchronous, but the result is that complex heterogeneous queries can be achieved.

Enough theory – let’s get to the data.

First with apologies to all car buffs out there; here is the Bayliss interpretation of all interesting facts about our vehicles.

Secondly, a relationships file between people and vehicles. Note this file contains information regarding two different types of relationship between Person and Vehicle: Owns and Drives. The ‘per’ column is the entity ID for the person, the ‘veh’ column is the entity ID for the Vehicle.

These can be declared and used within GraphLand as easily as before:

It should be noted that if I personally were doing this I would have split my relationship file into two and imported one as an association ‘owns’ and one as ‘drives’; I would then have had rather nicer association names, compile time checking and nicer queries. However, for the sake of showing off some KEL features, I am bringing in the associations with the types intact and we will validate them at query time.

We may remember from our previous blog that associations for a given entity are now accessible as if they were child records of each entity. Thus the PerVeh association now exists for people and for vehicles.

This simple query provides a list of people with the vehicles they drive or own:

QUERY:Dump1 <= Person{UID,Name,


Of course we do not really want to use KEL simply to dump out data we already had. What we really want is to compute new information. So here is a question for you: “How many vehicles is each person in the list the sole owner of?”

I would imagine that most of you started:

COUNT(PerVeh.Veh(< Ouch, this is going to hurt> ))

You actually can make it work that way; but it is much better if you decompose the problem. The first question one should ask is: “Which vehicles have only one owner?” This becomes very simple:

Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));

Within our underground network of vehicles it is very easy to count the relationships to people. The new thing is that the associations can have values (such as Type) and these can be part of filters too.

Then within our surface entity (people) it becomes fairly easy to compute the number of vehicles a person is the sole owner of:

Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

The COUNT and the nOwners=1 should be familiar. The slight ugliness is the checking that we are dealing with ownership rather than simply driving; you may remember my saying I didn’t like late-typing!

We have to check Type=’OWNS’; the problem is that the filter is in the scope of the vehicle, which doesn’t have a Type attribute. KEL allows you to pull data from anywhere on the path to a data element, but you have to state the place in the path you are pulling the data from. Hence: Person.PerVeh.Type=’OWNS’.
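KEL is the language of record here, but the two-pass decomposition is language-neutral. Here is a minimal Python sketch with invented toy data (the names and rows are illustrative only): pass one annotates each vehicle with its owner count, and pass two counts, per person, the owned vehicles whose owner count is one:

```python
# Invented association triples: (type, person, vehicle).
per_veh = [
    ("OWNS", "helen", "car1"), ("OWNS", "david", "car1"),  # car1 jointly owned
    ("OWNS", "david", "car2"),                             # car2 solely owned
    ("DRIVES", "helen", "car2"),
]

# Pass 1 (vehicle scope): nOwners := COUNT(PerVeh(Type='OWNS'))
n_owners = {}
for typ, per, veh in per_veh:
    if typ == "OWNS":
        n_owners[veh] = n_owners.get(veh, 0) + 1

# Pass 2 (person scope): nWholeOwned := COUNT(PerVeh.Veh(Type='OWNS', nOwners=1))
n_whole_owned = {}
for typ, per, veh in per_veh:
    if typ == "OWNS" and n_owners[veh] == 1:
        n_whole_owned[per] = n_whole_owned.get(per, 0) + 1
```

Note that neither pass ever looks more than one hop away; the second-degree question is answered by two first-degree passes.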

As an aside: KEL is a language under development; we have an issue open to allow a rather prettier form:

Person: => nWholeOwned := COUNT(PerVeh(Type='OWNS').Veh(nOwners=1));

This works when a filter is resolvable before the full path is known.

Putting this all together we have:


USE File_Person(FLAT,Person);

Vehicle := ENTITY(FLAT(UID,Make,Colour=Color));
PerVeh := ASSOCIATION(FLAT(STRING Type,Person Per,Vehicle Veh));

USE File_Veh(FLAT,Vehicle);
USE File_PerVeh(FLAT,PerVeh);

Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));
Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

QUERY:Dump <= Vehicle{Make,Colour, nOwners,

QUERY:Dump1 <= Person{UID,Name, nWholeOwned,

Dump (a dump of vehicles with people as the children) yields:

Notice the use of the path walking syntax to annotate the child person records with the information from the link.

Dump1 is then rather simpler:

There is much more fun we can have with this data as we inch into some queries that are necessarily second order. First though I want to tackle the sub-graph matching problem; or rather point out that we have actually tackled it already.


Different Types of Joins

Matching records from multiple data sources is one of the fundamental operations you need to process data. As you would expect ECL makes it easy – but the number of options can be bewildering. The following aims to guide you through the different options, and explain when they are appropriate.

The "simplest" way to match records between two datasets is to pair each record in one dataset with each record in the other dataset, and then filter the pairs to extract the matches. This is what JOIN(,ALL) does. Unfortunately, as well as being simple, it is likely to be very slow unless the datasets are very small. If the first dataset has M records, and the second N, then it will typically execute in O(MN) time. Thankfully if the match condition contains any equalities then there are much better approaches.

One of these approaches is to sort both datasets by the equality fields and then step through the two datasets in lockstep. You have the overhead of sorting the datasets, which is typically O(M log M) + O(N log N), and finding the matching pairs is typically O(M+N). This is generally much quicker than the exhaustive approach above, and is the normal approach used in Thor.
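As a sketch of the mechanism (Python for illustration only; Thor's actual implementation is distributed and far more sophisticated), a sort-merge join looks like this:

```python
def merge_join(left, right, key):
    """Sort both inputs, then walk them in lockstep emitting matching pairs."""
    left = sorted(left, key=key)    # O(M log M)
    right = sorted(right, key=key)  # O(N log N)
    out, j = [], 0
    for l in left:
        # skip right rows whose key is below the current left key
        while j < len(right) and key(right[j]) < key(l):
            j += 1
        # emit every right row with an equal key
        k = j
        while k < len(right) and key(right[k]) == key(l):
            out.append((l, right[k]))
            k += 1
    return out
```

For example, `merge_join([1, 2, 2, 3], [2, 3, 4], key=lambda x: x)` yields the pairs (2,2), (2,2), (3,3) without ever comparing every left row against every right row.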

Roxie uses a different implementation by default – instead of sorting the two datasets it starts by creating a hash table from the second dataset. Then each row from the first dataset is looked up in the hash table to find the matches. With the current implementation the "right" dataset is still sorted (to simplify duplicate handling and many lookups), but the order of the left is preserved. If the right dataset is relatively small then this will be significantly faster than sorting both datasets. This type of join can be selected in Thor by adding a ",MANY LOOKUP" attribute.
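The lookup strategy can be sketched the same way (again an illustrative Python reduction, not Roxie's implementation): hash the right side once, probe once per left row, and the left order falls out preserved:

```python
from collections import defaultdict

def lookup_join(left, right, key):
    """MANY LOOKUP style: build a hash table on the (small) right dataset."""
    table = defaultdict(list)
    for r in right:                 # O(N) build
        table[key(r)].append(r)
    return [(l, r)                  # O(M) probes; left order preserved
            for l in left for r in table[key(l)]]
```

The trade-off the text describes is visible here: the whole of `right` lives in the hash table, which is why the entire right dataset must fit in memory.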

If Roxie uses a LOOKUP join by default why doesn’t Thor? The limitation of a LOOKUP JOIN is that the entire right dataset needs to be held in memory. That is a reasonable assumption to make for Roxie, where the datasets are generally small, but it isn’t for Thor.

So far the discussion has assumed it is easy to compare a pair of rows, but it is only efficient if the rows are present on the same node in the cluster. Different methods are used in Thor to ensure the rows are present on the same node:

  • ALL joins and LOOKUP joins clone all the rows from the right dataset onto each of the nodes. This is only really efficient if the right dataset is reasonably small.
  • The sorted joins normally globally sort the first dataset, and use the distribution of records created from the first sort to distribute the second dataset to match.
  • An alternative is the HASH join. This distributes both datasets by a hash of the equality fields, and then performs local sorts after the datasets have been distributed.

If the ECL user knows already that the rows from the datasets must be on the same node, then they can opt to use a LOCAL variant of the join to avoid any redistribution. Sometimes, if the distribution of the left dataset is already known, the code generator will automatically distribute the right dataset to match, and then perform a local variant of the join.[+]

So in summary…

  • If the join condition has no equalities then you need to use ALL.
  • If the right dataset is small you can use MANY LOOKUP.
  • Otherwise use a global sort.
  • You can use HASH to change the way distributions occur. (Sometimes this improves the performance.)
  • If matching rows are already on the same nodes you can add ,LOCAL.

But what if you don’t know how large the right dataset is? Using a standard join will work whatever the size of the datasets, but if the right is small then a lookup join would be significantly more efficient. One way to optimize for small datasets is to use conditional code based on an estimate of the size of the dataset. Unfortunately it makes the graphs ugly and confusing, and has to be pessimistic to avoid jobs failing because of insufficient memory. Why can’t it be done automatically? That’s where the new Thor SMART join comes in.

The SMART join processing code starts off assuming the right dataset is small, but gradually falls back to implementations that can cope with larger numbers of records. It goes through the following sequence:

  • Initially it uses a MANY LOOKUP, duplicating the rows onto each node.
  • If it runs low on memory it uses a hash of the equality fields to distribute each row to a particular node. It then distributes the left dataset to match that distribution and uses a LOCAL MANY LOOKUP join on each node.
  • If there is still insufficient memory for the local hash table, the implementation falls back to a local join which sorts both sides of the re-distributed datasets (essentially a HASH join).

There is little penalty from using a SMART join, but potentially a significant advantage for small datasets. If you are joining to a dataset of unknown size then it is worth experimenting by adding ",SMART" to your join. It is quite likely to become the default join implementation for Thor in a future version.

(Note, a SMART join can’t be the default implementation of a MANY LOOKUP join because a smart join may change the distribution of the left dataset. The code generator tracks the distribution of each dataset and uses the information to optimize the query – those assumptions and optimizations would be broken if the dataset were redistributed.)

Pseudo Joins

There are a few patterns of ECL usage which are effectively joining two datasets together, but the join is cunningly disguised...

  • An implicit dataset lookup:


    This occurs when a dataset is searched for a row whose field matches a value, and a field is then extracted from the matching row. If the dataset is small you are much better off using the new DICTIONARY support. Otherwise it would be much better to JOIN against the dataset (using a LEFT OUTER join if there may be no matching element, and KEEP(1) if there might be more than one).
  • Filtering by existence in a set:

    myDataset(searchValue IN SET(otherDataset, value))

    This code is searching to see if a value is contained in another dataset. Again if the dataset is small a dictionary is likely to be much more efficient. If it is large then an INNER JOIN (with KEEP (1)) would often be more appropriate.
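The dictionary advice in both bullets comes down to replacing a per-row linear search with a hashed probe. A Python analogue (field names and values invented for illustration):

```python
# Invented rows standing in for otherDataset.
other_rows = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]

# DICTIONARY analogue: build once, probe in O(1) per row.
lookup = {row["id"]: row["name"] for row in other_rows}

# Implicit dataset lookup: extract a field from the matching row.
name_for_2 = lookup.get(2)

# Existence filter: searchValue IN SET(otherDataset, value).
my_rows = [{"search": 2}, {"search": 9}]
keep = [r for r in my_rows if r["search"] in lookup]
```

Without the dictionary, each probe would scan `other_rows` from the start, which is exactly the hidden O(MN) cost the text warns about.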

Other JOIN Options

Sometimes most rows match relatively few rows in the other dataset, but there are a few instances of the match condition which result in very large numbers of records (e.g., john smith). Many of the options on a join are there to help cope with these pathological matches.

  1. LIMIT(n)

    If there are more than n matches for a particular row then abort the job.
    Note: LIMITs and ATMOSTs are only applied to the portion of the join condition that uses equalities. (Often called the hard match condition in some of the documentation.)

  2. LIMIT(n, SKIP)

    If there are more than n matches for a particular row exclude any matches for this row from the output.

  3. ATMOST(n)

    If there are more than n matches for a particular row then treat it as if it didn’t match anything.
    It is very similar to LIMIT(n, SKIP), but the difference is that for a LEFT OUTER join ATMOST will generate an output record. This allows you to return information about rows that had too many matches in the results.
    It is worth adding ATMOST(1) to a LEFT OUTER join that can only match 1 record. It will allow the code generator to remove the join if it isn’t needed.

  4. ATMOST({cond1, cond2, .., condn},n)

    If there are more than n matches for a particular row, apply each of the extra optional conditions until the number of matches is below the threshold.
    This allows you to apply a fairly broad join criteria for most pairs of rows, but add extra criteria if the equality fields are very common (e.g., some last names are much more common than others, so add extra conditions on first name, or other criteria). It is often used when the match condition also includes an edit distance match function to ensure the number of pairs processed by the edit distance function is reasonable.

  5. KEEP(n)

    Only generate output records for the first n rows that match the full join condition.

There are a few other options which are detailed in the language reference manual, which can help in fairly specific circumstances, but generally you should only add options for good reason.

[+] This new optimization has occasionally caused problems where ECL programmers have added a DISTRIBUTED() operation into some ECL to remove a warning, but the distribution expression wasn’t correct. They had used DISTRIBUTED(dataset, some-arbitrary-expression) instead of DISTRIBUTED(dataset), which indicates the dataset is distributed in an unspecified way. The moral of the story – never lie to the ECL compiler or it might come back to bite you.

Adventures in GraphLand (The Wayward Chicken)

The other evening I was distracted from a lengthy debug session by a commotion coming from the chicken coop. I raced out the door fearful that some local predator had decided to attempt to steal some dinner. I discovered no such thing; rather ‘the girls’ were all agitating to get out of the coop. In my concentration I had forgotten that it was past the time they are normally allowed out for their daily forage. I immediately opened the coop door and with an explosion of feathers the girls hurtled from the coop in all directions across the yard; each one individually convinced that the particular location that they had chosen to run to was liable to yield the most and tastiest bugs. Last out of the coop stalked our rooster with a grim look on his face; he would be spending the next hour trying to round his charges into the same place so that he could protect them from predators.

This tale of bucolic bliss is not provided to amuse; rather it illustrates the behavior of most programmers when first presented with a graph to analyze. Some deep primal instinct somehow persuades the otherwise analytic mind that the best way to tackle a graph is to scamper all over the place collecting snippets of information as one travels. One of our top technologists responded to my first blog with: “How do I get as far as possible as quickly as possible? Can I go backwards and forwards along the same path? What if I run around in circles? What if I get trapped? How do I know if I wander so far that I’m about to fall off the map?” Being a top technologist his language was rather more erudite (shortest path, directedness, cyclicity, connectedness and centrality) but the intuitive meaning is identical.

I don’t know what causes programmers to do this but I suspect it may be related to the (false) urban myth that everyone is related to Kevin Bacon by six degrees of separation. As part of a drastically over-affirmed society, our take-away from this myth is that there is yet another reason why we are special. I would suggest that if everyone is within six degrees of Kevin Bacon then the correct take-away should be that there is no significance to being within six degrees of Kevin Bacon.

Rather more mathematically: let’s assume that everyone has 25 close associates (relatives, co-workers, neighbors, in-laws) and, as a first approximation, that the graph spans out evenly and without cycles. Then by six degrees of separation you have 25^6 ≈ 244M people; around the adult population of the US. If I were to say: “There are two million criminals in the US, you are in the US, therefore you might be a criminal” people would think I was crazy. If I were to say that I had computed that two million of your sixth-degree relatives were criminals, then you would be in for questioning. The reality is that the two statements are essentially the same.

Even if we narrow our sights far further: if you view ‘relatives’ as a social network (25 people) rather than pure blood (3-9 people), then by third-degree relationships you have fifteen thousand people; you would expect around 150 of them to have done jail time simply by random chance. At a statistical level, if your graph has a branching factor of 25 then a fact regarding a second-degree relative is twenty-five times less significant than one relating to a first-degree relative.
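The arithmetic above is easy to check:

```python
# Branching factor of 25 associates, assuming an even, cycle-free fan-out.
branching = 25
six_degrees = branching ** 6    # people reached within six degrees
three_degrees = branching ** 3  # people reached within three degrees
```

25^6 is 244,140,625, on the order of the US adult population, and 25^3 is 15,625, the "fifteen thousand" third-degree relatives quoted above.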

As such: the best and most significant information is always close to home.

This insight is one of the core drivers behind the KEL programming philosophy. Whilst some graph languages proudly declare they are ‘graph traversal’ languages; KEL is a knowledge engineering language. Specifically:

KEL exists to concentrate knowledge from the graph into the entity (not to scatter the query from the entity to the graph).

To illustrate the difference I want to tackle a question for the data presented in the last blog: “how many of your first degree relatives are the youngest person they know?” To make this easier to visualize here is the graph:

To answer the given question for ‘David’ here is a plausible graph traversal approach:

If you execute the above you will visit the following nodes in sequence:

So a query about David causes us to visit 15 nodes. As an exercise, you should try writing out the nodes visited if I wanted to know the average number of youngest relatives for the two oldest people in the graph. You should visit 29 nodes.

The KEL methodology is rather different:

Pleasingly it is rather easier to read in KEL than it is in pseudo-code!

Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

This is not a language tutorial but there are a couple of aspects of the above worth noting:

  • The ‘Person:’ defines the scope of the operation. It applies to Person nodes.
  • The space between the ‘:’ and the ‘=>’ allows a precondition for the operation to apply; that will be a later subject.

    The interesting piece occurs after the ‘:=’ so we will look at the first in detail:

    • For a given entity all of the associations are available as if they were a child dataset with the label of the association.
    • Implicitly, when in the scope of an entity, the references to any associations fix the first matching entity in the association to self.



    Thus COUNT(Relationship) in the person scope would give you the number of relationships for each person.

    For this particular computation we want to filter those associations further, to those in which the OTHER person in the relationship is younger. We do this using: (whoelse.Age<Age). If there is no-one who is younger than you are then you are declared to be the youngest.

    Subsequent to this line every Person node has a new property available, ‘Youngest’, which can be queried and used as any other property. Thus in the second line, we can compute the number of youngest relatives simply by counting and using the Youngest property as a filter.

    The nodes visited to compute the query for David are:

    Er, which is 22! That is 7 more visits than the normal method! Why would I go to such pains to make my life harder? Well, now trace through how many visits you have to make to compute the average for the two oldest people. The total should be 26; a savings of 3. Now suppose I want to know the average number of youngest relatives across the whole graph; suddenly the KEL solution begins to pull substantially ahead.

    So what is happening? In the traditional model, each computation for each entity moves away from the entity to the first degree relatives which in turn have to move to their first degree relatives to compute the answer. Thus most of the computations are performed at the second degree from the original entity being considered.

    If viewed programmatically, within the KEL model this second-degree query is written as two successive first-degree queries. Of course, the KEL compiler may or may not choose to execute the query that way, and it will almost certainly have different strategies for Thor and Roxie; but logically that is how it is specified.

    We should not, however, view KEL programmatically. Rather we should view KEL, or at least the KEL logic statements, as a graph annotation model, or better a knowledge concentration model. With each logic statement, knowledge that could be gleaned by traveling away from a node is moved and stored within the node itself. Queries then execute against the knowledge-rich graph that is produced.
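The two logic statements can be sketched as two annotation passes over a toy graph (Python with invented names and ages; the KEL above is the definitive form):

```python
# Invented toy data: ages plus symmetric (who, whoelse) relationship pairs.
age = {"david": 40, "helen": 38, "sam": 12, "jo": 70}
rels = [("david", "sam"), ("david", "jo"), ("helen", "sam")]
rels += [(b, a) for a, b in rels]   # visible from both ends, as in KEL

# Pass 1: Youngest := COUNT(Relationship(whoelse.Age < Age)) = 0
youngest = {p: not any(age[b] < age[p] for a, b in rels if a == p)
            for p in age}

# Pass 2: YoungRels := COUNT(Relationship.whoelse(Youngest))
young_rels = {p: sum(1 for a, b in rels if a == p and youngest[b])
              for p in age}
```

Each pass reads only first-degree neighbours; the second pass reads the annotation the first pass stored on the node, which is the knowledge-concentration idea in miniature.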

    There are, of course, examples where a graph query cannot be decomposed into a succession of simpler queries. Those are slightly harder to code and significantly harder to execute; they will be the subject of a later blog. For now just keep in mind: if you can code it as a succession of near-relative encounters then you should do so. In the meantime we need to tackle heterogeneity; what do you do if not all of your entities are the same type? That will be the subject of our next blog.

    In closing I would like to present our full ‘program’ as we have it so far; and the answers:

    Person := ENTITY(FLAT(UID,Name,INTEGER Age));
    Relationship := ASSOCIATION(FLAT(Person who,Person whoelse));

    USE File_Person(FLAT,Person);

    USE File_Relat(FLAT,Relationship(who=person1,whoelse=person2));

    Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
    Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

    QUERY:Dump1 <= Person{UID,Name,Age,Youngest,YoungRels};


    Incidentally: if anyone has the time and inclination to tackle these queries in alternate graph languages I would be very interested to see the results. I would also love to see if someone wants to have a swing at this in ECL; it will help me keep the KEL compiler writer on its toes.

    Tips and Tricks for ECL -- Part 1 -- Bit Fiddling

    I get fairly frequent emails asking, "How can I ... (do something) in ECL?" I've saved the example code from many of these questions, so now I'll begin sharing them through this blog.

    If anybody has this type of question they'd like me to answer (in the blog), please just send me an email and I will do my best to come up with an ECL example demonstrating the solution.

    All the code examples in this series will be stored in the "BlogCode" directory of my repository. So you will see IMPORT BlogCode used frequently.

    We'll start now with several questions I've gotten about "bit fiddling" in ECL.

    A Couple of Useful Functions

    The first thing to know about "bit fiddling" in ECL is that the bitwise operators only work on integer data types. Therefore, the following examples are all designed to work with an UNSIGNED4 bitmap, although you can certainly use them with smaller bitmaps if you need to.

    We'll start with a couple of functions I wrote that are generally useful when you're working with bits:

    EXPORT Bit(UNSIGNED4 Bitmap,UNSIGNED1 WhichBit) := MODULE
      SHARED Mask := (UNSIGNED4)POWER(2,WhichBit-1);
      EXPORT BOOLEAN   IsOn := Bitmap & Mask <> 0;
      EXPORT UNSIGNED4 Flip := Bitmap ^ Mask;
    END;

    I created a MODULE structure to contain these, since they are related functions and both take the same two parameters. They also both use the same expression, so I was able to simplify the code by defining the SHARED Mask definition used by both EXPORT functions.

    For the SHARED Mask definition, the POWER function raises the base value (2) to the exponent WhichBit-1. An exponent of zero gives 1, an exponent of one gives the base number, and so on, so the Mask (as a decimal value) will always be 1, 2, 4, 8, 16, etc.; the only bit turned on in the resulting UNSIGNED4 Mask is the one in the ordinal position specified by the WhichBit parameter.

    The IsOn function returns a BOOLEAN indicating whether the specified bit is a 1 (on) or 0 (off). It uses the Bitwise AND operator (&) with the passed-in Bitmap and the Mask; the result is non-zero exactly when the specified bit is on (so an empty bitmap immediately yields FALSE).

    The Flip function returns an UNSIGNED4 as the new bitmap after it has flipped the value of the specified bit. If the passed-in bitmap has a 1 (on) in that bit it will change it to 0 (off), and vice versa. It uses the Bitwise eXclusive OR operator (^) with the passed-in Bitmap and the Mask to flip just the value of the specified bit, leaving all other bits as they were.

    Here's an example of how these are used. This can be run in any builder window:

    IMPORT BlogCode;
    UNSIGNED4 MyBitmap := 4; //or in binary -- 00000000000000000000000000000100b
    BlogCode.Bit(MyBitmap,1).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,2).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,3).IsOn;  //returns TRUE
    NewBitMap := BlogCode.Bit(MyBitmap,2).Flip; //turn on second bit 
      //making this -- 00000000000000000000000000000110b
    BlogCode.Bit(NewBitMap,1).IsOn;  //returns FALSE
    BlogCode.Bit(NewBitMap,2).IsOn;  //returns TRUE
    BlogCode.Bit(NewBitMap,3).IsOn;  //returns TRUE


    When David Bayliss reviewed the above code, he said that I had "just inserted a floating point operation into low-level bit twiddling. This is HORRIBLE." So he suggested an alternative method that just uses the bitshift left operator (<<) along with the Bitwise AND (&) and Bitwise eXclusive OR (^) operators, like this:

      EXPORT BOOLEAN   IsOn := Bitmap & (UNSIGNED4)(1 << WhichBit-1) <> 0;
      EXPORT UNSIGNED4 Flip := Bitmap ^ ((UNSIGNED4)1 << WhichBit-1);

    This method removes the need for that POWER function.
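The same shift-based masks behave identically in any language; here is a quick Python check (bit positions 1-based, as in the ECL; note that a flip needs XOR, since OR can only ever set the bit):

```python
def is_on(bitmap, which_bit):
    return bitmap & (1 << (which_bit - 1)) != 0

def flip(bitmap, which_bit):
    return bitmap ^ (1 << (which_bit - 1))  # XOR toggles just that bit

bm = 0b100                        # bit 3 on
assert is_on(bm, 3) and not is_on(bm, 2)
assert flip(bm, 2) == 0b110       # second bit turned on
assert flip(0b110, 2) == 0b100    # ...and off again
```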

    So What Can You Do With Bitmaps?

    Someone emailed me with this request:

    "Thought you might have a trick up your sleeve to attack this type of encoding. Basically an integer value is used to represent a list of values where each bit represents a separate item.

    "In the example below, each bit set represents an index into a string to get yet another code.

    // Valid codes  " VURQLKGFEDCABIJOWY",   
    // Position      1234567890123456789
    // Number                  position              Code
    //   32 = 2^5,              5  + 1                L
    // 4096 = 2^12              12 + 1                A
    //  512 = 2^9               9  + 1                E
    //  544 = 2^5 + 2^9                               LE

    "How do we write ECL code to translate a number to a code?"

    So here's the first code I wrote to handle this specific problem:

    IMPORT $;
    EXPORT Bit2Code(UNSIGNED4 B,STRING32 Codes) := FUNCTION
      Code32 := IF($.Bit(B,32).IsOn,Codes[32],'');
      Code31 := IF($.Bit(B,31).IsOn,Codes[31],'');
      Code30 := IF($.Bit(B,30).IsOn,Codes[30],'');
      Code29 := IF($.Bit(B,29).IsOn,Codes[29],'');
      Code28 := IF($.Bit(B,28).IsOn,Codes[28],'');
      Code27 := IF($.Bit(B,27).IsOn,Codes[27],'');
      Code26 := IF($.Bit(B,26).IsOn,Codes[26],'');
      Code25 := IF($.Bit(B,25).IsOn,Codes[25],'');
      Code24 := IF($.Bit(B,24).IsOn,Codes[24],'');
      Code23 := IF($.Bit(B,23).IsOn,Codes[23],'');
      Code22 := IF($.Bit(B,22).IsOn,Codes[22],'');
      Code21 := IF($.Bit(B,21).IsOn,Codes[21],'');
      Code20 := IF($.Bit(B,20).IsOn,Codes[20],'');
      Code19 := IF($.Bit(B,19).IsOn,Codes[19],'');
      Code18 := IF($.Bit(B,18).IsOn,Codes[18],'');
      Code17 := IF($.Bit(B,17).IsOn,Codes[17],'');
      Code16 := IF($.Bit(B,16).IsOn,Codes[16],'');
      Code15 := IF($.Bit(B,15).IsOn,Codes[15],'');
      Code14 := IF($.Bit(B,14).IsOn,Codes[14],'');
      Code13 := IF($.Bit(B,13).IsOn,Codes[13],'');
      Code12 := IF($.Bit(B,12).IsOn,Codes[12],'');
      Code11 := IF($.Bit(B,11).IsOn,Codes[11],'');
      Code10 := IF($.Bit(B,10).IsOn,Codes[10],'');
      Code09 := IF($.Bit(B,09).IsOn,Codes[09],'');
      Code08 := IF($.Bit(B,08).IsOn,Codes[08],'');
      Code07 := IF($.Bit(B,07).IsOn,Codes[07],'');
      Code06 := IF($.Bit(B,06).IsOn,Codes[06],'');
      Code05 := IF($.Bit(B,05).IsOn,Codes[05],'');
      Code04 := IF($.Bit(B,04).IsOn,Codes[04],'');
      Code03 := IF($.Bit(B,03).IsOn,Codes[03],'');
      Code02 := IF($.Bit(B,02).IsOn,Codes[02],'');
      Code01 := IF($.Bit(B,01).IsOn,Codes[01],'');
      RETURN TRIM(Code01 + Code02 + Code03 + Code04 + Code05
                + Code06 + Code07 + Code08 + Code09 + Code10
                + Code11 + Code12 + Code13 + Code14 + Code15
                + Code16 + Code17 + Code18 + Code19 + Code20
                + Code21 + Code22 + Code23 + Code24 + Code25
                + Code26 + Code27 + Code28 + Code29 + Code30
                + Code31 + Code32,ALL);
    END;

    This function takes two parameters: an UNSIGNED4 bitmap, and a STRING32 containing the valid codes for each position. This is a simple "brute force" approach that will test each bit in the bitmap and either take the code value for that position or a blank string. The TRIM function is using the ALL option to remove all spaces from the concatenated result string.

    You can test the function in a builder window like this:

    IMPORT BlogCode;
    ValidCodes := ' VURQLKGFEDCABIJOWY';
    output(BlogCode.Bit2Code(32,ValidCodes),named('Bit2Code_32'));		  //L
    output(BlogCode.Bit2Code(4096,ValidCodes),named('Bit2Code_4096'));	//A
    output(BlogCode.Bit2Code(512,ValidCodes),named('Bit2Code_512'));		//E
    output(BlogCode.Bit2Code(544,ValidCodes),named('Bit2Code_544'));		//LE
    output(BlogCode.Bit2Code(10000000000000000b,ValidCodes),named('Bit2Code_10000000000000000b'));	//O
    output(BlogCode.Bit2Code(10000000010000000b,ValidCodes),named('Bit2Code_10000000010000000b'));	//GO
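    Outside ECL, the same technique is just a loop over the set bits. Here is a rough Python sketch (a hypothetical helper, not part of the BlogCode module) that reproduces the test results above:

```python
def bit2code(bitmap: int, codes: str) -> str:
    # Bit i of the bitmap (counting from 0 at the least significant end)
    # selects codes[i]; clear bits contribute nothing. This mirrors the
    # 1-based Codes[n] / Bit n pairing in the ECL version.
    return ''.join(c for i, c in enumerate(codes) if bitmap & (1 << i))

valid_codes = ' VURQLKGFEDCABIJOWY'
print(bit2code(32, valid_codes))    # bit 6 on  -> 'L'
print(bit2code(544, valid_codes))   # bits 6 and 10 on -> 'LE'
```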

    Once I had solved his specific problem, I decided to expand the solution to be a more generic tool. Instead of using the bitmap to indicate a set of single letter codes, why not have it indicate a set of strings of any length (up to 4K, in this example)? That's what this next version does:

    IMPORT $;
    EXPORT Bit2String := MODULE
      EXPORT Layout := RECORD
        STRING Txt{MAXLENGTH(4096)};
      END;
      // B is the bitmap; Codes holds one candidate string per bit position.
      // The Func declaration is a reconstruction -- the original header was lost.
      EXPORT Func(UNSIGNED4 B, DATASET(Layout) Codes) := FUNCTION
        br := ROW({''},Layout);
        Code32 := IF($.Bit(B,32).IsOn,Codes[32],br);
        Code31 := IF($.Bit(B,31).IsOn,Codes[31],br);
        Code30 := IF($.Bit(B,30).IsOn,Codes[30],br);
        Code29 := IF($.Bit(B,29).IsOn,Codes[29],br);
        Code28 := IF($.Bit(B,28).IsOn,Codes[28],br);
        Code27 := IF($.Bit(B,27).IsOn,Codes[27],br);
        Code26 := IF($.Bit(B,26).IsOn,Codes[26],br);
        Code25 := IF($.Bit(B,25).IsOn,Codes[25],br);
        Code24 := IF($.Bit(B,24).IsOn,Codes[24],br);
        Code23 := IF($.Bit(B,23).IsOn,Codes[23],br);
        Code22 := IF($.Bit(B,22).IsOn,Codes[22],br);
        Code21 := IF($.Bit(B,21).IsOn,Codes[21],br);
        Code20 := IF($.Bit(B,20).IsOn,Codes[20],br);
        Code19 := IF($.Bit(B,19).IsOn,Codes[19],br);
        Code18 := IF($.Bit(B,18).IsOn,Codes[18],br);
        Code17 := IF($.Bit(B,17).IsOn,Codes[17],br);
        Code16 := IF($.Bit(B,16).IsOn,Codes[16],br);
        Code15 := IF($.Bit(B,15).IsOn,Codes[15],br);
        Code14 := IF($.Bit(B,14).IsOn,Codes[14],br);
        Code13 := IF($.Bit(B,13).IsOn,Codes[13],br);
        Code12 := IF($.Bit(B,12).IsOn,Codes[12],br);
        Code11 := IF($.Bit(B,11).IsOn,Codes[11],br);
        Code10 := IF($.Bit(B,10).IsOn,Codes[10],br);
        Code09 := IF($.Bit(B,09).IsOn,Codes[09],br);
        Code08 := IF($.Bit(B,08).IsOn,Codes[08],br);
        Code07 := IF($.Bit(B,07).IsOn,Codes[07],br);
        Code06 := IF($.Bit(B,06).IsOn,Codes[06],br);
        Code05 := IF($.Bit(B,05).IsOn,Codes[05],br);
        Code04 := IF($.Bit(B,04).IsOn,Codes[04],br);
        Code03 := IF($.Bit(B,03).IsOn,Codes[03],br);
        Code02 := IF($.Bit(B,02).IsOn,Codes[02],br);
        Code01 := IF($.Bit(B,01).IsOn,Codes[01],br);
        RETURN (Code01 + Code02 + Code03 + Code04 + Code05 
                  + Code06 + Code07 + Code08 + Code09 + Code10 
                  + Code11 + Code12 + Code13 + Code14 + Code15 
                  + Code16 + Code17 + Code18 + Code19 + Code20 
                  + Code21 + Code22 + Code23 + Code24 + Code25 
                  + Code26 + Code27 + Code28 + Code29 + Code30 
                  + Code31 + Code32)(Txt <> '');
      END;
    END;

    This code uses the same "brute force" approach, but instead of building characters for a string, it defines the records that will go into the result record set. That means that instead of using the TRIM function to get rid of blank spaces, we simply append all the result records into a record set that we filter to eliminate the blank records.

    Testing this version is similar to the previous, but we pass the dataset of text values as the second parameter. Multiple bits turned on result in multiple records in the result set.

    IMPORT BlogCode;
    SetStrings := ['','The Voice','American Idol','The X Factor','Law & Order',                  
                   'Lost','Glee','The Daily Show','Revenge','Hart of Dixie',
                   'Walking Dead','True Blood','Sopranos','Game of Thrones',
                   'Downton Abbey','Poirot','Rizzoli & Isles','Suits','Swamp People',
                   'Pawn Stars','Firefly','AC 360','Fox & Friends','Hardball',
                   'Mike & Molly','60 Minutes','The Ellen Show','Elementary','Sherlock',
                   'Here Comes Honey Boo Boo','Doctor Who'];
    ds := DATASET(SetStrings,BlogCode.Bit2String.Layout);

    Bitmapped Dates

    I was teaching an ECL class one day in Alpharetta, and we were having a discussion of the various ways you can store dates in ECL. Joe Keim came up with this solution.

    This function will squeeze a standard YYYYMMDD 8-byte date string into three bytes of storage in an UNSIGNED3 bitmap. This format has the added advantage that it will still be perfectly sortable, just as a YYYYMMDD string would be.

    EXPORT Date2Bits(STRING8 YYYYMMDD) := FUNCTION
      YY := (UNSIGNED3)YYYYMMDD[1..4] << 9;
      MM := (UNSIGNED3)YYYYMMDD[5..6] << 5;
      DD := (UNSIGNED3)YYYYMMDD[7..8];
      RETURN YY | MM | DD;
    END;

    This Date2Bits function takes the first four characters of the string (YYYY) and casts that value into an UNSIGNED3 that is then shifted left 9 bits. The next two characters of the string (MM) are also cast into an UNSIGNED3 that is then shifted left 5 bits. The last two characters of the string (DD) are simply cast into an UNSIGNED3.

    Those three UNSIGNED3 values are then ORed together using the Bitwise OR operator (|) to create the resulting bit map, where:

    YYYY is in the first 15 bits
    MM is in the next 4 bits
    DD is in the last 5 bits

    You can then use this Bits2Date function to reconstruct the YYYYMMDD string from the bitmap:

    EXPORT Bits2Date(UNSIGNED3 d) := FUNCTION
      STRING4 YY := INTFORMAT(((d & 111111111111111000000000b) >> 9),4,0);
      STRING2 MM := INTFORMAT(((d & 000000000000000111100000b) >> 5),2,1);
      STRING2 DD := INTFORMAT((d & 000000000000000000011111b),2,1);
      RETURN YY + MM + DD;
    END;

    This function uses the Bitwise AND operator (&) against a binary mask, which is then shifted right 9 bits to produce the integer value for the YYYY. The INTFORMAT function then right-justifies that integer value into a 4-byte string with leading blanks. The MM and DD values are treated the same way, except their STRING2 results are formatted with leading zeroes. The final YYYYMMDD string is a simple concatenation of the three.
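    The packing arithmetic is easy to sanity-check outside ECL. This Python sketch (hypothetical helpers mirroring Date2Bits and Bits2Date, not part of BlogCode) shows the round trip and confirms that packed dates still sort in the same order as the strings:

```python
def date2bits(yyyymmdd: str) -> int:
    # YYYY in the top 15 bits, MM in the next 4, DD in the low 5.
    return (int(yyyymmdd[0:4]) << 9) | (int(yyyymmdd[4:6]) << 5) | int(yyyymmdd[6:8])

def bits2date(d: int) -> str:
    # Mask and shift each field back out, then reformat with leading zeroes.
    return f'{d >> 9:04d}{(d >> 5) & 0xF:02d}{d & 0x1F:02d}'

packed = date2bits('19500827')
print(packed)              # 998683 -- fits in 3 bytes (< 2**24)
print(bits2date(packed))   # 19500827
```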

    You can test these functions with code like this in a builder window:

    IMPORT BlogCode;
    d1 := '19500827';
    UNSIGNED3 date1 := BlogCode.Date2Bits(d1);
    date2 := BlogCode.Bits2Date(date1);
    ds2 := DATASET([{'InDate',d1},
                    {'OutDate',date2}],
                   {STRING10 txt,STRING10 val});
    OUTPUT(ds2,NAMED('Bits2Date_' + d1));

    Viewing the Bitmap

    Sometimes you just want to see what the bitmap actually looks like. You can, of course, just output the integer value, but a bitmap is best viewed in binary representation. So when Peter Vennel needed to create a string representation of a bitmap, here's my code to do that:

    IMPORT $;
    EXPORT Bit2Str(UNSIGNED4 Bin) := FUNCTION
      STRING1 Bin2Str(UNSIGNED1 Bit) := IF($.Bit(Bin,Bit).IsOn,'1','0');
      RETURN Bin2Str(32) + Bin2Str(31) + Bin2Str(30) + Bin2Str(29) + 
             Bin2Str(28) + Bin2Str(27) + Bin2Str(26) + Bin2Str(25) + 
             Bin2Str(24) + Bin2Str(23) + Bin2Str(22) + Bin2Str(21) + 
             Bin2Str(20) + Bin2Str(19) + Bin2Str(18) + Bin2Str(17) + 
             Bin2Str(16) + Bin2Str(15) + Bin2Str(14) + Bin2Str(13) + 
             Bin2Str(12) + Bin2Str(11) + Bin2Str(10) + Bin2Str(9) + 
             Bin2Str(8)  + Bin2Str(7)  + Bin2Str(6)  + Bin2Str(5) + 
             Bin2Str(4)  + Bin2Str(3)  + Bin2Str(2)  + Bin2Str(1);
    END;

    This simple function just relies on the previously defined Bit.IsOn function to generate either a "1" or "0" character in each bit position. The result is just the concatenation of all 32 characters, showing you exactly what the bitmap looks like.

    Test with this code in a builder window and you'll see the result:

    IMPORT BlogCode;
    val := 532860;
    OUTPUT(BlogCode.Bit2Str(6));   //00000000000000000000000000000110
    OUTPUT(BlogCode.Bit2Str(val)); //00000000000010000010000101111100
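    For checking results like the ones above on the fly, Python's format mini-language gives the same 32-character view in one line (a hypothetical helper, not part of BlogCode):

```python
def bit2str(value: int, width: int = 32) -> str:
    # Zero-padded binary rendering, most significant bit first.
    return format(value & ((1 << width) - 1), f'0{width}b')

print(bit2str(6))        # 00000000000000000000000000000110
print(bit2str(532860))   # 00000000000010000010000101111100
```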

    Convert to Hexadecimal

    I got an email from Jarvis Robinson, who needed a way to display text and the corresponding Hexadecimal values side by side. So I wrote this String2HexString function to do that:

    EXPORT String2HexString(STRING DataIn) := FUNCTION
      STRING2 Str2Hex(STRING1 StrIn) := FUNCTION
        STRING1 HexVal(UNSIGNED1 val) := '0123456789ABCDEF'[val+1]; //hex digit lookup
        UNSIGNED1 Char1 := (>UNSIGNED1<)StrIn >> 4;
        UNSIGNED1 Char2 := ((>UNSIGNED1<)StrIn & 00001111b);
        RETURN HexVal(Char1) + HexVal(Char2);
      END;
      OutRec := {STRING Hex{MAXLENGTH(1024)}}; 
      ds := DATASET(LENGTH(TRIM(DataIn)),
                    TRANSFORM(OutRec,
                              SELF.Hex := Str2Hex(DataIn[COUNTER])));
      HexOut := ROLLUP(ds,TRUE,TRANSFORM(OutRec,SELF.Hex := LEFT.Hex + RIGHT.Hex));
      RETURN DATASET([{DataIn,HexOut[1].Hex}],
                     {STRING Txt{MAXLENGTH(1024)},
                      STRING Hex{MAXLENGTH(1024)}});
    END;

    This function starts with the Str2Hex function nested within, which produces two hexadecimal characters for a single string character. It has its own HexVal function that returns the hex character for a given value. The Char1 definition uses the (>UNSIGNED1<) type transfer shorthand syntax to treat the StrIn character as an UNSIGNED1 value, so that the bitshift right operator (>>) will work (all bitwise operations can only be done on integer types). This simply moves the left 4 bits into the right-hand nibble, making the UNSIGNED1 a value from zero through 15. The Char2 definition also uses the (>UNSIGNED1<) type transfer shorthand syntax, so that the Bitwise AND operator (&) will work. This simply removes the left 4 bits, again making the UNSIGNED1 a value from zero through 15. The return value concatenates the two hex characters into the STRING2 result.

    The key to this function being able to work with any size string (up to 512 bytes, as written, but you can modify that if needed) is the DATASET definition. This form of DATASET creates a new dataset with as many records as there are characters in the DataIn parameter. More importantly, the inline TRANSFORM populates each record with the hexadecimal equivalent of each character in DataIn.

    That leaves only the ROLLUP to composite all those records into a single one with all the Hex values concatenated into a single string. The inline form of DATASET then allows a nicely formatted one-record result to present both the input text and its hex equivalent as the result.
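    The nibble arithmetic at the heart of Str2Hex is worth seeing in isolation. A rough Python equivalent (hypothetical; Python's built-in bytes.hex would also do the whole job):

```python
def string2hex(data: str) -> str:
    digits = '0123456789ABCDEF'
    # For each byte: high nibble (>> 4) then low nibble (& 0x0F),
    # each indexed into the hex digit lookup string.
    return ''.join(digits[b >> 4] + digits[b & 0x0F] for b in data.encode('latin-1'))

print(string2hex('efgh\tBCD'))   # 6566676809424344
```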

    Testing the code this way will show its usefulness:

    IMPORT BlogCode;
    STRING dat := 'efgh\tBCD'; //x'6566676809424344'
    OUTPUT(BlogCode.String2HexString(dat));

    Note that the input string contains a tab (\t) character, which can be difficult to see in text mode. But looking at the result in hex mode, you can clearly see the 09 hex representation of the tab.

    That's enough for this article. We'll continue with more useful code examples in subsequent articles.

    Adventures in GraphLand (Part I)

    Well over a decade ago, I had the privilege of being one of the first programmers to use ECL. A new language at a new level of abstraction allowed us to think about data absent of process. We were no longer working in the realm of time and method, but in the realm of data. We wanted a name for this place that our minds could now wander; so we called it ‘Dataland’. Since then, many hundreds of programmers have found their way through Dataland; usually heading onto our bigger, stronger but possibly less romantic environments. A few linger to test out new ideas, new platform features or better ways to do things we have done in the past; but ultimately the dream of being able to play with data is no longer a dream: It is the reality for hundreds of people every day.

    The question naturally arises, at least to those of us with very short attention spans, what if we could abstract ourselves a level further? What if we could leave behind the world of tables and joins and instead operate in a world where the knowledge just ‘was’? Where questions could be asked and answered; knowledge derived and, somehow, all the data just fell into line? That is the promise of KEL the Knowledge Engineering Language being produced by Eric Blood, Sr Consulting Software Engineer, and his team.

    Once again, I am in the privileged position of being among the first programmers to use KEL. As I was hammering out some of my first programs, often with the cursing and mutterings of someone that is well beyond laceration edge, I realized that the language was changing the way I thought about the data. This was not just a way to navigate Dataland faster; I was entering a new place: GraphLand.

    Most of my early travails in Dataland have been lost in the sands of time, a stream of adrenaline or the fog of amnesia. As the situation is now (slightly) more leisurely, I thought it might be worth recording some of my early adventures in GraphLand with the hope of paving the way for those that follow.

    Naturally big data gets exciting because it is big and real-world; Jo Prichard, Sr Architect and Data Scientist, is exploring some big and very interesting datasets housed within LexisNexis, as we speak. I have given myself a rather different task; I am creating and using a micro-dataset so that I can hand-compute, test and visualize exactly what is happening. My experience is that things that are understood in microcosm can then be intuited at scale.

    Sadly, before GraphLand can be truly explored, a bridge needs to be built from Dataland to GraphLand. Much like the cooking shows, where they suddenly announce: “here is one I prepared earlier”, many writings on graph systems assume that somehow the graph magically sprung forth into existence. In contrast, I aim to spend the rest of this blog describing one such bridge (albeit a simple one).

    We are going to start with two datasets; one a simple table of people, and the other one a table indicating relationships between them. In our Dataland model we have given all of our entities unique identifiers (indeed, a huge amount of the work done in Dataland is precisely to manufacture these unique identifiers). For this micro-dataset I have allocated nice small numbers by hand.



    Within KEL I now need to do two things; tell KEL what my GraphLand entities will look like; and tell KEL where in Dataland to find them. For such a simple data model the first is almost trivial:

    Person := ENTITY(FLAT(UID,Name,INTEGER Age));
    Relationship := ASSOCIATION(FLAT(Person who, Person whoelse));

    We declare that there are people, with three properties, a unique ID, a name and an age. We then declare that relationships are associations between two people. That’s the GraphLand model done.

    For the people; telling KEL where to find the Dataland version of the file is equally easy:

      USE File_Person(FLAT,Person);

    File_Person is the Dataland attribute containing my microfile; and we are telling KEL to extract people from it. The slightly harder case is the relationships. In my Dataland file I have one record for a relationship between two people. In GraphLand I want to represent the fact that either person can be the primary (who) person in that relationship. This is handled in the USE statement:

    USE File_Relat(FLAT,Relationship(who=person1,whoelse=person2))
    USE File_Relat(FLAT,Relationship(who=person2,whoelse=person1))

    You can pull multiple entities and associations from a single record of a file. Some of our more complex records (such as a vehicle sale) might have half a dozen entities and relationships in one record, although the precise nature of the linkages is normally encapsulated entirely within people’s minds or within labels upon fields. This implicit structure in Dataland is rendered explicit as the Dataland data is bridged into GraphLand.

    Having brought the data in, it is usually worth ensuring that nothing got lost; the easiest way is to get GraphLand to do a data dump.

     QUERY:Dump <= Person,Relationship;


    If you look at the data, not much has changed (this is good). Each field has gained a __flags column. As things get more complex, KEL will start to use those flags to capture metadata about each field value that is computed. There is also a __recordcount to tell us how many Dataland records attest each fact. The final thing to notice is that we have twice as many relationships due to our use of the two USE statements.

    Logically I should now declare success and finish this entry; however I don’t want to leave without giving some small inkling of why we would bother to have done all this work. So here is a modified version of the KEL query, in which I use the relationships to give myself a record in which my relations are listed.

    QUERY:Dump1 <= Person(UID,Name,Age,{Relationship.whoelse.Name});


    For those of you back in the land of tables and joins, producing the table above from the raw data requires two joins and a project (minimum). Here in GraphLand things are rather different.

    Why SKIPing can be bad for your query

    The most common way of filtering out records is to apply a filter to a dataset. E.g.,

    Centenarians := myDataset(age >= 100);

    However, sometimes the easiest place to decide whether or not a record is required is within the transform that creates it. ECL has the SKIP keyword to indicate that a record shouldn’t be generated from a transform (1). SKIP has two different syntaxes:

    1. SKIP as an attribute on a transform

    outRecord process(inRecord l) := TRANSFORM,SKIP(l.age >= 100)

    Here the transform doesn’t generate a record if the skip condition is true.

    2. SKIP as a side-effect.

    outRecord process(inRecord l) := TRANSFORM
       somethingComplex := f(l);
       SELF.next := IF(somethingComplex, l.next, SKIP);
       // or, skipping via a CASE default:
       // SELF.next := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, SKIP);
    END;

    Here the transform doesn’t generate a record if the SKIP expression is evaluated. This is often used when the expression being tested is complex, and the output record is being rejected as an exception.

    However this form also comes with a cost. The problem is that because the assignment to the field next has the side-effect of potentially skipping, the code generator cannot optimize the field away if it isn’t subsequently used.

    Both forms of SKIP can also prevent the transform from being merged with subsequent activities (2).

    So what are the alternatives?

    For the situations where nothing else will do (e.g., SKIP on a JOIN transform), it is much better to use the first form than the second – because it doesn’t prevent output fields being optimized away. But what if the filter condition is something complex and you don’t want to duplicate the ECL? You can continue to share common definitions by defining a function that returns a transform. E.g.,

    process(inRecord l) := FUNCTION
       somethingComplex := f(l);
       outRecord t := TRANSFORM, SKIP(NOT somethingComplex)
          SELF := l;  // plus any other field assignments
       END;
       RETURN t;
    END;

    You can also implement something similar with a complex test condition - if you can use an exception value for the case that will be skipped:

    process(inRecord l) := FUNCTION
       nextValue := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, 0); // 0 isn’t otherwise a valid value.
       outRecord t := TRANSFORM, SKIP(nextValue = 0)
          SELF.next := nextValue;
          SELF := l;
       END;
       RETURN t;
    END;

    For a PROJECT you are probably better off removing the SKIP from the transform, and filtering the resulting dataset instead. That will allow the filter condition to potentially be migrated through the query (skips are not currently migrated outside a transform).
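    The equivalence between skipping inside the transform and filtering afterwards can be sketched in Python (a hypothetical example; a sentinel of 0 stands in for SKIP, as in the CASE above):

```python
rows = [{'kind': 1, 'x': 10, 'y': 0, 'z': 0},
        {'kind': 7, 'x': 1,  'y': 2, 'z': 3}]   # kind 7 has no mapping

def next_value(row):
    # CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, 0): 0 is the "skip" sentinel.
    return {1: row['x'], 2: row['y'], 3: row['z']}.get(row['kind'], 0)

# "SKIP inside the transform": reject while producing ...
skipped = [next_value(r) for r in rows if next_value(r) != 0]
# ... versus "project then filter", which an optimizer can move around freely:
projected = [next_value(r) for r in rows]
filtered = [v for v in projected if v != 0]
assert skipped == filtered == [10]
```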

    In summary… If you need to use SKIP, try to use the first form. If you can replace SKIP with a filter you are likely to end up with even better code.

    (1) In some activities like ITERATE, SKIP can have the subtly different meaning of don’t process this record.
    (2) A couple of reasons why: If the activities are different then SKIP may mean something different, and when SKIP is used as a side-effect, combining transforms may cause the SKIP to be lost.

    Another year almost gone

    With the holidays almost upon us and while we start to wind down to spend some quality time with our families, I thought that it would be a good opportunity to recount some of the things that we have done during 2013.

    For starters, a major release of the platform (4.0) and several minor releases are a tribute to the hard work of the entire HPCC Systems community. The full self-paced online training content for HPCC, covering basic and advanced Thor, Roxie and ECL topics, programming exercises and self-assessment materials, is also a significant step up from our more traditional face-to-face training, allowing hundreds or thousands of people to learn HPCC from the comfort of their sofas. The addition of the ECL bundle functionality is another great example of a relatively simple enhancement that can go a long way to open the door for code sharing and reuse, and we will be building on this concept in 2014 to support seamless download and installation of ECL bundles from public and private repositories. We also made some good progress in integration and tooling, with the release of the technical preview of the new ECLWatch, the enhanced support for Hadoop HDFS, an improved JDBC/SQL driver and myriad enhancements in the areas of graphical user interfaces to HPCC, charts and dashboards. Outside of the platform itself, a substantial amount of progress in our academic and research outreach programs has started to create good traction in those communities too.

    But it's not just about the past, and there are already exciting things happening in our development branch, including state of the art optimizations to better utilize the existing hardware (multiple CPU cores per node in Thor, for example), porting more machine learning algorithms to the new PB-BLAS framework for vectorized distributed linear algebra arithmetic and the ongoing work in supporting available CUDA compatible hardware for accelerated processing. And this doesn't even start to scratch the surface for what will come in 2014.

    But if you got to this point, you probably already spent too much time reading and should be going back to celebrating.

    Have a fun, safe and enjoyable Holiday Season with your family, and come back for more in 2014!


    Often an ECL query will consist of a series of results or actions which should all be executed together. Historically there have been two ways of grouping actions together.

    P := PARALLEL(a1, a2, a3, a4);

    PARALLEL indicates that all of the actions can be executed in parallel. If any intermediate values are common to more than one of the actions, then the values should only be evaluated once, and the result reused by each action.

    S := SEQUENTIAL(a1, a2, a3, a4);

    SEQUENTIAL is used when actions need to be executed in a particular order – e.g., when ingesting data, processing it, and then marking the processed data as live. However SEQUENTIAL also has an additional effect which often isn’t realised. Intermediate values are not shared between any of the different sequential actions. Why?

    Say action ‘a1’ counts the records in a superfile, action ‘a2’ updates the contents of the superfile, and action ‘a3’ counts the new records in the same superfile. If intermediate values were shared between the actions then the count in ‘a3’ would use the out-of-date count evaluated for ‘a1’, rather than the count of the new superfile. For a similar reason, actions in a SEQUENTIAL do not share any values with expressions evaluated outside the SEQUENTIAL.
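    A Python analogy (hypothetical) makes the hazard concrete: if the count computed for the first action were cached and reused, the third action would see stale data.

```python
superfile = ['rec1', 'rec2', 'rec3']

def count_records():
    return len(superfile)

# a1: count the records
before = count_records()
# a2: update the contents of the "superfile"
superfile.append('rec4')
# a3: count again -- this must be re-evaluated, not reused from a1
after = count_records()
assert (before, after) == (3, 4)
```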

    Unfortunately this means that if the actions share some complicated processing then that work is going to be repeated. This is probably the commonest cause of code being executed twice, and something to watch out for. So how can you avoid it?

    If two sequential actions share a value which should only be evaluated once, then marking it with :INDEPENDENT will ensure the evaluation is shared. However it is a poor solution for a couple of reasons. Firstly, the ECL programmer doesn’t always know ahead of time which bits will be shared – they may not have written the code for the actions. Secondly, adding INDEPENDENT can also cause other code to be duplicated (because values will no longer be shared between an INDEPENDENT value and the context that expression is used in).

    One example I saw recently had several pieces of similar code, which each had some preprocessing, main processing and post processing. They had each been coded as

    myAction := SEQUENTIAL(preprocessing, mainprocessing, postprocessing);

    and all of the items were then executed in parallel:

    myActions := PARALLEL(myAction1, myAction2, myAction3);

    The problem is that the different main processing actions all used some common values, but because they were inside separate SEQUENTIALs the code was being duplicated.

    What would be a way to avoid it?
    One solution would be for each of the processes to be defined as a module:

    myAction := MODULE
      EXPORT preprocessing :=…
      EXPORT mainprocessing :=…
      EXPORT postprocessing :=…
    END;

    And then combine the different actions into another module:

    myActions := MODULE
      EXPORT preprocessing := PARALLEL(myAction1.preprocessing, myAction2.preprocessing);
      EXPORT mainprocessing := PARALLEL(myAction1.mainprocessing, myAction2.mainprocessing);
      EXPORT postprocessing := PARALLEL(myAction1.postprocessing, myAction2.postprocessing);
    END;

    And then only in the query itself combine them with SEQUENTIAL.

    SEQUENTIAL(myActions.preprocessing, myActions.mainProcessing, myActions.postprocessing);

    If you end up doing this frequently you could use a virtual module, and simplify the whole process. (Left as an exercise for the reader…)
    So in conclusion be careful about the use of SEQUENTIAL – it can cause expressions to be evaluated multiple times. If you’re likely to combine SEQUENTIAL items, then it is worth spending some time thinking about the best way of structuring your ECL.

    P.S. Version 4.2 introduces a new keyword – ORDERED. It has the ordering requirements of SEQUENTIAL, but without the additional constraint. I’m not 100% sure it is a good idea! It is probably most useful for ordering actions which do not have anything in common – e.g. generating files and then sending emails - but use it with care. If there is any chance of a shared value which may change meaning you need to use SEQUENTIAL.

    Introduction... optimizing field usage

    This blog is going to be a bit of an eclectic mix. It will cover the background behind some of the features being added to the platform, what to watch out for when using ECL, and maybe a bit more detail about what goes on under the covers.

    The first topic I want to look at is the way fields are optimized, and the motivation behind a new optimization in 4.2….

    One of the optimizations that the ecl code generator (eclcc) performs is to reduce the number of fields being processed at each stage to a minimum. Say you have a dataset that contains names, ages and extra details:

    nameRecord := RECORD
      STRING name;
      UNSIGNED1 age;
      STRING address;
      STRING details;
    END;
    EXPORT namesTable := DATASET('names', nameRecord, FLAT);

    If you want to find the name of the oldest person, you could do it as follows:

    sortedNames := SORT(namesTable, -age);
    oldest := CHOOSEN(sortedNames, 1);
    OUTPUT(oldest, { name });

    If the platform executed this exactly as written, it would read the data for address and extra details from the disk, retain it in memory for the sort, and throw it away at the end. Instead eclcc tracks which fields are required at each stage in the query, and then uses that to minimize the record size at each stage. It translates the previous code into the equivalent of the following query:

    // the following projection is combined with the disk read
    projectedNames := TABLE(namesTable, { name, age });
    sortedNames := SORT(projectedNames, -age);
    oldest := CHOOSEN(sortedNames, 1);
    OUTPUT(oldest, { name });

    (Actually it does better than that, but that’s a different topic… try it and see.)

    The field projection is performed as part of the activity that reads the data from the disk – which avoids copying and retaining a large amount of data.
    For this simple case the ECL programmer could do the work himself, but why waste the effort when it will be done for you? The real power comes when you start building up a complex library of definitions. The fields can often only be removed because of the combination of definitions that are used.
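    The transformation eclcc applies can be mimicked by hand in Python (hypothetical data): project away the wide fields before the expensive operation, so only the narrow records are ever held in memory.

```python
people = [
    {'name': 'Ann', 'age': 34, 'address': '1 High St', 'details': 'x' * 1000},
    {'name': 'Bob', 'age': 71, 'address': '2 Low Rd',  'details': 'y' * 1000},
]

# TABLE(namesTable, {name, age}): drop the wide fields before sorting
projected = [{'name': p['name'], 'age': p['age']} for p in people]
# SORT(projectedNames, -age) followed by CHOOSEN(sortedNames, 1)
oldest = sorted(projected, key=lambda p: -p['age'])[0]
print(oldest['name'])   # Bob
```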

    Now let us modify our example a bit. Instead of storing the data in one file, the data is now stored in two flat files which are joined to get the information:

    rawNameRecord := RECORD
      UNSIGNED id;
      STRING name;
      UNSIGNED1 age;
    END;

    extraRecord := RECORD
      UNSIGNED id;
      STRING address;
      STRING details;
    END;

    rawNamesTable := DATASET('names', rawNameRecord, FLAT);
    extraTable := DATASET('extra', extraRecord, FLAT);
    EXPORT namesTable := JOIN(rawNamesTable, extraTable, LEFT.id = RIGHT.id, LEFT OUTER);

    A LEFT OUTER join is used because the extra information is optional, and only present in the extra table if it is provided. This exported definition of namesTable contains the same information (with an extra id field), but the way it is created is different.

    After the field usage has been optimized by eclcc it will become something equivalent to

    projectedExtraTable := TABLE(extraTable, { id });
    EXPORT namesTable := JOIN(rawNamesTable, projectedExtraTable, LEFT.id = RIGHT.id, LEFT OUTER);

    That is quite good, but in this case we don’t actually need to do the join (since we’re not using any of the information that the join fills in). The problem is that in general a JOIN generates an output record for each right record that matches a record from the left. So the JOIN cannot be removed automatically because that may change the number of times each record from the left is duplicated. To remove it the code generator needs a hint that there can be at most one match.
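    Why the at-most-one guarantee matters can be sketched in Python (hypothetical data): with at most one right-hand match per left record, a left outer join returns exactly one output row per left row, so if none of the right-hand fields are used, removing the join changes nothing.

```python
left = [{'id': 1, 'name': 'Ann', 'age': 34},
        {'id': 2, 'name': 'Bob', 'age': 71}]
extra = {1: {'address': '1 High St'}}   # keyed by id: at most one match each

# LEFT OUTER join with at most one right match per left record
joined = [dict(l, **extra.get(l['id'], {})) for l in left]

# one output row per input row, whether or not a match existed ...
assert len(joined) == len(left)
# ... so a query that only reads name/age gets identical results without the join
assert [(r['name'], r['age']) for r in joined] == [(l['name'], l['age']) for l in left]
```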

    And so finally to the new optimization in 4.2…. Adding ATMOST(1) onto the JOIN definition gives it the hint that it needs.

    EXPORT namesTable := JOIN(rawNamesTable, extraTable, LEFT.id = RIGHT.id, LEFT OUTER, ATMOST(1));

    With this in place eclcc knows it can remove the entire join, and reduces it to the equivalent of

    EXPORT namesTable := TABLE(rawNamesTable, { name, age });

    Its effect can be even more drastic on complex graphs (as an exercise, imagine what COUNT(sortedNames) can be reduced to), and I have seen a significant effect on some queries.

    This is actually a common situation, where you are combining data from multiple sources. If there is a 1 to 0 or 1 relationship then include ATMOST(1) on your join. As well as providing useful documentation on the intention to other ECL programmers, you may find whole chunks of your queries disappearing because they’re not needed – especially once you upgrade to version 4.2.

    (See issue HPCC-10149 if you want more details of the optimization.)

    HPCC Summit 2013

    Last week, close to 300 people met in Delray Beach, Florida, to follow an intensive and densely packed agenda full of technology content on the HPCC Systems platform. It was our third annual HPCC Summit event and doubled its attendance from the previous year. Besides the great food, party and accommodations (yeah, I thought I should mention that too), there were many plenary and break-out sessions covering a broad range of areas including HPCC roadmap, new developments, recent enhancements, applications, integration with third party platforms and more.

    Some of the presentations were plain brilliant, providing material that would take weeks to fully digest, and many of the attendees will be watching the video recordings of them for months to come (in case you were wondering, we also recorded most of the presentations). And some very funny videos entered our video competition, with the winner implementing an HHPCC (Human HPCC): worth a watch if you're up for some light humor.

    An interesting aspect of the conference was the contest on ECL Bundles, which got the competitive juices flowing and brought very good submissions from several community members, both internal and external to LexisNexis. Being one of the judges proved quite difficult, as it was hard to pick a single winner, and we ended up giving prizes (neat iPad minis) to both the winner and the runner-up. We also ended up disqualifying a submission from a clever to-remain-unnamed contestant who decided that it was a good idea to submit work done for his/her regular job as an entry to this contest (by the way, that specific piece of work is quite sophisticated and very useful, and you will see it permeate into the platform in our upcoming 4.0.2 release).

    And speaking of contributions, as one of the ideas floated during the conference, there is an ongoing effort to create ECL code samples for the Rosetta Code project, so if you have some time to kill and/or a neat idea on how to implement one of the Rosetta Code examples in ECL, feel free to head over to their site and submit it. These entries will surely be useful to people trying to get started in ECL and/or trying to learn new ECL coding tricks.

    We'll continue to make some of the material used during the HPCC Summit 2013 publicly available over the next few weeks, but if you are particularly interested in something, please do not hesitate to ask.

