JOINS…
Different Types of Joins
Matching records from multiple data sources is one of the fundamental operations you need to process data. As you would expect ECL makes it easy – but the number of options can be bewildering. The following aims to guide you through the different options, and explain when they are appropriate.
The “simplest” way to match records between two datasets is to pair each record in one dataset with each record in the other dataset, and then filter the pairs to extract the matches. This is what JOIN(,ALL) does. Unfortunately as well as being simple it is likely to be very slow unless the datasets are very small. If the first dataset has M records, and the second N, then it will typically execute in O(MN) time. Thankfully if the match condition contains any equalities then there are much better approaches.
One of these approaches is to sort both datasets by the equality fields, and then step through the two datasets in step. You have the overhead of sorting the datasets which is typically O(MlnM) + O(NlnN), and finding the matching pairs is typically O(M+N). This is generally much quicker than the exhaustive approach above, and is the normal approach used in Thor.
Roxie uses a different implementation by default – instead of sorting the two datasets it starts by creating a hash table from the second dataset. Then each row from the first dataset is looked up in the hash table to find the matches. With the current implementation the “right” dataset is still sorted (to simplify duplicate handling and many lookups), but the order of the left is preserved. If the right dataset is relatively small then this will be significantly faster than sorting both datasets. This type of join can be selected in Thor by adding a “,MANY LOOKUP” attribute.
If Roxie uses a LOOKUP join by default why doesn’t Thor? The limitation of a LOOKUP JOIN is that the entire right dataset needs to be held in memory. That is a reasonable assumption to make for Roxie, where the datasets are generally small, but it isn’t for Thor.
So far the discussion has assumed it is easy to compare a pair of rows, but it is only efficient if the rows are present on the same node in the cluster. Different methods are used in Thor to ensure the rows are present on the same node:
- ALL joins and LOOKUP joins clone all the rows from the right dataset onto each of the nodes. This is only really efficient if the right dataset is reasonably small.
- The sorted joins normally globally sort the first dataset, and use the distribution of records created from the first sort to distribute the second dataset to match.
- An alternative is the HASH join. This distributes both datasets by a hash of the equality fields, and then performs local sorts after the datasets have been distributed.
If the ECL user knows already that the rows from the datasets must be on the same node, then they can opt to use a LOCAL variant of the join to avoid any redistribution. Sometimes, if the distribution of the left dataset is already known, the code generator will automatically distribute the right dataset to match, and then perform a local variant of the join.[+]
So in summary…
- If the join conditional has no equalities then you need to use ALL.
- If the right dataset is small you can use MANY LOOKUP.
- Otherwise use a global sort.
- You can use HASH to change the way distributions occur. (Sometimes this improves the performance.)
- If matching rows are already on the same nodes you can add ,LOCAL.
But what if you don’t know how large the right dataset is? Using a standard join will work whatever the size of the datasets, but if the right is small then a lookup join would be significantly more efficient. One way to optimize for small datasets is to use conditional code based on an estimate of the size of the dataset. Unfortunately it makes the graphs ugly and confusing, and has to be pessimistic to avoid jobs failing because of insufficient memory. Why can’t it be done automatically? That’s where the new Thor SMART join comes in.
The SMART join processing code starts off assuming the right dataset is small, but gradually falls back to implementations that can cope with larger number of records. It goes through the following sequence:
- Initially it used a MANY LOOKUP, duplicating the rows onto each node.
- If it runs low on memory it uses a hash of the equality fields to distribute each row to a particular node. It then distributes the left dataset to match that distribution and uses a LOCAL MANY LOOKUP join on each node.
- If there is still insufficient memory for the local hash table, the implementation falls back using a local join which sorts both sides on the re-distributed datasets. (Essentially a HASH JOIN).
There is little penalty from using a SMART join, but potentially a significant advantage for small datasets. If you are joining to a dataset of unknown size then it is worth experimenting by adding “,SMART” to your join. It is quite likely to become the default join implementation for Thor in a future version.
(Note, a SMART join can’t be the default implementation of a MANY LOOKUP join because a smart join may change the distribution of the left dataset . The code generator tracks the distribution of each dataset and use the information to optimize the query – those assumptions and optimizations would be broken if the dataset was redistributed.)
Pseudo Joins
There are few examples of ECL usage which are effectively joining two datasets together, but it is cunningly disguised…
- An implicit dataset lookup:
otherDataset(myfield=searchvalue)[1].x
This occurs when a dataset is being searched for a field that matches a value, and then extracting a field from the matching row. If the dataset is small you are much better off using a new DICTIONARY support. Otherwise it would be much better to JOIN against the dataset (using a LEFT OUTER join if there may be no matching element, and KEEP(1) if there might be more than one).
- Filtering by existance in a set:
myDataset(searchValue IN SET(otherDataset, value))
This code is searching to see if a value is contained in another dataset. Again if the dataset is small a dictionary is likely to be much more efficient. If it is large then an INNER JOIN (with KEEP (1)) would often be more appropriate.
Other JOIN options.
Sometimes most rows match relatively few rows in the other dataset, but there are a few instances of the match condition which result in very large numbers of records (e.g., john smith). Many of the options on a join are there to help cope with these pathological matches.
- LIMIT(n)
If there are more than n matches for a particular row then abort the job.
Note: LIMITs and ATMOSTS are only applied to the portion of the join condition that uses equalities. (Often called the hard match condition in some of the documentation.) - LIMIT(n, SKIP)
If there are more than n matches for a particular row exclude any matches for this row from the output.
- ATMOST(n)
If there are more than n matches for a particular row the treat it as if it didn’t match anything.
It is very similar to LIMIT(n, SKIP), but the difference is that for a LEFT OUTER join ATMOST will generate an output record. This allows you to return information about rows that had too many matches in the results.
It is worth adding ATMOST(1) to a LEFT OUTER join that can only match 1 record. It will allow the code generator to remove the join if it isn’t needed. - ATMOST({cond1, cond2, .., condn},n)
If there are more than n matches for a particular row, apply each of the extra optional conditions until the number of matches is below the threshold.
This allows you to apply a fairly broad join criteria for most pairs of rows, but add extra criteria if the equality fields are very common (e.g., some last names are much more common than others, so add extra conditions on first name, or other criteria). It is often used when the match condition also includes an edit distance match function to ensure the number of pairs processed by the edit distance function is reasonable. - KEEP(n)
Only generate output records for the first n rows that match the full join condition.
There are a few other options which are detailed in the language reference manual, which can help in fairly specific circumstances, but generally you should only add options for good reason.
(See http://hpccsystems.com/download/docs/ecl-language-reference)
[+] This new optimization has occasionally caused problems where ECL programmers have added a DISTRIBUTED() operation into some ECL to remove a warning, but the distribution expression wasn’t correct. They had used DISTRIBUTED(dataset, a-arbitrary-expression) instead of DISTRIBUTED(dataset) which indicates the dataset is DISTRIBUTED in an unspecified way. The moral of the story – never lie to the ECL compiler or it might come back to bite you.