What does it take to add a new activity?
This series of blog posts started life as a series of walk-throughs and brainstorming sessions at a team offsite. It looks at adding a new activity to the system. The idea is to give a walk-through of the work involved, to highlight the different areas that need changing, and hopefully to encourage others to add their own activities. In parallel with the description in this blog there is a series of commits to the GitHub repository that correspond to the different stages in adding the activity. Once the blog is completed, the text will also be checked into the source control tree for future reference.
The new activity is going to be a QUANTILE activity, which can be used to find the records that split a dataset into equal-sized blocks. Two common uses are to find the median of a set of data (split into 2) or percentiles (split into 100). It can also be used to split a dataset for distribution across the nodes in a system. One hope is that the classes used to implement QUANTILE in Thor can also be used to improve the performance of the global sort operation.
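To make the idea concrete, here is a minimal Python sketch of "sort, then pick the rows at the split points". It is only an illustration, not the HPCC implementation: the function name `quantile_rows` and the rule used to choose each split point (index `floor(i * n / parts)`) are assumptions made for this example.

```python
def quantile_rows(rows, parts, key=None):
    """Return the rows at the parts-1 split points of the sorted input."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    ordered = sorted(rows, key=key)  # sort, then select
    n = len(ordered)
    if n == 0:
        return []  # empty input is not handled in this sketch
    # Assumed selection rule: split point i is the row at index floor(i * n / parts).
    return [ordered[i * n // parts] for i in range(1, parts)]


# Median: split into 2 parts, which gives a single split point.
median = quantile_rows([5, 1, 4, 2, 3], 2)

# Percentiles: split into 100 parts, which gives 99 split points.
percentiles = quantile_rows(range(1, 1001), 100)
```

Under this assumed rule the median of the five values above is the middle row of the sorted list. A different rounding rule would pick slightly different rows when the input does not divide evenly.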
It may seem fatuous, but the first task in adding any activity to the system is to work out what that activity is going to do! You can approach this in an iterative manner – starting with a minimal set of functionality and adding options as you think of them – or start with a more complete initial design. We have used both approaches in the past to add capabilities to the HPCC system, but on this occasion we will be starting from a more complete design – the conclusion of our initial design discussion:
“What are the inputs, options and capabilities that might be useful in a QUANTILE activity?”
The discussion produced the following items:
- Which dataset is being processed?
This is always required and should be the first argument to the activity.
- How many parts to split the dataset into?
This is always required, so it should be the next argument to the activity.
- Which fields are being used to order (and split) the dataset?
Again this is always required, so the list of fields should follow the number of partitions.
- Which fields are returned?
Normally the input row, but often it would be useful for the output to include details of which quantile a row corresponds to. To allow this an optional transform could be passed the input row as LEFT and the quantile number as COUNTER.
- How about first and last rows in the dataset?
Sometimes it is also useful to know the first and last rows. Add flags to allow them to be optionally returned.
- How do you cope with too few input rows (including an empty input)?
After some discussion we decided that QUANTILE should always return the number of parts requested. If there were fewer items in the input they would be duplicated as appropriate. We should provide a DEDUP flag for the situations when that is not desired.
If there is an empty dataset as input then the default (blank) row will be created.
- Should all rows have the same weighting?
Generally you want the same weighting for each row. However, if you are using QUANTILE to split your dataset, and the cost of the next operation depends on some feature of the row (e.g., the frequency of the firstname) then you may want to weight the rows differently.
- What if we are only interested in the 5th and 95th centiles?
We could optionally allow a set of values to be selected from the results.
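Several of these options are easier to picture with a small model. The Python sketch below is one possible reading of the transform (LEFT and COUNTER), the duplication of rows when the input is too small, the DEDUP flag and the default row for an empty input. The function name, the parameter names (`transform`, `dedup`, `default_row`) and the split-point rule are all assumptions for illustration, not the engine's behaviour.

```python
def quantile(rows, parts, key=None, transform=None, dedup=False, default_row=None):
    """Model of QUANTILE with an optional transform and DEDUP (illustrative only)."""
    ordered = sorted(rows, key=key)
    if not ordered:
        # Assumed reading of "the default (blank) row will be created".
        ordered = [default_row]
    n = len(ordered)
    results = []
    previous_index = None
    for counter in range(1, parts):
        # Assumed selection rule: split point at index floor(counter * n / parts).
        index = counter * n // parts
        if dedup and index == previous_index:
            continue  # DEDUP: do not return the same input row twice
        previous_index = index
        left = ordered[index]
        # The transform sees the input row (LEFT) and the quantile number (COUNTER).
        results.append(transform(left, counter) if transform else left)
    return results


# Two input rows split into four parts: one row has to be returned twice.
duplicated = quantile([20, 10], 4)
deduped = quantile([20, 10], 4, dedup=True)
tagged = quantile([20, 10], 4, transform=lambda left, counter: (counter, left))
```

With two rows and four parts this model returns three split points, one of them repeated, unless `dedup` is set.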
There were also some implementation details concluded from the discussions:
- How accurate should the results be?
The simplest implementation of QUANTILE (sort and then select the correct rows) will always produce accurate results. However, there may be some implementations that can produce an approximate answer more quickly. Therefore we could add a SKEW attribute to allow early termination.
- Does the implementation need to be stable?
In other words, if there are rows with identical values for the ordering fields, but other fields not part of the ordering with different values, does it matter which of those rows are returned? Does the relative order within those matching rows matter?
The general principle in the HPCC system is that sort operations should be stable, and that where possible activities return consistent, reproducible results. However, that often has a cost – either in performance or memory consumption. The design discussion highlighted the fact that if all the fields from the row are included in the sort order then the relative order does not matter because the duplicate rows will be indistinguishable. (This is also true for sorts, and following the discussion an optimization was added to 5.2 to take advantage of this.) For the QUANTILE activity we will add an ECL flag, but the code generator should also aim to spot this automatically.
- Returning counts of the numbers in each quantile might be interesting.
This has little value when the results are exact, but may be more useful when a SKEW is specified to allow an approximate answer, or if a dataset might have a vast number of duplicates. It is possibly something to add to a future version of the activity. For an approximate answer, calculating the counts is likely to add an additional cost to the implementation, so the target engine should be informed if this is required.
- Is the output always sorted by the partition fields?
If this naturally falls out of the implementations then it would be worth including it in the specification. Initially we will assume not, but will revisit after it has been implemented.
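The stability question can be shown with a few lines of Python, whose built-in `sorted` is a stable sort. The rows and field layout here are invented for the example. When only some fields are in the sort order, the result depends on the original input order, so stability is observable. When every field is in the sort order, the result is the same whatever order the input arrived in.

```python
# Each row is (name, id); the data is made up for this illustration.
rows = [("smith", 3), ("jones", 1), ("smith", 1), ("jones", 2)]


def by_name(row):
    return row[0]


# Sorting on name only: rows with equal names keep their input order,
# so the same rows presented in a different order give a different result.
partial_forward = sorted(rows, key=by_name)
partial_reversed = sorted(reversed(rows), key=by_name)

# Sorting on every field: the input order no longer affects the result.
full_forward = sorted(rows)
full_reversed = sorted(reversed(rows))
```

The second pair is the case where an unstable algorithm is safe to use, because any row that could be swapped with another is identical to it.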
After all the discussions we arrived at the following syntax:
QUANTILE(<dataset>, <number-of-ranges>, { sort-order } [, <transform>(LEFT, COUNTER)]
[,FIRST][,LAST][,SKEW(<n>)][,UNSTABLE][,SCORE(<score>)][,RANGE(set)][,DEDUP][,LOCAL])
FIRST - Match the first row in the input dataset (as quantile 0)
LAST - Match the last row in the input dataset (as quantile <number-of-ranges>)
SKEW - The maximum deviation from the correct results allowed. Defaults to 0.
UNSTABLE - Is the order of the original input values unimportant?
SCORE - What weighting should be applied for each row. Defaults to 1.
RANGE - Which quantiles should actually be returned. (Defaults to ALL).
DEDUP - Avoid returning a match for an input row more than once.
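As a rough guide to how FIRST, LAST and RANGE might combine, the Python sketch below lists which quantile numbers would be returned. It assumes that LAST corresponds to a quantile number equal to the number of ranges and that RANGE simply filters the numbers that would otherwise be produced. Both are assumptions for this example, as are the function and parameter names.

```python
def selected_quantiles(parts, first=False, last=False, wanted=None):
    """List the quantile numbers returned for a given set of flags (illustrative)."""
    numbers = list(range(1, parts))  # the interior split points
    if first:
        numbers.insert(0, 0)         # FIRST: the first row, as quantile 0
    if last:
        numbers.append(parts)        # LAST: assumed to be quantile `parts`
    if wanted is not None:
        # RANGE: assumed to keep only the requested quantile numbers
        numbers = [q for q in numbers if q in wanted]
    return numbers


# Only the 5th and 95th centiles of a split into 100 parts.
centiles = selected_quantiles(100, wanted={5, 95})

# Quartiles including the first and last rows.
quartiles = selected_quantiles(4, first=True, last=True)
```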
We also summarised a few implementation details:
- The activity needs to be available in GLOBAL, LOCAL and GROUPED variants.
- The code generator should derive UNSTABLE if no non-sort fields are returned.
- Flags to indicate if a score/range is required.
- Flag to indicate if a transform is required.
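The rule for deriving UNSTABLE can be written as a simple set comparison. This Python sketch only models the rule as stated above; the function name and the representation of fields as strings are hypothetical, and the real code generator works on ECL expressions rather than field names.

```python
def can_derive_unstable(returned_fields, sort_fields):
    """True when every returned field is also part of the sort order."""
    # If no non-sort fields are returned, rows that compare equal are
    # indistinguishable in the output, so their relative order cannot matter.
    return set(returned_fields) <= set(sort_fields)


# All returned fields are sort fields: UNSTABLE can be derived.
safe = can_derive_unstable(["surname", "age"], ["surname", "age"])

# "id" is returned but not sorted on: stability is observable.
unsafe = can_derive_unstable(["surname", "age", "id"], ["surname", "age"])
```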
Finally, deciding on the name of the activity took almost as long as designing it!
The end result of this process was summarised in a JIRA issue: https://hpccsystems.atlassian.net/browse/HPCC-12267, which contains details of the desired syntax and semantics. It also contains some details of the next blog topic - test cases.
Incidentally, a question that arose from the design discussion was "What ECL can we use if we want to annotate a dataset with partition points?". Ideally the user needs a join activity which walks through a table of rows, and matches against the first row that contains key values less than or equal to the values in the search row. There are other situations where that operation would also be useful. Our conclusion was that the system does not have a simple way to achieve that, and that it was a deficiency in the current system, so another JIRA was created (see https://hpccsystems.atlassian.net/browse/HPCC-13016). This is often how the design discussions proceed, with discussions in one area leading to new ideas in another. Similarly we concluded it would be useful to distribute rows in a dataset based on a partition (see https://hpccsystems.atlassian.net/browse/HPCC-13260).
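To show the kind of lookup that annotation needs, here is a small Python sketch using a binary search over sorted partition points. It is not ECL and not a proposal for the JIRA issues above. The function name is hypothetical, and numbering a row by the count of partition points less than or equal to its key is an assumption for the example.

```python
from bisect import bisect_right


def annotate_with_partition(values, split_points):
    """Pair each value with the number of split points that are <= the value."""
    # split_points must already be sorted in ascending order.
    return [(value, bisect_right(split_points, value)) for value in values]


# Three partition points divide the key space into partitions 0 to 3.
annotated = annotate_with_partition([5, 10, 25, 30, 99], [10, 20, 30])
```

The same partition number could then drive a distribution of rows across nodes, which is the idea behind the second issue mentioned above.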