Quantile 3 – The parser
The first stage in implementing QUANTILE will be to add it to the parser. This can sometimes highlight issues with the syntax and cause revisions to the design. In this case there were two technical issues integrating the syntax into the grammar. (If you are not interested in shift/reduce conflicts you may want to skip a few paragraphs and jump to the walkthrough of the changes.)
Originally, the optional transform was specified inside an attribute, e.g., something like OUTPUT(transform). However, this was not very consistent with the way that other transforms were implemented, so the syntax was updated so it became an optional transform following the partition field list.
When the syntax was added to the grammar we hit another problem: Currently, a single production (sortList) in the grammar is used for matching sort orders. As well as accepting fields from a dataset the sort order production has been extended to accept any named attribute that can follow a sort order (e.g., LOCAL). This is because (with one token lookahead) it is ambiguous where the sort order finishes and the list of attributes begins. Trying to include transforms in those productions revealed other problems:
- If a production has a sortList followed by a transform (or attribute) then it introduces a shift/reduce error on ‘,’. To avoid the ambiguity all trailing attributes or values need to be included in the sortList.
- Including a transform production in the sortList elements causes problems with other transform disambiguation (e.g., DATASET[x] and AGGREGATE).
- We could require an attribute around the transform e.g., OUTPUT(transform), but that does not really fit in with other activities in the language.
- We could change the parameter order, e.g., move the transform earlier, but that would make the syntax counter-intuitive.
- We could require { } around the list – but this is inconsistent with some of the other sort orders.
In order to make some progress I elected to choose the last option and require the sort order to be included in curly braces. There are already a couple of activities – subsort and a form of atmost that similarly require them (and if redesigning ECL from scratch I would be tempted to require them everywhere). The final syntax is something that will need further discussion as part of the review of the pull request though, and may need to be revisited.
Having decided how to solve the ambiguities in the grammar, the following is a walkthrough of the changes that were made as part of commit hpcc-systems/HPCC-Platform@09146b9.
no_quantile in hqlexpr.hpp
The ECL query is represented by a graph of “expression” nodes – each has a “kind” that comes from the enumeration _node_operator. The first requirement is to add a new enumeration to represent the new activity – in this case we elected to reuse an unused placeholder. (These placeholders correspond to some old operators that are no longer supported. They have not been removed because the other elements in the enumeration need to keep the same values since they are used for calculating derived persistent values e.g., the hashes for persists.)
New attribute names in hqlatoms.
The quantile activity introduces some new attribute names that have not been used before. All names are represented in an atom table, so the code in hqlatoms.hpp/cpp is updated to define the new atoms.
Properties of no_quantile
There are various places that need to be updated to allow the system to know about the properties of the new operator:
hqlattr
This contains code to calculate derived attributes. The first entry in the case statement is currently unused (the function should be removed). The second, inside calcRowInformation(), is used to predict how many rows are generated by this activity. This information is percolated through the graph and is used for optimizations, and input counts can be used to select the best implementation for a particular activity.
hqlexpr
Most changes are relatively simple including the text for the operator, whether it is constant, and the number of dataset arguments it has. One key function is getChildDatasetType() that indicates the kind of dataset arguments the operator has, which in turn controls how LEFT/RIGHT are interpreted. In this case some of the activity arguments (e.g., the number of quantiles) implicitly use fields within the parent dataset, and the transform uses LEFT, so the operator returns childdataset_datasetleft.
hqlir
This entry is used for generating an intermediate representation of the graph. This can be useful for debugging issues. (Running eclcc with the logging options “–logfile xxx” and “–logdetail 999” will include details of the expression tree at each point in the code generation process in the log file. Also defining -ftraceIR will output the graphs in the IR format. Finally calling “p EclIR::dump_ir()” in a gdb console can be very informative.)
hqlfold
This is the constant folder. At the moment the only change is to ensure that fields that are assigned constants within the transform are processed correctly. Future work could add code to optimize quantile applied to an empty dataset, or selecting 1 division.
hqlmeta
Similar to the functions in hqlattr that calculate derived attributes, these functions are used to calculate how the rows coming out of an activity are sorted, grouped and distributed. It is vital to only preserve information that is guaranteed to be true – otherwise invalid optimizations might be performed on the rest of the expression tree.
reservedwords.cpp
A new entry indicating which category the keyword belongs to.
Finally we have the changes to the parser to recognise the new syntax:
hqllex.l
This file contains the lexer that breaks the ecl file into tokens. There are two new tokens – QUANTILE and SCORE.
Hqlgram.y
This file contains the grammar that matches the language. There are two productions – one that matches the version of QUANTILE with a transform and one without. (Two productions are used instead of an optional transform to avoid shift/reduce errors.)
hqlgram2.cpp
This contains the bulk of the code that is executed by the productions in the grammar. Changes here include new entries added to a case statement to get the text for the new tokens, and a new entry in the simplify() call. This helps reduce the number of valid tokens that could follow when reporting a syntax error.
Looking back over those changes, one reflection is that there are lots of different places that need to be changed. How does a programmer know which functions need to change, and what happens if some are missed? In this example, the locations were found by searching for an activity with a similar syntax e.g., no_soapcall_ds or no_normalize.
It is too easy to miss something, especially for somebody new to the code – although if you do then you will trigger a runtime internal error. It would be much better if the code was refactored so that the bulk of the changes were in one place. (See JIRA https://hpccsystems.atlassian.net/browse/HPCC-13434 that has been added to track improvement of the situation.)
With these changes implemented the examples from the previous pull request now syntax check. The next stage in the process involves thinking through the details of how the activity will be implemented.