Advanced ECL — Writing Power Macros in Python

ECL has all the advantages of a strongly typed language: security, efficiency, static error detection, and so forth. Sometimes, however, we need flexibility that is awkward to implement with strong typing. For such cases, ECL has its own Macro language, which uses compile-time features to dynamically produce ECL code that is then sent to the cluster for processing. Most folks avoid these Macros because they tend to be complex, cryptic, and difficult to master. It turns out that there is another way to produce the same results: compile-time Python embedding. This allows Macros to be written in standard Python, a well-known language with simple, transparent semantics, and opens up a whole new world of advanced ECL programming methods. In this article, we will describe some scenarios where Macros are useful, present a simple boilerplate framework for Python Macros, and demonstrate a few basic examples.

Why Macros?

There are a number of reasons for using Macros in ECL:

  • Type Flexibility
  • Re-usable Functions
  • Hiding complexity from Users

Type flexibility allows processing of data whose type may vary. For example, we may want to accept any dataset as input and then analyze or transform it.

Often we want to write a function that can be used in multiple scenarios, but the rigid typing of ECL only allows a single format for the input and output of the function.

It is also common to want to provide helper functions that let users perform common, intricate operations without duplicating code and without the associated learning and debugging.

A Simple Python Macro Framework

Our approach is to use fixed portions of the ECL Macro language to surround a Python function that does the heavy lifting of producing an arbitrarily complex set of ECL code. Here is how it looks:

// A macro that takes any dataset in, and produces some output data. Additional parameters
// may be added.
EXPORT myMacro(dsIn, dsOut):=MACRO
    // Import Python for use in the macro
    IMPORT Python3 AS Python;

    // The following two lines use ECL Macro commands to produce an
    // xml string describing the input dataset's format.
    // These shouldn't need to change as long as your input is
    // a dataset.
    #UNIQUENAME(format); %format% := RECORDOF(dsIn);
    #DECLARE(xstr); #EXPORT(xstr,%format%);

    // Python function uses the 'fold' keyword to cause it to execute at compile
    // time. It can be named anything but should take at least 3 arguments:
    // the input ds name, the output attribute name, and the xml format of the input ds.
    STRING pyFunc(STRING dsname, STRING outname, STRING recxml) := EMBED(Python: fold)
        # …
        # Format an output string containing one or more lines of ECL code here.
        # …
        outStr = …
        # Return the outStr containing ECL Code
        return outStr
    ENDEMBED;

    // We call the Python function and use #EXPAND to render the resulting ECL commands inline.
    // Change pyFunc to match the name of your Python function, and add any additional
    // arguments as needed. Retain the signature of the first three arguments.
    #EXPAND(pyFunc(#TEXT(dsIn), #TEXT(dsOut), %'xstr'%));

ENDMACRO;

To use this Macro, we call it like a function, except that it doesn’t return anything. Instead, it places its output in the attribute named by dsOut. Though the output is typically another dataset, it can really be any data type.

// Assume I have a dataset with the name myDS
myMacro(myDS, result);
OUTPUT(result);

The output is the result of executing the ECL code emitted by the Python function.
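
Because the Python function only builds and returns a string, its logic can be previewed in an ordinary Python session before it is embedded in the Macro. The following is a minimal sketch rather than part of the framework itself: the name py_func and the generated COUNT statement are hypothetical choices made for illustration.

```python
def py_func(dsname, outname, recxml):
    """Hypothetical generator: emit ECL that counts the records in dsname."""
    # recxml is accepted to keep the three-argument signature, but unused here.
    return "{out} := COUNT({ds});\n".format(out=outname, ds=dsname)


# Preview the ECL text that #EXPAND would render inline.
generated = py_func("myDS", "result", "<Data/>")
print(generated)
```

Printing the string this way makes it easy to check the generated ECL by eye before paying the cost of a compile on the cluster.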

Python Macro Examples

Now let’s look at some examples of Python code that do different things. Note that these examples show only the Python function. Each should be nested between the Macro commands as shown above.

Example 1

In this example, we’ll write ECL that processes the input dataset and returns an output dataset of a different format. Specifically, we will add a sequential ‘id’ field to whatever dataset we got as input. This illustrates outputting multiple lines of ECL code. Don’t forget the ‘fold’ keyword.

STRING pyFunc(STRING dsname, STRING outname, STRING recxml) := EMBED(Python: fold)
    # First we'll output the format, then the dataset.  We ignore the incoming
    # XML format in this example.

    # Format the ECL you want to execute.  This can contain
    # any number of lines and ECL commands.
    # Any created attributes should contain the outname prefix
    # so that there is no conflict with other calls.

    # First define the new format.  Use outname_format for the attribute name.
    # Note the use of Python """ to delimit multi-line strings.
    # Python's str.format() replaces curly-brace-enclosed tokens,
    # e.g. {token}, with actual values.  To insert a literal curly brace,
    # such as in the third line below, we double the braces (i.e. {{ or }}).
    formatStr = """{out}_format := RECORD
      UNSIGNED id;
      {{RECORDOF({dsname})}};
    END;
        """.format(out=outname, dsname=dsname)

    # Now we'll use PROJECT to add the id field.
    dsStr = """{out}:= PROJECT({dsname}, TRANSFORM({out}_format,
                            SELF.id := COUNTER,
                            SELF := LEFT));
        """.format(out=outname, dsname=dsname)
    # Combine the two commands.
    outStr = formatStr + dsStr

    # And return the resulting ECL.
    return outStr
ENDEMBED;
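
The brace doubling is the part of this example that is easiest to get wrong, so it can help to render the template on its own. This is a small sketch run outside ECL; the values "result" and "myDS" are placeholder names, not anything required by the framework.

```python
# Template for the record definition, using the same doubled braces as Example 1.
template = (
    "{out}_format := RECORD\n"
    "  UNSIGNED id;\n"
    "  {{RECORDOF({dsname})}};\n"
    "END;\n"
)

# str.format() fills {out} and {dsname} and collapses {{ and }} to single braces.
rendered = template.format(out="result", dsname="myDS")
print(rendered)
```

Forgetting to double a brace typically shows up immediately as a KeyError or ValueError from str.format(), which is much easier to diagnose here than inside a compile-time fold.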

Example 2

Output the names of the fields in the input dataset as a SET OF STRING. This example illustrates how to process the XML metadata describing the input dataset.

STRING pyFunc(STRING dsname, STRING outname, STRING recxml) := EMBED(Python: fold)
    # Import python module to parse the XML
    import xml.etree.ElementTree as ET
    
    # Find the root element of the xml.  
    root = ET.fromstring(recxml)
    fieldNames = []

    # When we iterate over the root, we get each field.
    for field in root:
        # Extract the label and type as attributes from the XML.
        attribs = field.attrib
        fname = attribs['label']
        ftype = attribs['type']
        # For this demo, we are just outputting the labels
        fieldNames.append(fname)

    # Format the ECL you want to execute. In this case, we
    # are just setting a SET OF STRING attribute.
    # Outputs: <outname> := ['field1', 'field2', ...];
    outStr = """{out} := {fields};
        """.format(out=outname, fields=str(fieldNames))
    return outStr
ENDEMBED;
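
To see what the parsing step does, the same logic can be run against a hand-written XML sample. The sample below is an assumption made for illustration: it carries only the label and type attributes that the example reads, on children of a single root element, and the XML actually produced by #EXPORT may contain further attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the record description passed in as recxml.
sample_xml = """<Data>
  <Field label="id" type="unsigned"/>
  <Field label="name" type="string"/>
  <Field label="balance" type="real"/>
</Data>"""

root = ET.fromstring(sample_xml)

# Iterating over the root yields one element per field.
field_names = [field.attrib['label'] for field in root]

# str() of a Python list of plain strings happens to match ECL set syntax.
ecl = "{out} := {fields};\n".format(out="result", fields=str(field_names))
print(ecl)
```

Relying on str() for the list works because these labels contain no quote characters. A more defensive version would build the quoted list explicitly.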

Example 3

Similar to Example 2, except we will define a format containing the Name and Type of each field, and output a dataset with that format.

STRING pyFunc(STRING dsname, STRING outname, STRING recxml) := EMBED(Python: fold)
    # XML processing is the same as in the previous example, except we will use both the
    # label and type attributes.
    import xml.etree.ElementTree as ET
    root = ET.fromstring(recxml)
    fieldTuples = []
    # Iterate over the fields in the xml.
    for field in root:
        # Extract the label and type from the XML
        attribs = field.attrib
        fname = attribs['label']
        ftype = attribs['type']
        # For this demo, we store the name and type as a tuple
        fieldTuples.append((fname, ftype))

    # First we'll output the format, then the dataset.
    formatStr = """{out}_format := RECORD
      STRING fieldName;
      STRING fieldType;
    END;
        """.format(out=outname)

    # Make list of strings "{'fname', 'ftype'}" for each field.  Don't forget the
    # single quotes to make the ECL strings.  Recall that double curly braces are
    # just a literal for single curly brace.
    dsContents = ["{{'{name}', '{type}'}}".format(name=t[0], type=t[1]) for t in fieldTuples]

    # Now we'll create an inline dataset using the contents from above.
    # outname := DATASET([{fname1, ftype1},{fname2, ftype2},...], outname_format);
    dsStr = """{out}:= DATASET([{recs}], {out}_format);
        """.format(recs=','.join(dsContents), out=outname)

    # Now we'll return the two ECL commands
    outStr = formatStr + dsStr
    return outStr
ENDEMBED;
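
The row-building step can also be checked in isolation. The sketch below uses a hypothetical pair of fields in place of the parsed XML and shows the inline DATASET statement that results.

```python
# Hypothetical (name, type) pairs standing in for the parsed field list.
field_tuples = [("id", "unsigned"), ("name", "string")]

# Each tuple becomes one ECL inline record: {'name', 'type'}.
rows = ["{{'{name}', '{type}'}}".format(name=n, type=t) for n, t in field_tuples]

# Join the rows into a single inline DATASET definition.
ecl = "{out} := DATASET([{recs}], {out}_format);\n".format(
    recs=','.join(rows), out="result")
print(ecl)
```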

Example 4

This is a more sophisticated example with real-world application. We will transform any record-oriented dataset into a cell-oriented dataset, where each cell is identified by a recordId and a fieldId and carries a numerical value and a text value. Any given field will use one value or the other, and non-blank text values override the numeric value. This gives us a cell-oriented format that can store any field values.

We assume the first field is an id field, unless an id field is explicitly passed using the idfld parameter.

Furthermore, the ‘dataflds’ argument, a comma-separated list of field names, restricts which fields of the input dataset are included in the output.

In addition to transforming the dataset, we also produce a second output variable: outname_fields, that contains an ordered list of the field names.

Note that this macro has two extra string parameters. Those will need to be added to the MACRO definition, and the call to pyFunc at the bottom of the Macro.

STRING pyFunc(STRING dsname, STRING outname, STRING recxml, STRING idfld='', STRING dataflds='') := EMBED(Python: fold)
    import xml.etree.ElementTree as ET
    values = []
    textVals = []

    # Parse the XML description of the data
    root = ET.fromstring(recxml)
    fnum = 0

    # Id field to use
    foundIdField = ''

    # Clean up the designated id field if present.
    idfld = idfld.strip().lower()
    validFields = []

    # Turn the data fields input into a clean list
    if dataflds:
        validFields = dataflds.split(',')
        validFields = [field.strip().lower() for field in validFields]
    fieldOrder = []

    # Iterate over the fields in the xml.
    for field in root:
        attribs = field.attrib
        # Use this field if no datafields specified, or this field in list.
        if not validFields or attribs['label'].lower() in validFields:
            ftype = attribs['type']
            # Handle numeric and textual fields
            if ftype in ['unsigned','integer','real','decimal','udecimal']:
                # Numeric.  Add a real numeric value, and a null string value
                values.append('(REAL8)LEFT.' + attribs['label'])
                textVals.append("''")
            else:
                # Text value.  Add a real text value and a null (i.e. 0) numeric value.
                values.append('0')
                textVals.append('(STRING)LEFT.' + attribs['label'])
            # Add this field to the fieldOrder list.
            fieldOrder.append(attribs['label'])
            fnum += 1
        # Use the first field as the id field, unless another field was specified.
        if (idfld == '' and not foundIdField) or attribs['label'].lower() == idfld:
            foundIdField = attribs['label']
    # Define the format
    formatStr = """{out}_format := RECORD
            UNSIGNED recordId;
            UNSIGNED fieldId;
            REAL numValue;
            STRING strValue;
        END;
        """.format(out=outname)
    # Use NORMALIZE to produce one output record per field of each input record.
    # Note that the CHOOSE(COUNTER, val, val,...) causes the appropriate field value to be
    # extracted for each field number.
    valStr = "CHOOSE(COUNTER, {values})".format(values=','.join(values))
    textValStr = "CHOOSE(COUNTER, {values})".format(values=','.join(textVals))

    dsStr = """{out} := NORMALIZE({dsname}, {numFields}, TRANSFORM({out}_format,
                                          SELF.recordId := LEFT.{idField},
                                          SELF.fieldId := COUNTER,
                                          SELF.numValue := {valStr},
                                          SELF.strValue := {textValStr}));
        """.format(out=outname, dsname=dsname, numFields=str(len(values)),
                    idField=foundIdField, valStr=valStr, textValStr=textValStr)
    # Add the outname_fields attribute with the ordered set of fields.
    fieldsStr = """{out}_fields := {fields};
        """.format(out=outname, fields=str(fieldOrder))

    # Combine the three commands and return them.
    outStr = formatStr + dsStr + fieldsStr
    return outStr
ENDEMBED;
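
To make the target layout concrete, here is a pure-Python sketch of the reshaping that the generated NORMALIZE and CHOOSE code is meant to perform. It is an illustration of the intended result, not the ECL itself: the function to_cells, the sample record, and the field list are all hypothetical.

```python
# Type names treated as numeric, mirroring the list used in Example 4.
NUMERIC_TYPES = {'unsigned', 'integer', 'real', 'decimal', 'udecimal'}


def to_cells(records, field_types, id_field):
    """Turn record-oriented rows into (recordId, fieldId, numValue, strValue) cells."""
    cells = []
    for rec in records:
        # fieldId starts at 1, like COUNTER inside NORMALIZE.
        for field_id, (name, ftype) in enumerate(field_types, start=1):
            if ftype in NUMERIC_TYPES:
                num_value, str_value = float(rec[name]), ''
            else:
                num_value, str_value = 0.0, str(rec[name])
            cells.append((rec[id_field], field_id, num_value, str_value))
    return cells


records = [{'id': 1, 'name': 'Ann', 'balance': 10.5}]
field_types = [('id', 'unsigned'), ('name', 'string'), ('balance', 'real')]
cells = to_cells(records, field_types, 'id')
for cell in cells:
    print(cell)
```

Each input record yields one cell per selected field, so a dataset with N records and F fields produces N × F cells.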

Wrap-up

Anyone who has used the ECL Macro language will acknowledge that the Python code is much easier to write, more expressive, and easier to understand and maintain. This approach opens the door to arbitrarily sophisticated macros and makes macro writing accessible to a wider audience. It adds to the power of ECL by reducing the need to duplicate code, improving programmer interfaces, and allowing sophisticated data pre-processing.