

Uses, abuses and internals of the EMBED feature


Thu Nov 13, 2014 1:59 pm

In HPCC Systems release 5.0, we added SQLite and MySQL to the list of supported languages, which already included C++, Java, Python, JavaScript and R. We have also extended the EMBED feature to include streaming.

Getting started with the basics
Let's look at how to use the EMBED syntax, starting with the basics. The first thing to do is IMPORT the plugin for the language you want to embed. To use EMBED, declare a function (typically with parameters) and then, for the body of the function where the ECL would normally go, type the code you want to embed between EMBED and ENDEMBED.

The following code illustrates a simple call to Python using the EMBED syntax. It calls the split function on the string passed to it and returns a list, which ECL receives as a SET OF STRING, so ‘Once upon a time’ is output as separate strings:

Code:
IMPORT python;
SET OF STRING split(STRING text) := EMBED(python)
  return text.split()
ENDEMBED;   
split('Once upon a time');
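
The embedded body is ordinary Python, so it can be tried in a plain interpreter before it goes into ECL. A minimal sketch of that check follows; the standalone function definition is only for illustration, since in ECL the plugin supplies the wrapper:

```python
def split(text):
    # Same body as the embedded function: split on runs of whitespace.
    return text.split()

words = split('Once upon a time')
print(words)  # ['Once', 'upon', 'a', 'time']
```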

Using IMPORT is similar to embedding, but there is no EMBED body: the IMPORT statement replaces it, and you simply give the name of the external function you want to call. In the following example, ex2 is the name of the module and tag is the name of the function in that module to be called. Note that the IMPORT keyword here should not be confused with the use of IMPORT to import other ECL modules; while their purposes are loosely related, the syntax and usage are completely different.

In this example we are passing in a string and returning a dataset rather than a list:

Code:
IMPORT python;
r := RECORD
  STRING word;
  UTF8 tags;
END;
DATASET(R) tag(STRING text) := IMPORT(python, './ex2.tag');
tag('Once upon a time there was a boy called Richard');

It calls the following Python code (in ex2.py), which imports the Natural Language Toolkit (NLTK, a natural language processing library written in Python) to assign a grammar tag to each word of the sentence shown above:

Code:
import nltk
tokenizer = None
tagger = None
def init_nltk():
    global tokenizer, tagger
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
    tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
def tag(text):
    global tokenizer, tagger
    if not tokenizer:
        init_nltk()
    tokenized = tokenizer.tokenize(text)
    return tagger.tag(tokenized)

The result lists the sentence word by word, with the grammar tag assigned to each word by the Natural Language Toolkit:

Code:
Once,RB
upon,IN
a,AT
time,NN
there,EX
was,BEDZ
a,AT
boy,NN
called,VBN
Richard,NP
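
To see the shape of the data without installing NLTK, here is a rough sketch using only the standard library. It reuses the regular expression from ex2.py for the tokenizing step and shows how a list of (word, tag) tuples lines up with the two fields of the ECL record. The helper names tokenize and to_rows are made up for this illustration, and the tags are copied from the output above rather than computed:

```python
import re

# Same pattern that ex2.py passes to RegexpTokenizer: runs of word
# characters, or runs of characters that are neither word chars nor spaces.
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]+')

def tokenize(text):
    # Hypothetical stand-in for the NLTK tokenizer step.
    return TOKEN_PATTERN.findall(text)

def to_rows(tagged):
    # Each (word, tag) tuple corresponds to one row with fields word, tags.
    return [{'word': w, 'tags': t} for w, t in tagged]

tokens = tokenize('Once upon a time, there was a boy.')
rows = to_rows([('Once', 'RB'), ('upon', 'IN')])
print(tokens)
print(rows)
```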

Whether ECL supports EMBED, IMPORT, or both depends on the target language. Python, for example, supports both, but most target languages support only one or the other.

Now let’s look at calling Java. Java is slightly harder to call because ECL needs to be told both the name of the function and its signature. This is because, unlike Python, Java has function overloading. The rules for writing Java function signatures are not especially complex, but the simplest way to determine them is to use the javap tool (part of the standard Java toolset). Here is an example showing how to get the signature of a Java function, including the types of its parameters and result.

Given the following Java code:

Code:
import java.util.*;
public class JavaCat
{
  public static String cat(String a, String b)
  {
    return a + b;
  }
}

Compiled using the following command:

Code:
$ javac JavaCat.java

You can use javap to report the signatures:

Code:
$ javap -s JavaCat
Compiled from "JavaCat.java"
public class JavaCat {
public static java.lang.String cat(java.lang.String, java.lang.String);
    Signature: (Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
}

You then simply find the signature you want, paste it into your ECL code and call it:

Code:
IMPORT java;
STRING jcat(STRING a, STRING b) :=
   IMPORT(java,
          'JavaCat.cat:(Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;' : classpath('/opt/HPCCSystems/classes'));

jcat('Hello ', 'world!');

The second IMPORT gives the name of the function, followed by a colon and its signature. You can add a further colon followed by attributes to pass additional information to the plugin; here it is used to set the Java classpath. This example calls a simple Java function which concatenates two strings and returns the result.
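
If you want to understand what javap printed, the signature format is the standard JVM type descriptor notation. The following sketch builds a descriptor from type names; the function names descriptor and signature are invented for this illustration, it does not handle generics or nested classes, and javap remains the reliable way to get the real signature:

```python
# JVM descriptor codes for primitive types and void.
PRIMITIVES = {
    'boolean': 'Z', 'byte': 'B', 'char': 'C', 'short': 'S',
    'int': 'I', 'long': 'J', 'float': 'F', 'double': 'D', 'void': 'V',
}

def descriptor(type_name):
    # Arrays are written with one leading '[' per dimension.
    if type_name.endswith('[]'):
        return '[' + descriptor(type_name[:-2])
    if type_name in PRIMITIVES:
        return PRIMITIVES[type_name]
    # Class types: L<fully qualified name with slashes>;
    return 'L' + type_name.replace('.', '/') + ';'

def signature(param_types, return_type):
    params = ''.join(descriptor(p) for p in param_types)
    return '(' + params + ')' + descriptor(return_type)

sig = signature(['java.lang.String', 'java.lang.String'], 'java.lang.String')
print(sig)  # (Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
```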

Getting the most out of using EMBED as an advanced user
So now you’ve seen basic usage of the EMBED syntax, but there is much more you can do as an advanced user depending on the language you are using.

Passing/returning records and datasets is significantly improved in HPCC Systems 5.0. Previously, in release 4.x you could only pass/return scalars and SETs. If you wanted to return a list or to pass in a large quantity of data, you had to turn it into a scalar first which could be awkward and inefficient. Using 5.0, you can also pass and return records and datasets. The details of exactly how datasets and records in ECL correspond to the target language features vary for each target. See the table below:

[Table image: Records and Datasets.png]

A note about Python generators...

Using a generator in Python is effectively the same as using lazy evaluation. For example, if you pass a dataset with a billion rows to some embedded Python code designed to return the 3rd field of the 1st record, only the first record is evaluated and the rest are not fetched at all. This is similar to using CHOOSEN to take a single record in ECL. You can also return a generator from Python to take advantage of the same lazy evaluation, or simply return a standard Python list if lazy evaluation is not important.
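
The effect can be seen in plain Python. In this sketch the generator rows stands in for a streamed input dataset (it is an assumption for illustration, not how the plugin feeds data), and the fetched list records which rows were actually produced:

```python
def rows(limit, fetched):
    # Stand-in for a streamed input dataset; 'fetched' records which
    # rows were actually produced.
    for i in range(limit):
        fetched.append(i)
        yield (i, 'name%d' % i, i * 10 + 7)

def third_field_of_first(dataset):
    # Asks the input for one record only.
    first = next(iter(dataset))
    return first[2]

fetched = []
value = third_field_of_first(rows(1_000_000_000, fetched))
print(value, fetched)  # 7 [0]
```

Although the input is nominally a billion rows long, only row 0 is ever generated.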

New in 5.0 – Streaming data
Users of HPCC Systems 5.x can take advantage of the ability to stream data. However, be aware that:

  • Some languages may stream better than others. For example, the API into R does not allow data to be streamed in and out. While you can still pass datasets (data frames), there is no support for nested datasets or lazy evaluation. The situation is similar for JavaScript, which also does not support streaming of iterators in and out.
  • You may have to do some work in the embedded code to ensure that it streams efficiently. So, if you want your data to stream efficiently out of Python, you need to return a generator, since returning a list will cause all records to be pre-evaluated. In Java you would want to return an iterator, since returning an array will not enable you to use lazy evaluation.
  • Efficient streaming means that data is pulled and calculated on demand.

Since streaming varies between languages, use the following table to guide you:

[Table image: Streaming Data.png]
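
The difference between returning a list and returning a generator can be sketched in plain Python. Both functions below are illustrative names, and islice plays the part of a consumer that only wants the first two results; the log lists show how many inputs each version evaluated:

```python
from itertools import islice

def squares_list(values, log):
    # Eager: every input is evaluated before anything is returned.
    result = []
    for v in values:
        log.append(v)
        result.append(v * v)
    return result

def squares_gen(values, log):
    # Lazy: an input is evaluated only when the consumer asks for it.
    for v in values:
        log.append(v)
        yield v * v

eager_log, lazy_log = [], []
eager_first = list(islice(squares_list(range(5), eager_log), 2))
lazy_first = list(islice(squares_gen(range(5), lazy_log), 2))
print(eager_first, eager_log)  # [0, 1] [0, 1, 2, 3, 4]
print(lazy_first, lazy_log)    # [0, 1] [0, 1]
```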

Using Transforms
Transforms can also be used with the EMBED syntax, bearing in mind that:

  • A function that returns a record can be used where a transform is expected. Your function needs to return a record of the correct type, that is, the same type an ECL-coded transform would return. You can then use that embedded function where the transform was expected. This also applies to C++ and can be useful where you can’t code your transform in ECL.
  • Typically such a function will also take one or more record parameters, but may take other parameters as well or instead. This depends on the ECL command for which you are providing the transform.
  • Normal rules for mapping record fields and parameters into the embedded language apply.

Let’s look at an example of advanced use of the EMBED syntax. The following example creates a “keyed join” to MySQL. The incoming data, represented by indata, is passed to the MySQLJoin function, which performs a SELECT on a MySQL table. The PROJECT calls the embedded function once for every record in indata, passing that record in, and each call selects and returns one record from the table:

Code:
IMPORT MySQL;
stringrec := RECORD
   string name;
END;
sqlrec := RECORD
   string ssn;
   string address;
END;
RECORD(sqlrec) MySQLJoin(RECORD(stringrec) inrec) := EMBED(mysql)
  SELECT * from tbl1 where name = ? LIMIT 1;
ENDEMBED;
PROJECT(indata, MySQLJoin(LEFT));

A better way of achieving the same thing is to pass the whole dataset in and let the embedded plugin call the SELECT for every record in the input. This is better because the calls can be batched into a single SQL transaction, which may be useful for the update case and cannot be done using PROJECT.

Also, the previous example returns a single record, so it only works for a one-to-one join. In the following example, a given input record may return more than one match on the output side, so it is a one-to-many left inner join:

Code:
IMPORT MySQL;
stringrec := RECORD
   string name;
END;
sqlrec := RECORD
   string ssn;
   string address;
END;
DATASET(sqlrec) MySQLJoin(dataset(stringrec) inrecs) := EMBED(mysql)
  SELECT * from tbl1 where name = ?;
ENDEMBED;
MySQLJoin(indata);
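
To make the difference between the two joins concrete outside HPCC, here is a rough sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table contents, the ORDER BY clauses (added only to make the output deterministic) and the helper names join_one and join_many are all assumptions for this illustration; it does not show how the HPCC plugin itself runs the query:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tbl1 (name TEXT, ssn TEXT, address TEXT)')
conn.executemany('INSERT INTO tbl1 VALUES (?, ?, ?)', [
    ('alice', '111', '1 High St'),
    ('alice', '222', '2 Low Rd'),
    ('bob', '333', '3 Mid Ave'),
])
conn.commit()

indata = ['alice', 'bob', 'carol']

def join_one(name):
    # Counterpart of the first example: at most one row per input record
    # (None here when nothing matches).
    return conn.execute(
        'SELECT ssn, address FROM tbl1 WHERE name = ? ORDER BY ssn LIMIT 1',
        (name,)).fetchone()

def join_many(names):
    # Counterpart of the second example: the whole input is handled in
    # one call and every matching row is yielded.
    for name in names:
        query = 'SELECT ssn, address FROM tbl1 WHERE name = ? ORDER BY ssn'
        for row in conn.execute(query, (name,)):
            yield row

one_to_one = [join_one(n) for n in indata]
one_to_many = list(join_many(indata))
print(one_to_one)
print(one_to_many)
```

In the per-record version 'alice' yields only her first row, while the whole-dataset version returns both of her rows and nothing for 'carol'.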

So now you know how to use it, but before you do, be aware of the wider implications of using the EMBED syntax, as there are a number of points to consider:

[Image: Implications to Consider.png]

What not to do…
There is quite an overhead in calling EMBED, so use it to do something significant and not something you could have done with a couple of lines of ECL. In other words:

  • Don't be tempted to use EMBED in place of learning ECL.
  • Don't be tempted to use EMBED for trivial functions.

The only exception to this is C++. Since ECL is translated into C++, there is no overhead in calling C++, and there are some things that are more efficiently done in C++. String manipulation, for example, is often easier to code using a procedural paradigm.

Go ahead - Contribute to HPCC Systems by implementing a new embedded language plugin
The embedded language features are implemented as plugins, which makes it easier to add new ones. A plugin uses ECL record metadata to walk the records in datasets, and each function call involves:

  • Creating a context
  • Binding each parameter
  • Invoking the function
  • Retrieving the result

If there is a language you want to use that is not currently implemented, the answer is simple: write an embedded language plugin yourself and check it in. We will review it and include it in the next major release.
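
The four steps of a function call can be pictured with a small Python sketch. The FunctionContext class and its method names are entirely hypothetical and are not the HPCC Systems plugin interface; the sketch only shows the order in which the steps happen:

```python
class FunctionContext:
    """Hypothetical context for one call; not the HPCC plugin interface."""

    def __init__(self, func):
        self.func = func
        self.params = {}
        self.result = None

    def bind_param(self, name, value):
        # Store one argument, converted (in a real plugin) from its ECL type.
        self.params[name] = value

    def call(self):
        # Invoke the target function with everything bound so far.
        self.result = self.func(**self.params)

    def get_result(self):
        # Hand the result back (a real plugin would convert it to ECL).
        return self.result

def cat(a, b):
    return a + b

ctx = FunctionContext(cat)       # 1. create a context
ctx.bind_param('a', 'Hello ')    # 2. bind each parameter
ctx.bind_param('b', 'world!')
ctx.call()                       # 3. invoke the function
greeting = ctx.get_result()      # 4. retrieve the result
print(greeting)  # Hello world!
```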

This information is based on a presentation given by Richard Chapman, Vice President of Research and Development, recorded at the 2014 HPCC Systems Engineering Summit. The full recording of this presentation is available on YouTube: https://www.youtube.com/watch?v=ESXMcrNiXhQ&list=UUmySfVDlEUzlIiIdDc7oQbQ
LAChapman
           

Mon Mar 06, 2017 1:08 pm

This is very informative, but the attached video will not open. Can anyone give more details about this discussion, such as how to create a context and invoke a Python function?

Thanks in advance.
nawazkhan
           