Embedded Java enhancements in HPCC Systems 7.2

Although Java was the first language that we created a plugin for (not counting embedded C++ which didn’t need a plugin), in some ways the support for using Java with ECL has been a bit more limited than for some other languages, particularly Python. However, in the upcoming 7.2 release of the platform, that’s going to change.

Inline Java code

One of the most obvious new features makes it far simpler to call simple Java code from ECL. Now you can use EMBED as well as IMPORT, so your Java code can be placed right in your ECL file, and the relevant classes will be compiled, distributed within the workunit and loaded on demand:

  • No more manual compilations of the Java, extraction of Java signatures, ugly IMPORT statements, or working out where to copy the required .class or .jar files on your HPCC Systems clusters.
  • No more having to ask operations to deploy a new class file for you, in order to be able to test your change. 
  • No more concerns about whether the right version of the class is present for this particular ECL code, when other ECL code needs a different version.

By putting the Java code right in the ECL, and embedding the corresponding classes in the workunit, all of the above concerns disappear.

Example:

IMPORT Java;
string cat(string a, string b) := EMBED(java)
  public static String cat(String a, String b)
  {
    return a + b;
  }
ENDEMBED;

OUTPUT(cat('Hello ', 'Java'));

In the above example, just a single method is provided. The ECL compiler will add the required “boilerplate” around it to make it into compilable Java code. You should also note that (unlike when embedding C++) the Java function header itself MUST be provided and the name must match the name of the ECL embed expression (this is checked at compile time). You should also ensure that the parameters are compatible. If they are not, you will get an error, but not until runtime.

If you prefer you can supply the full class definition instead:

IMPORT Java; 
SET OF STRING split(STRING s) := EMBED(Java) 

import java.util.StringTokenizer; 
import java.util.ArrayList; 

public class myClass // Any name can be used 
{ 
  public static String[] split(String inStr) 
  { 
    ArrayList<String> result = new ArrayList<String>(); 
    StringTokenizer token = new StringTokenizer(inStr); 
    while (token.hasMoreTokens()) 
    { 
      result.add(token.nextToken()); 
    }
    return result.toArray(new String[0]);
  }
} 
ENDEMBED; 

OUTPUT(split('Hello java'));

Note: This form allows you to specify any imports you may need, so is likely to be the preferred form for any non-trivial embedded Java code.

Because the compilation of the Java code is done at the same time the ECL is compiled, syntax errors from the Java compiler will be translated to refer to the appropriate line of the ECL source file.

IMPORT Java; 
STRING cat(STRING a, STRING b) := EMBED(java) 
  public static String cat(String a, String b) 
  { 
    return a + c; 
  }
ENDEMBED;

OUTPUT(cat('Hello ', 'Java'));

Will give the following errors when compiled:

ex3.ecl(5,1): error C2405: cannot find symbol 
ex3.ecl(5,1): error C2405: return a + c; 
ex3.ecl(5,1): error C2405: ^ 
ex3.ecl(5,1): error C2405: symbol: variable c 
ex3.ecl(5,1): error C2405: location: class embed
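The line translation described above can be pictured with a small stand-alone Java sketch. It is not the eclcc implementation: the two-line wrapper, the class name EmbedLineMapper and the method firstErrorEclLine are all hypothetical, chosen only to show how a compiler diagnostic reported against the wrapped source might be mapped back to a line of the ECL file. It uses the standard javax.tools API, so it needs a JDK rather than a bare JRE.

```java
import java.net.URI;
import java.util.Arrays;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class EmbedLineMapper {
    // Hypothetical two-line wrapper; the real boilerplate added by the ECL compiler may differ.
    static final String HEADER = "public class embed\n{\n";
    static final int HEADER_LINES = 2;

    /** Compiles the wrapped body in memory; returns the ECL line of the first error, or -1 if none. */
    static long firstErrorEclLine(String body, int eclLineOfBodyStart) {
        final String source = HEADER + body + "\n}\n";
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///embed.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null when no JDK compiler is present
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        compiler.getTask(null, null, diagnostics,
                Arrays.asList("-proc:none"), null, Arrays.asList(file)).call();
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            if (d.getKind() == Diagnostic.Kind.ERROR) {
                long bodyLine = d.getLineNumber() - HEADER_LINES; // 1-based line inside the embed body
                return eclLineOfBodyStart + bodyLine - 1;         // shift to the ECL file's numbering
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same body as the failing example; in that ECL file the Java starts on line 3.
        String body = "  public static String cat(String a, String b)\n  {\n    return a + c;\n  }";
        System.out.println(firstErrorEclLine(body, 3));
    }
}
```

With the failing body from the example, and assuming the Java text starts on line 3 of the ECL file, the sketch reports line 5, which agrees with the (5,1) position in the error listing.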

Note: It is also possible, though very rarely recommended, to delay compilation until runtime…

IMPORT Java; 

STRING myJavaCode := ''' 
  public static String cat(String a, String b) 
  { 
    return a + b; 
  } 
'''; 

STRING myOtherJavaCode := ''' 
  public static String cat(String a, String b) 
  { 
    return b + a; 
  } 
'''; 

STRING cat(STRING a, STRING b) := EMBED(Java, IF(random()%10>=5, 
myOtherJavaCode, myJavaCode)); 

OUTPUT(cat('Hello ', 'Java'));

Of course, doing so means you won’t get any compile errors reported until run time. It also means the compilation will be repeated every time the query runs which may be very inefficient.

It is also possible to execute the resulting Java at compile time, if the embed is marked with the FOLD attribute and all parameters are constant:

IMPORT java; 

major := 1; 
minor := 0; 
point := 1; 

STRING getVersion(INTEGER a, INTEGER b, INTEGER c) := EMBED(Java: FOLD) 
String getVersion(int a, int b, int c) 
{ 
  return Integer.toString(a) + '.' + Integer.toString(b) + '.' + Integer.toString(c); 
} 
ENDEMBED; 

#if (getVersion(major, minor, point)='1.0.0') 
  OUTPUT('Version is too old');
#else 
  OUTPUT('Version is good'); 
#end

This means you can use Java to generate ECL code:

IMPORT java; 

STRING getCode() := EMBED(Java: FOLD) 
String getCode() 
{ 
  return "OUTPUT('Hello world!');"; 
} 
ENDEMBED; 

#expand (getCode())

Embedding jar files

The second significant feature is the ability to embed jar files in the workunit via the manifest. If your ECL attribute needs a jar file, create a manifest file with the same name as the attribute:

Example:

Given an ECL attribute ex7.ecl:

IMPORT java; 

STRING cat(SET OF STRING s) := IMPORT(java, 
'ex7.cat:([Ljava/lang/String;)Ljava/lang/String;'); 

EXPORT ex7 := cat(['Hello','World']);

And a Java source file ex7.java:

class ex7 
{ 
  public static String cat(String input[]) 
  { 
    StringBuffer ret = new StringBuffer(); 
    for (String item:input) 
    { 
      if (ret.length() > 0) 
        ret.append(" "); 
      ret.append(item); 
    }
    return ret.toString();
  }
}

Compiled into a jar file using:

javac ex7.java
jar cvf ex7.jar ex7.class

And a manifest file ex7.manifest (in the same place as your ECL source file ex7.ecl):

<Manifest>
 <Resource type='jar' filename='ex7.jar'/>
</Manifest>

We can build and run the ECL code locally or submit it to a remote system. The jar file will be included in the workunit automatically and unpacked when needed.

Note that the jar will be included in the workunit only if the corresponding ECL attribute is included, and that it will be unpacked to a temporary location that is only referenced by the Java class loader for this particular workunit. So it is possible for different queries to use different versions of a jar file, and yet have both queries loaded at the same time.

Java signatures

In HPCC Systems 7.0.0 and earlier, when using IMPORT to specify a Java function to be called, it was necessary to provide the class name, method name and signature. The class name and method name should not be an issue; you would expect to know what those are. But the signature was a bit of a pain. You either needed to understand the algorithm used by the Java compiler for creating method signatures, or use the javap utility to dump out the signatures for the methods in a class and paste them into your code.

Ok, the algorithm is not THAT complicated, but it’s still better to let the computer do things that computers are good at and work out the signature itself.

In HPCC Systems 7.2.0, therefore, you do not need to specify the signature UNLESS there is more than one public method with the same name in the class. You can still specify the signature if you want, or if you need to distinguish which method to call out of a set of overloaded methods with the same name.

If you are specifying the signature, you can also add ‘@’ at the front to indicate that a non-static member function is to be called rather than a static one.
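If you do still need to write a signature by hand, for example to pick one of several overloaded methods, Java itself can print it for you. The sketch below is an illustration only: the class SignatureHelper and its descriptorOf method are made-up names, and the cat method is a stand-in for ex7.cat. It relies on the standard java.lang.invoke.MethodType class to format the descriptor.

```java
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

class SignatureHelper {
    // Stand-in for the ex7.cat method used in the jar example.
    public static String cat(String[] input) {
        return String.join(" ", input);
    }

    /** Returns the JVM descriptor of the first method called 'name' that is declared in 'cls'. */
    static String descriptorOf(Class<?> cls, String name) {
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(name)) {
                return MethodType.methodType(m.getReturnType(), m.getParameterTypes())
                                 .toMethodDescriptorString();
            }
        }
        throw new IllegalArgumentException("No method named " + name);
    }

    public static void main(String[] args) {
        // Prints ([Ljava/lang/String;)Ljava/lang/String;
        System.out.println(descriptorOf(SignatureHelper.class, "cat"));
    }
}
```

The printed text is the part that follows the colon in the IMPORT string of the ex7 example. Running javap -s on the compiled class gives the same information.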

Handling Java objects

One challenge when working with ECL (a functional language, without side effects) and embedding a procedural language such as Java is handling the lifetime of objects in the embedded code. If you only ever want to call “stateless” functions in your embedded language of choice then there is no difficulty, but if you need to persist some state from one call of your function to the next it becomes a little more difficult. Reasons for wanting to persist state vary. Sometimes, for example, it’s to avoid recalculating or reloading a data structure, where the code still acts as if it were stateless and the state is purely for efficiency. At other times you may want to use the state to return a different value each time a function is called, for example when implementing a tokenizer or accumulator.

I’m going to use as an example a simple Java class that adds up the values passed to it and returns a running total:

public class JavaAccumulator 
{ 
  public synchronized int accumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  private int total = 0; 
}

Prior to HPCC Systems 7.2.0, there was no way to call the JavaAccumulator.accumulate function from ECL, as only static methods could be called. Instead, you could do something like this (and you still can in HPCC Systems 7.2.0, if you want):

public class JavaAccumulator 
{ 
  public synchronized int doAccumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  public synchronized static int accumulate(int a) 
  { 
    if (theAccumulator==null) 
      theAccumulator = new JavaAccumulator(); 
    return theAccumulator.doAccumulate(a); 
  } 
  private int total = 0; 
  private static JavaAccumulator theAccumulator; 
}

If we wrap this in ECL code, we can see the effect of calling it twice:

IMPORT Java; 

INTEGER accumulate(INTEGER a) := EMBED(Java) 
public class JavaAccumulator 
{ 
  public synchronized int doAccumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  public synchronized static int accumulate(int a) 
  { 
    if (theAccumulator==null) 
      theAccumulator = new JavaAccumulator(); 
    return theAccumulator.doAccumulate(a); 
  } 
  private int total = 0; 
  private static JavaAccumulator theAccumulator; 
} 
ENDEMBED; 

ORDERED 
( 
  accumulate(1); 
  accumulate(2); 
)

This outputs the following:

1
3

Which is as expected. But there are some challenges with this approach. In particular, it is unclear what the scope or lifetime of the accumulator is. If the query is executed again:

  • Is the same accumulator still present, or has it been released?
  • If you have multiple threads, do they share an accumulator?
  • Do Thor slaves share a single accumulator?
  • What about Thor channels?

And even if you know what the answers to those questions are, they may not be the answers that you want for your particular use case. In HPCC Systems 7.2.0, therefore, we provide a bit more control over the lifetime of Java objects that may be used to maintain state between embedded Java calls.

The first change in HPCC Systems 7.2.0 is that you are now allowed to call non-static Java methods as well as static ones. If you do, a Java object will be created implicitly (a parameterless constructor, explicit or implicit, is required) before calling the method. So we can code:

IMPORT Java; 

INTEGER accumulate(INTEGER a) := EMBED(Java) 
public class JavaAccumulator 
{ 
  public int accumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  private int total = 0; 
} 
ENDEMBED; 

ORDERED 
( 
  accumulate(1); 
  accumulate(2); 
)

However, as it stands this will create a new Java object (of class JavaAccumulator) for each call (and release it at the end of the call), so no state is preserved, and the output is as follows:

1
2
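The "new object for each call" behaviour can be mimicked in plain Java. The following is only a sketch of the idea, not the plugin's code: PerCallDemo and callOnFreshObject are hypothetical names, and reflection is used simply because it is one standard way to call a parameterless constructor and then a method on the result.

```java
import java.lang.reflect.Method;

class PerCallDemo {
    public static class JavaAccumulator {
        private int total = 0;

        public int accumulate(int a) {
            total = total + a;
            return total;
        }
    }

    /** Creates a fresh object through the parameterless constructor, then calls accumulate on it. */
    static int callOnFreshObject(int a) {
        try {
            Object obj = JavaAccumulator.class.getDeclaredConstructor().newInstance();
            Method m = JavaAccumulator.class.getMethod("accumulate", int.class);
            return (Integer) m.invoke(obj, a); // obj is dropped afterwards, so no state survives
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callOnFreshObject(1)); // 1
        System.out.println(callOnFreshObject(2)); // 2, not 3: the total starts from zero again
    }
}
```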

If we want to use the same Java object for multiple calls, we use the PERSIST option on the embed statement (this is analogous to the option of the same name in the Python embed code). There are several possible values:

  • PERSIST('Thread')
    The object will be implicitly created the first time a call is made on this thread, and will be reused for all calls on the same thread.
  • PERSIST('Channel')
    The object will be created the first time a call is made by this workunit, on this channel, and reused for all calls by the same workunit/channel.
  • PERSIST('Workunit')
    The object will be created the first time a call is made by this workunit, and reused for all calls by the same workunit.
  • PERSIST('Query')
    The object will be created the first time a call is made by this query, and reused for all calls by the same query.
  • PERSIST('Global')
    The object will be created the first time a call is made by any workunit/query, and reused thereafter.

Note: For all of the above, the object can only be shared within a single process, and will only live as long as that process lives. So in a Thor cluster, for example, each slave will have an independent object even if you have specified workunit or global mode. Query mode is identical to workunit mode other than on Roxie. On Roxie, it may be useful to specify PERSIST('Query') if you have an object with some complex initialization that you want to perform only once, before it is called many times from multiple calls to the query.

  • The difference between PERSIST('Channel') and PERSIST('Workunit') is only relevant on a Thor job where the Thor cluster has been configured to have multiple channels sharing a single Thor slave process.
  • PERSIST('Channel') will cause each channel to have a separate object, which is useful if, for example, you are using the Java object to accumulate information that you then want to roll up to a single result.
  • With PERSIST('Workunit') all channels on the same slave node would share the same object, which may be a good thing if your Java object is acting as a cache, but may be a bad thing if it is acting as a data source for a Thor job as the data may appear in multiple channels.
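As a rough mental model of the difference between a per-thread object and one shared across a process, here is a plain Java analogy. It does not use or reproduce the plugin: PersistScopeDemo and run are hypothetical names, a ThreadLocal stands in for PERSIST('Thread'), and a single shared instance stands in for the wider scopes within one process.

```java
class PersistScopeDemo {
    static class JavaAccumulator {
        private int total = 0;

        synchronized int accumulate(int a) {
            total = total + a;
            return total;
        }
    }

    /**
     * Starts 'workers' threads that each add 1 and then 2 to both a per-thread
     * and a shared accumulator. Returns the per-thread totals followed by the shared total.
     */
    static int[] run(int workers) {
        final JavaAccumulator shared = new JavaAccumulator();               // one object for everyone
        final ThreadLocal<JavaAccumulator> perThread =
                ThreadLocal.withInitial(JavaAccumulator::new);              // one object per thread
        final int[] results = new int[workers + 1];
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int slot = i;
            pool[i] = new Thread(() -> {
                perThread.get().accumulate(1);
                results[slot] = perThread.get().accumulate(2);
                shared.accumulate(1);
                shared.accumulate(2);
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        results[workers] = shared.accumulate(0); // read the final shared total
        return results;
    }

    public static void main(String[] args) {
        int[] r = run(2);
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // 3 3 6
    }
}
```

Each thread sees its own total of 3, while the shared object ends at 6. That is the same trade-off described above: separate objects are easy to roll up afterwards, whereas a shared one suits a cache.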

For example, if we update the above example to use PERSIST:

IMPORT Java; 

INTEGER accumulate(INTEGER a) := EMBED(Java : PERSIST('Workunit')) 
public class JavaAccumulator 
{ 
  public synchronized int accumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  private int total = 0; 
} 
ENDEMBED; 

ORDERED 
( 
  accumulate(1); 
  accumulate(2); 
)

Then we now get the answer:

1 
3

Notice the “synchronized” on the accumulate function definition. If using a persist mode other than 'Thread' then the function may end up being called from multiple threads at the same time.
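To see why the keyword matters, the short Java sketch below calls one accumulator from several threads at once. The names SynchronizedDemo and hammer are invented for this illustration. With synchronized in place every increment is counted; without it, the read-modify-write on total could interleave and updates could be lost.

```java
class SynchronizedDemo {
    static class JavaAccumulator {
        private int total = 0;

        // synchronized makes the read-modify-write of 'total' atomic across threads
        public synchronized int accumulate(int a) {
            total = total + a;
            return total;
        }
    }

    /** Calls accumulate(1) from several threads on one shared object; returns the final total. */
    static int hammer(int threads, final int callsPerThread) {
        final JavaAccumulator acc = new JavaAccumulator();
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) {
                    acc.accumulate(1);
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return acc.accumulate(0);
    }

    public static void main(String[] args) {
        System.out.println(hammer(4, 10000)); // 40000
    }
}
```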

In conjunction with the PERSIST option that controls the lifespan and scope of the implicitly-created Java objects, it is possible to specify a string to identify that a group of Java functions should use the same object, or that different calls to the same Java function should use different objects, even if their lifetime is the same. The string can be specified either using a separate option, GLOBALSCOPE, or by adding a string after the persist type, separated with a colon. For example, if you wanted two separate accumulators, you could write:

IMPORT Java; 

Accumulator(String id) := MODULE 
  EXPORT INTEGER accumulate(INTEGER a) := EMBED(Java : PERSIST('Workunit'), 
GLOBALSCOPE(id)) 
  public class JavaAccumulator 
  { 
    public synchronized int accumulate(int a) 
    { 
      total = total + a; 
      return total; 
    } 
    private int total = 0; 
  } 
  ENDEMBED; 
END; 
a1 := Accumulator('a1'); 
a2 := Accumulator('a2'); 

ORDERED 
( 
   a1.accumulate(1); 
   a1.accumulate(2); 
   a2.accumulate(3); 
   a2.accumulate(4); 
   a1.accumulate(5); 
)

Which will return:

1 
3 
3 
7 
8

Notice the use of the MODULE to make the code clearer. It is not required to code this way but it’s probably a good convention to use.
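One way to picture what the scope string does is as a key into a table of objects. The Java sketch below is only that picture, under the assumption that each distinct id gets its own lazily created object; ScopeRegistryDemo is a hypothetical name and nothing here is taken from the plugin's source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ScopeRegistryDemo {
    static class JavaAccumulator {
        private int total = 0;

        synchronized int accumulate(int a) {
            total = total + a;
            return total;
        }
    }

    // One accumulator per scope id, created the first time that id is used.
    private final Map<String, JavaAccumulator> objects = new ConcurrentHashMap<>();

    int accumulate(String scopeId, int a) {
        return objects.computeIfAbsent(scopeId, k -> new JavaAccumulator()).accumulate(a);
    }

    public static void main(String[] args) {
        ScopeRegistryDemo registry = new ScopeRegistryDemo();
        System.out.println(registry.accumulate("a1", 1)); // 1
        System.out.println(registry.accumulate("a1", 2)); // 3
        System.out.println(registry.accumulate("a2", 3)); // 3
        System.out.println(registry.accumulate("a2", 4)); // 7
        System.out.println(registry.accumulate("a1", 5)); // 8
    }
}
```

The sequence printed matches the ECL output above, because 'a1' and 'a2' never share a total.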

Sharing an object between multiple calls is a little more complex. If you are using IMPORT rather than EMBED, for example to call a Java class that has been embedded via a manifest, then it’s simple enough. Let’s add a function that resets the accumulator:

public class JavaAccumulator 
{ 
  public synchronized int accumulate(int a) 
  { 
    total = total + a; 
    return total; 
  } 
  public synchronized int clear() 
  { 
    int ret = total; 
    total = 0; 
    return ret; 
  } 
  private int total = 0; 
}

If this is compiled and added to the workunit via the manifest as described earlier, we can specify a group of functions that reference the same object using the scope string:

IMPORT Java; 

Accumulator(String id) := MODULE 
  EXPORT INTEGER accumulate(INTEGER a) := IMPORT(Java, 
'JavaAccumulator.accumulate' : PERSIST('Workunit'), GLOBALSCOPE(id)); 
  EXPORT INTEGER clear() := IMPORT(Java, 'JavaAccumulator.clear' : 
PERSIST('Workunit'), GLOBALSCOPE(id)); 
END; 

a1 := Accumulator('a1'); 
a2 := Accumulator('a2'); 

ORDERED 
( 
  a1.accumulate(1); 
  a1.accumulate(2); 
  a2.accumulate(3); 
  a2.clear(); 
  a2.accumulate(4); 
  a1.accumulate(5); 
)

This returns:

1 
3 
3 
3 
4 
8

It’s not quite so obvious how to do this when embedding the code inline. You don’t want to repeat the code (and even if you did, it would not work, as each embedded class is added only to that function’s classloader). However, there is one last new feature in HPCC Systems 7.2.0 that comes to our rescue. You can retrieve a reference to the implicitly-created Java object and you can pass it to a second function.

Retrieving the implicitly created Java object is done by using the name of the embedded Java class as the function name. In effect we are calling the constructor (but only if the implicit object has not yet been created). The return type should be UNSIGNED. Since Java does not support unsigned types we can recognise return values and parameters of this type as referring to Java objects.

To pass the object to a different function, we use this UNSIGNED value as the first parameter. The function being called does not specify any code, nor should it specify the class name, just the function name. For example:

IMPORT Java; 

Accumulator(String id) := MODULE 
  SHARED UNSIGNED JavaAccumulator() := EMBED(Java: PERSIST('Workunit'), 
GLOBALSCOPE(id)) 
  public class JavaAccumulator 
  { 
    public synchronized int accumulate(int a) 
    { 
      total = total + a; 
      return total; 
    } 
    public synchronized int clear() 
    { 
      int ret = total; 
      total = 0; 
      return ret; 
    } 
    private int total = 0; 
   } 
   ENDEMBED; 
   SHARED INTEGER _accumulate(UNSIGNED o, INTEGER a) := IMPORT(Java, 'accumulate'); 
   EXPORT INTEGER _clear(UNSIGNED o) := IMPORT(Java, 'clear'); 
   EXPORT INTEGER accumulate(INTEGER a) := _accumulate(JavaAccumulator(), a); 
   EXPORT INTEGER clear() := _clear(JavaAccumulator()); 
END; 

a1 := Accumulator('a1'); 
a2 := Accumulator('a2'); 

ORDERED 
( 
  a1.accumulate(1); 
  a1.accumulate(2); 
  a2.accumulate(3); 
  a2.clear(); 
  a2.accumulate(4); 
  a1.accumulate(5); 
)

Which returns:

1 
3 
3 
3 
4 
8

Note: The calls to accumulate and clear do not need PERSIST information or class names to be specified; both are controlled by the object being passed in. You can specify the expected class name if you want, using the following syntax:

EXPORT INTEGER _clear(UNSIGNED o) := IMPORT(Java, 'JavaAccumulator::clear');

This will then be checked on each call. If no classname is specified in this way, any object that exports the specified function can be passed in.
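The idea of calling a method by name on whatever object is passed in, with an optional class check, has a close analogue in Java reflection. The sketch below is an illustration under that analogy only; ObjectCallDemo and its call helper are hypothetical and say nothing about how the plugin resolves methods internally.

```java
import java.lang.reflect.Method;

class ObjectCallDemo {
    public static class JavaAccumulator {
        private int total = 0;

        public synchronized int accumulate(int a) {
            total = total + a;
            return total;
        }

        public synchronized int clear() {
            int ret = total;
            total = 0;
            return ret;
        }
    }

    /**
     * Calls a parameterless int method by name on 'target'. When expectedClass is
     * non-null, the object's simple class name is checked first.
     */
    static int call(Object target, String expectedClass, String method) {
        String actual = target.getClass().getSimpleName();
        if (expectedClass != null && !actual.equals(expectedClass)) {
            throw new IllegalArgumentException("Expected " + expectedClass + " but got " + actual);
        }
        try {
            Method m = target.getClass().getMethod(method);
            return (Integer) m.invoke(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        JavaAccumulator acc = new JavaAccumulator();
        acc.accumulate(3);
        System.out.println(call(acc, null, "clear"));              // 3: any object with clear() is accepted
        System.out.println(call(acc, "JavaAccumulator", "clear")); // 0: class name checked first
    }
}
```

Passing an object of a different class together with an expected class name makes the helper throw, which mirrors the per-call check described above.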

In the example above, it may look as though calling JavaAccumulator() every time would be inefficient, or would end up creating lots of objects. But because we return a reference to the already-created object for the specified scope, if there is one, it’s actually not too bad. Note that it’s an error to use this form of function without a PERSIST being specified, since there’s no point returning an object which is already out of scope.

There’s actually a bigger concern that we might call the function too few times, if the ECL compiler decides that the function is invariant and therefore can be precalculated. The compiler does not know about the semantics of the PERSIST options, and may need some “encouragement” to ensure that the calls are made in the proper context. For example, when used on Thor, you would not want the JavaAccumulator() call to be precalculated on EclAgent and stored as a temporary in the workunit. The referenced object would be in a different process (typically on a different computer) from the Thor slaves, which would therefore fail when trying to call a method in it.

Thor usage with multiple slave processes

For example, suppose you wanted to use the JavaAccumulator class to create a running total from a dataset, using a PROJECT:

IMPORT Java; 

Accumulator := MODULE 
  SHARED UNSIGNED JavaAccumulator() := EMBED(Java: PERSIST('Thread')) 
  public class JavaAccumulator 
  { 
    public synchronized int accumulate(int a) 
    { 
      total = total + a; 
      return total; 
    } 
    public synchronized int clear() 
    { 
      int ret = total; 
      total = 0; 
      return ret; 
    } 
    private int total = 0; 
   } 
   ENDEMBED; 
   SHARED INTEGER _accumulate(UNSIGNED o, INTEGER a) := IMPORT(Java, 'accumulate'); 
   EXPORT INTEGER _clear(UNSIGNED o) := IMPORT(Java, 'clear'); 
   EXPORT INTEGER accumulate(INTEGER a) := _accumulate(JavaAccumulator(), a); 
   EXPORT INTEGER clear() := _clear(JavaAccumulator()); 
END; 

MyRec := RECORD 
  integer i; 
END; 

d := DATASET([{1},{2},{3}], MyRec); 

MyRec t(MyRec l) := TRANSFORM 
  SELF.i := Accumulator.accumulate(l.i); 
END; 

accumulated := PROJECT(d, t(LEFT)); 

OUTPUT(accumulated, {i});

This code runs fine on Roxie or EclAgent, but if you run it in Thor you will get an error:

javaembed: In method accumulate: Invalid value passed for 'this'

Looking at the generated C++ tells us why:

v3 = user1(ctx); 
ctx->setResultUInt("gl2",4294967293U,v3,8U); 
ctx->executeGraph("graph1",false,0,NULL);

The ECL code generator has decided that the call to JavaAccumulator only needs to be made once, before executing the graph. But we need it to be made on each slave. In fact, we really want it evaluated per activity. The simplest way to force the function to be evaluated locally on the slaves is to force it to be evaluated for each record of the PROJECT activity, and the simplest way to do that is to give it a parameter that depends on the current record:

IMPORT Java; 

Accumulator(INTEGER dummy) := MODULE 
  SHARED UNSIGNED JavaAccumulator(INTEGER dummy) := EMBED(Java: PERSIST('Thread')) 
  public class JavaAccumulator 
  { 
    public JavaAccumulator(int dummy) {} 
    public synchronized int accumulate(int a) 
    { 
      total = total + a; 
      return total; 
    } 
    public synchronized int clear() 
    { 
      int ret = total; 
      total = 0; 
      return ret; 
    } 
    private int total = 0; 
  } 
   ENDEMBED; 
   SHARED INTEGER _accumulate(UNSIGNED o, INTEGER a) := IMPORT(Java, 'accumulate'); 
   EXPORT INTEGER _clear(UNSIGNED o) := IMPORT(Java, 'clear'); 
   EXPORT INTEGER accumulate(INTEGER a) := _accumulate(JavaAccumulator(dummy), a); 
   EXPORT INTEGER clear() := _clear(JavaAccumulator(dummy)); 
END; 

MyRec := RECORD 
  integer i; 
END; 

d := DATASET([{1},{2},{3}], MyRec); 

MyRec t(MyRec l) := TRANSFORM 
  SELF.i := Accumulator(l.i).accumulate(l.i); 
END; 

accumulated := PROJECT(d, t(LEFT)); 

OUTPUT(accumulated, {i});

Now we get the expected answer on Thor too. The value passed to the constructor is not used, but because it comes from the current record it forces the ECL compiler to make the call per record.

This particular “gotcha” with ensuring that the Java object is created in the right place is only an issue when using the ‘call the constructor and pass the object explicitly’ model. If you’d rather avoid the issue altogether, stick to letting the Java embed plugin manage the Java object for you implicitly behind the scenes, and all is well. Unfortunately, as of HPCC Systems 7.2.0 that means you would have to use a JAR file and IMPORT statements rather than embedding Java code inline, but perhaps that will change in a future update.

More information about using embedded languages with HPCC Systems: