One method for dynamically updating superkeys

Comments and questions related to the Enterprise Control Language

Wed Mar 13, 2013 12:40 pm

There have been a lot of questions on the forum recently concerning dynamically updating superkeys. For those of you who aren't clear on what superkeys are: A superkey is a superfile composed only of payload indexes and (optionally) other superkeys. Payload indexes are indexes that do not reference datasets and contain all the fields you need for further work, both keyed (indexed) values and read-only fields (the payload).

The typical usage pattern is to create an initial superkey, usually as a superfile with one payload index. A Roxie query is then written to access the superkey. Everyone seems to have no problem with this part.

The difficulty surrounds the fact that Roxie usually obtains a lock/reference to the superkey for that query. While the query is published, the superkey cannot be modified. Since the usual desire is to update the superkey with new data on the fly, this can pose a challenge.

The usual answer -- which is the best-performing and most correct way of doing this -- is to unpublish the query, update the superkey, then republish the query. If your requirement is to keep that query available to an external caller at all times, then the problem becomes "how do I keep the query alive?" rather than "how do I update the superkey?" One answer to this is to have two Roxie clusters and switch between them. If the Roxie clusters are configured to copy their underlying data resources (which is the default setting) then each will have independent copies of the data. Point all your callers to Roxie cluster A, update cluster B, point all your callers to B, then update A. This technique has many advantages, such as giving you time to QA the update, roll back any problem updates, etc. without impacting callers. Another advantage of this technique is performance: You're using the system as it is designed to be used, and all the data- and code-performance optimizations are in place. The expense is that your infrastructure will be somewhat bigger.

However, there are cases where you don't need all that performance, or you can't afford the extra infrastructure, or both. Or maybe you're just doing a proof-of-concept test and you don't want to go to all the extra work.

I was rereading a section of the language reference manual and stumbled across something I'd read before but hadn't used in practice. In the Scope and Logical Filenames section of the manual there is a subsection titled "Dynamic Files." It reads:

Dynamic Files

In Roxie queries (only) you can also read files that may not exist at query deployment time, but that will exist at query runtime by making the filename DYNAMIC.
The syntax looks like this:

DYNAMIC('<filename>')

For example,

Code: Select all
MyFile :=DATASET(DYNAMIC('~training::import::myfile'),RecStruct,FLAT);


This causes the file to be resolved when the query is executed instead of when it is deployed.

This formed the basis for a different method for updating superkeys dynamically.

The executive summary is: Create an empty superkey and a Roxie query that references it dynamically. Because it is marked dynamic, the query will not retain a reference to it and it therefore will not have a 'lock' on the superkey or its contents. You can therefore update the superkey's contents at will, without performing the unpublish-update-republish task.
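
As a rough illustration, a query of this kind might look like the sketch below. This is not the attached roxie_query.ecl, whose contents are not shown in this thread. The field names (myKey, myValue) are borrowed from the create_files.ecl listing posted later in the thread, the result names are made up, and the sketch assumes DYNAMIC can wrap the filename of an INDEX in the same way the manual shows for DATASET.

```ecl
// Hypothetical stand-in for roxie_query.ecl; the attached file may differ.
kSuperFilePath := '~dynamic_file_test::superfile';

// DYNAMIC defers filename resolution until the query executes, so the
// published query is not tied to the subkeys present at publish time.
idx := INDEX
   (
      {UNSIGNED4 myKey},       // keyed field (assumed layout)
      {UNSIGNED8 myValue},     // payload field (assumed layout)
      DYNAMIC(kSuperFilePath)
   );

// Three results, only to show that the query ran (result names are invented).
OUTPUT(COUNT(idx), NAMED('record_count'));
OUTPUT(CHOOSEN(idx, 10), NAMED('first_records'));
OUTPUT(CHOOSEN(idx(myKey = 1), 10), NAMED('key_one_matches'));
```

The only part that matters for the technique is the DYNAMIC wrapper around the superkey's name; the rest is ordinary index access.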

I've found some limitations with this technique and there could be more. Specifically:

    Payload indexes are not copied to the Roxie cluster. Roxie will always "reach back" to Thor in order to read those files. If your Roxie and Thor clusters are mismatched or reside on different nodes, this will incur a network penalty.
    Depending on configuration settings, Roxie will open file handles to the payload index files at either query publish time or lazily. Either way, Roxie normally tries to keep those file handles open in order to avoid the performance penalty of reopening the files for every query. This technique prevents Roxie from doing that; the indexes will be reopened for every query. This will also impose a performance penalty, and it gets worse if there is a network penalty. This performance penalty will only get worse as the number of payload indexes within the superkey increases, so it becomes vital that some periodic task rolls up and aggregates those updates frequently.

Enclosed with this posting is a zip file containing two ECL files that demonstrate this technique.

    create_files.ecl: This builds Thor code that sets up files needed for the test. An empty superkey named '~dynamic_file_test::superfile' is created along with three payload indexes named '~dynamic_file_test::subkey_1', '~dynamic_file_test::subkey_2' and '~dynamic_file_test::subkey_3'.
    roxie_query.ecl: This is a Roxie query that references the superkey. It has no parameters and outputs three results just to show that the query worked.

To use this example, first execute the code within create_files.ecl to build the data, then publish the contents of roxie_query.ecl as a Roxie query. Then it's play time:

    Using a web browser and the query interface (port 8002) on your cluster, submit the query. No parameters are needed.
    Using a web browser and ECL Watch (port 8010) adjust the contents of the superkey by manually adding and removing payload indexes. Go back to the first step and try the query again. Notice that you don't have to unpublish the query before manipulating the contents of the superkey.

While I didn't supply any code for automatically generating updates to the superkey, I wouldn't think that creating such code would be difficult for any decent ECL programmer.
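
For reference, a minimal sketch of such update code is shown below. It assumes that subkey_1 is currently in the superkey and that subkey_2 already exists but is not yet attached; both assumptions are only for illustration. The two changes are wrapped in a superfile transaction so that they are committed together.

```ecl
IMPORT Std;

kSuperFilePath := '~dynamic_file_test::superfile';

// Assumed state: subkey_1 is attached to the superkey, subkey_2 is built
// but not attached. Adjust the names to match your own files.
kOldSubkeyPath := '~dynamic_file_test::subkey_1';
kNewSubkeyPath := '~dynamic_file_test::subkey_2';

// Detach the old subkey and attach the new one as a single transaction.
SEQUENTIAL(
   Std.File.StartSuperFileTransaction(),
   Std.File.RemoveSuperFile(kSuperFilePath, kOldSubkeyPath),
   Std.File.AddSuperFile(kSuperFilePath, kNewSubkeyPath),
   Std.File.FinishSuperFileTransaction()
);
```

This runs as ordinary Thor code and does the same thing as the manual ECL Watch steps above. RemoveSuperFile only detaches the subkey here; the index file itself stays on disk.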

The only real oddity I've found with this technique involves an empty superkey. In that case, the Roxie query returns nothing rather than three empty results. There may be other strange behaviors, but I haven't run into them yet.

Cheers,

Dan
Attachments
DynamicSuperfiles.zip (1.19 KiB)
DSC
Community Advisory Board Member
Posts: 555
Joined: Tue Oct 18, 2011 4:45 pm

Thu Mar 14, 2013 9:06 am

Hi Dan,

Good work with that DYNAMIC stuff!

I have used the 'packages' approach to push the newly created INDEXes to Roxie.

Actually, I was redirected to this post from the one below:

https://hpccsystems.com/bb/viewtopic.php?f=8&t=830&p=3725#p3725

I tried your approach in the context of my requirement and it worked. Still, I have the following questions:

    Can you provide a comparison of packages vs. DYNAMIC files?
    I didn't understand the part below. Can you elaborate?

The only real oddity I've found with this technique involves an empty superkey. In that case, the Roxie query returns nothing rather than three empty results. There may be other strange behaviors, but I haven't run into them yet.


Thanks and regards!
prachi
 
Posts: 46
Joined: Mon Jul 23, 2012 11:50 am

Thu Mar 14, 2013 11:51 am

Can you provide a comparison of packages vs. DYNAMIC files?

This dynamic technique is simpler to set up and use, but has a runtime performance penalty and requires a fairly standard Thor/Roxie configuration. Personally, I think it's a perfectly good solution providing your requirements fit within those constraints.

I admit to not fully understanding packages. Version 3.10.2 suffered from some problems with package management that prevented me from easily experimenting with that feature (you couldn't easily delete mistakes, basically). Version 3.10.4 was just released and it addresses those problems, so I should go back and experiment some more. From what I understand, though, packages should provide a performant method for updating superkeys but at the expense of increasing subkey management complexity. Specifically, packages seem to update the superkey/subkey relationships for queries but they do not update the superkeys and subkeys themselves. That means that it is not easy to see and manage the superkey/subkey relationships through another tool, such as ECL Watch. I could be wrong about that, however; that's what I want to experiment with.

The only real oddity I've found with this technique involves an empty superkey. In that case, the Roxie query returns nothing rather than three empty results. There may be other strange behaviors, but I haven't run into them yet.

The problem is as described. If you have a SOAP-based caller that is expecting three results and the superkey is empty (a superkey with no subkeys), then the response will be invalid. Instead of seeing three empty results, you'll see absolutely nothing in the response. You will get a reply to the SOAP call; it just won't be formatted correctly. That's less of a problem with a JSON interface, though, as the response isn't strongly defined anyway.

I hope this helps.

Dan
DSC

Fri Mar 15, 2013 4:43 am

This is quite an interesting discussion, and I can already map some use cases to it. Thanks, Dan, for the detailed explanation of dynamic references to superkeys.

Packages could also be a good candidate for this workaround, but as I understand it, packages may require a cluster restart. I am not sure, though. In specific cases a package may be able to keep data in memory without a cluster restart (this needs validation). If that is true, packages could be used much like superfiles, and we could reindex files to consolidate the data when Roxie is idle.
Durai
Community Advisory Board Member
 
Posts: 24
Joined: Thu Jun 23, 2011 2:50 pm

Fri Mar 15, 2013 4:26 pm

Thanks for this sample,

I see a restriction with DYNAMIC files: DYNAMIC superfiles are not 'locked', but published queries still retain references to the superkey's contents.

This means it is not possible to replace the contents of a given superkey subfile for a published Roxie query.

Here is the create_files.ecl sample (I made some changes to add the subfiles to the superkey):

Code: Select all
IMPORT Std;

//------------------------------------------------------------------------------

kSuperFilePath := '~dynamic_file_test::superfile';
kMaxRecordsPerSubkey := 1000;

//------------------------------------------------------------------------------

DataRec := RECORD
   UNSIGNED4   myKey;
   UNSIGNED8   myValue;
END;

CreateSubkey(UNSIGNED1 keyCount, UNSIGNED4 recordCount) := FUNCTION
   subkeyPath := '~dynamic_file_test::subkey_' + (STRING)keyCount;
   
   DataRec MakeDataRec(UNSIGNED c) := TRANSFORM
      SELF.myKey := c;
      SELF.myValue := RANDOM();
   END;
   
   subkeyData := DISTRIBUTE(DATASET(recordCount,MakeDataRec(COUNTER)));
   
   idx := INDEX
      (
         subkeyData,
         {
            myKey
         },
         {
            myValue
         },
         subkeyPath
      );
   
   
   RETURN SEQUENTIAL(BUILD(idx,OVERWRITE), STD.File.AddSuperFile(kSuperFilePath, subkeyPath));
END;

//------------------------------------------------------------------------------

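// Create (or reuse) the superkey, empty it (deleting the old subkeys),
// then build the three subkeys and attach each one to the superkey.
SEQUENTIAL(
   Std.File.CreateSuperFile(kSuperFilePath, allow_exist := TRUE),
   Std.File.ClearSuperFile(kSuperFilePath, TRUE),
   PARALLEL(
      CreateSubkey(1, RANDOM() % kMaxRecordsPerSubkey),
      CreateSubkey(2, RANDOM() % kMaxRecordsPerSubkey),
      CreateSubkey(3, RANDOM() % kMaxRecordsPerSubkey)
   )
);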


After publishing roxie_query.ecl, you can submit the query successfully until you resubmit create_files.ecl.

If you resubmit create_files.ecl after calling the published roxie_query.ecl, the query returns this response:

dynamic superkey test Response
Exception
Reported by: Roxie
Message: Different version of dynamic_file_test::subkey_1 already loaded: sizes = 32768 32768 Date = 2013-03-15T16:14:40 2013-03-15T16:14:09


The only way I found to update superkeys dynamically is to add new subfiles under brand-new logical names.
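
A minimal sketch of that approach, reusing the record layout and superkey name from the listing above, could look like this. The naming scheme (suffixing the subkey name with the workunit ID) and the generated placeholder data are assumptions made for illustration, not something tested in this thread.

```ecl
IMPORT Std;

kSuperFilePath := '~dynamic_file_test::superfile';

DataRec := RECORD
   UNSIGNED4   myKey;
   UNSIGNED8   myValue;
END;

// Assumption: the workunit ID differs for every run, so using it as a
// suffix gives each new subkey a logical name that was never used before.
newSubkeyPath := '~dynamic_file_test::subkey_' + WORKUNIT;

// Placeholder update data; a real job would read its new records instead.
DataRec MakeDataRec(UNSIGNED c) := TRANSFORM
   SELF.myKey := c;
   SELF.myValue := RANDOM();
END;

newData := DATASET(100, MakeDataRec(COUNTER));

newIdx := INDEX(newData, {myKey}, {myValue}, newSubkeyPath);

// Build the new subkey, then attach it to the existing superkey.
// OVERWRITE is deliberately omitted: if a file with this name already
// exists, the job should fail rather than replace it.
SEQUENTIAL(
   BUILD(newIdx),
   Std.File.AddSuperFile(kSuperFilePath, newSubkeyPath)
);
```

Because no existing logical file is ever rebuilt in place, Roxie never sees a second version of a name it has already loaded. The cost is that subkeys accumulate and need to be consolidated or removed by some other job.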

Regards
David
janssend
 
Posts: 13
Joined: Thu May 03, 2012 9:14 am

Fri Mar 15, 2013 4:35 pm

Hi David,

In my testing I found -- or thought I found -- that if the Roxie query was published when only an empty superkey was present then everything Just Worked. You could add, remove, and swap contents without any problems. If, however, a subkey was present when the query was published then Roxie would indeed latch onto that subkey and copy it (if the configuration is set up that way). The subkey would then be subject to the same restrictions as before. So the key step was ensuring that the superkey was empty when the query was published.

Does that match with your findings?

Dan
DSC

Mon Mar 18, 2013 1:26 pm

Hi Dan (and thanks for your answer),

I did what you said: I published roxie_query before adding any subkey to the dynamic superkey. But I get the same result when calling the published query:

dynamic superkey test Response
Exception
Reported by: Roxie
Message: Different version of dynamic_file_test::subkey_1 already loaded: sizes = 32768 32768 Date = 2013-03-18T13:06:44 2013-03-18T13:06:25

I have probably misunderstood something, but what? Do you know if there is a way to set up Roxie (using the HPCC configuration manager) to force the Roxie nodes to reload changed dynamic files? (I would have thought DYNAMIC did that by default.)

To be honest, this issue is crucial for us, because we want to be able to update the data behind published Roxie queries very often. (We may have to think about packages?)

Regards.
David
janssend
 
Posts: 13
Joined: Thu May 03, 2012 9:14 am

Mon Mar 18, 2013 2:30 pm

janssend wrote: Message: Different version of dynamic_file_test::subkey_1 already loaded: sizes = 32768 32768 Date = 2013-03-18T13:06:44 2013-03-18T13:06:25

I think I failed to fully read your earlier reply; my apologies.

I've seen this "different version" error pop up when Roxie copies a logical file from Thor and then something goes wrong with the tracking of that file. What's happening is that the file physically exists in the Roxie portion of the distributed file system but Dali does not know about it (it doesn't show up when you Browse Logical Files, for instance). I've found that physically deleting the Roxie version of the file usually clears the problem, which means going into each of your nodes and deleting those file parts. They should all be within /var/lib/HPCCSystems/hpcc-data/roxie/ on a standard installation. If you don't have any published Roxie queries you can simply delete that entire directory.

There is almost certainly something wrong with what I just wrote, either the diagnosis or the fix. That 'fix' is really heavy-handed and there is probably a much more elegant way to resolve the issue. Plus, I don't really know what's going on under the covers; this is just the explanation I've come up with.

If you're uncomfortable deleting files like this, there is one easy thing to try: Rename the subkeys to some other name. You won't collide with anything that's already existing if you start with a new name.

Let me know what you find!

Dan
DSC

Wed Mar 20, 2013 9:51 am

Thank you Dan,

I changed my ECL scripts to avoid reusing the same subfile names, and it seems to work. I still have to test simultaneous Roxie query calls and superfile updates. But in my case, DYNAMIC files make Roxie data updates easier.

Thank you again.
Regards
David
janssend
 
Posts: 13
Joined: Thu May 03, 2012 9:14 am

Wed Mar 20, 2013 11:33 am

Excellent news, David!

I think the real solution to this problem of updating superkeys on the fly lies with Packages, though. One day when I have some time I'm going to try to figure out the "recipe" for making that work. If/when I do, I'll be sure to publish that to the forums. I'm hoping, however, that someone beats me to the punch and publishes first!

Cheers,

Dan
DSC
