Realtime Data Updates In ROXIE

As new ECL programmers, we’ve all been there.  Using the HPCC Systems big data technology, you’ve successfully imported a bunch of data, analyzed it using ECL, created aggregated datasets, built an index around the aggregations and written ROXIE code to deliver query results on those aggregations in sub-second time.  Cool stuff!

But now you have new data.  It’s a complete refresh of the data you used before.  You want to process the new data using the same code and see the results in ROXIE.  No problem!  Just import, analyze, aggregate, build the index….

System error: 30: SDS: Lock held SDS Reply Error : SDS: Lock held Lock is
held performing changeMode on connection to : ...

 

Whoa.  That was not expected.

File Locks

It’s worthwhile to talk a bit about file locking in HPCC Systems.

Somewhat simplified, ECL code normally locks files it needs to read from.  When an ECL job is executed via submit, one of the first things the runtime system does is ensure that all input files referenced by the code are present.  The system then locks those files to prevent other jobs from changing the data.  The locks are released when the job ends.

ROXIE queries use file locks the same way, but the lock is normally created when the query is published and released when the query is deleted.1  Locks are not seized and released for every inbound request.  Locking the indexes only once, during publishing, allows ROXIE to intelligently cache index metadata and partial results, among other things, and to avoid lock contention issues while processing a request.  That greatly increases performance.

File locks are tracked using a file’s full logical filename.  The logical filename, along with all other metadata about the file, is maintained by the Dali process.  The type and presence of a lock, and which process owns it, are among the items tracked in metadata.  Thor and ROXIE both consult Dali when they need to verify access to an existing file.

A Simple ROXIE Query

Let’s create a simple ROXIE query that we can talk about, named toy_query.  Here is the ECL code for it:

// Input parameter
UNSIGNED6 in_employee_id := 0 : STORED('EmployeeID');

// Record definition of the index file's contents
EmployeeRec := RECORD
    UNSIGNED6   employee_id;
    STRING40    last_name;
    STRING20    first_name;
END;

// Definition of the actual index file
employeeIdx := INDEX
    (
        {EmployeeRec.employee_id},  // Search keys
        {EmployeeRec},              // Payload
        '~employee::id_lookup'      // Index file path
    );

// Filter for the record(s) associated with the input parameter
foundRecord := employeeIdx(employee_id = in_employee_id);

// Return the results
OUTPUT(foundRecord, NAMED('EmployeeName'));

 

This query is pretty straightforward:  It accepts an employee ID as a parameter, filters for records with that ID in an index file (employee::id_lookup), then returns any results.  The index file was created earlier using the BUILD() command in an ECL job running on Thor.
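
For reference, the Thor job that builds such an index might look something like the sketch below.  The inline employee rows are made up purely for illustration (a real job would read them from a logical file), and the key and payload layout is assumed to match the INDEX declaration used by the query.

```ecl
// Sketch only: the sample rows below are hypothetical
EmployeeRec := RECORD
    UNSIGNED6   employee_id;
    STRING40    last_name;
    STRING20    first_name;
END;

// Stand-in for the real employee data
employees := DATASET
    (
        [
            {1001, 'Smith', 'Jane'},
            {1002, 'Jones', 'Sam'}
        ],
        EmployeeRec
    );

// Same logical filename that the ROXIE query reads from
employeeIdx := INDEX
    (
        employees,
        {employee_id},              // Search keys
        {last_name, first_name},    // Payload
        '~employee::id_lookup'      // Index file path
    );

// Write the index file, replacing any unlocked file of the same name
BUILD(employeeIdx, OVERWRITE);
```

Run on Thor, this writes the index file; it will fail with a lock error like the one shown earlier if a published query still holds a lock on that logical filename.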

When this query is published, a file lock will be created on the employee::id_lookup logical file path in Dali.  The lock won’t be released until the query is deleted.  Because of that lock, any attempt to delete or update the index file will fail.

Now that we have a query to talk about, let’s see how we can update the data it refers to.

It Works In Dev…

There is a straightforward method to update the employee::id_lookup index file:

  1. Delete the ROXIE query.  This releases the lock on the index file.
  2. Run the Thor job that recreates the employee::id_lookup index file.
  3. Republish the ROXIE query.  Publishing re-establishes the file lock.

This is a perfectly valid method in a development environment, where you’re trying to figure out if everything is working.  If changes are needed, just rebuild everything.

Outside of a development environment, this method can be pretty unsatisfactory.  The biggest issue is that the query would be unavailable while the index file is being built.  If the index is very large, the query may be unavailable for quite some time.  The callers of the query, or your business partners, may not appreciate the downtime.

Switching Things Up With Superkeys

HPCC Systems gives you the ability to logically group identically formatted files into containers.  When those files are indexes, we call their container a “superkey” and we often refer to the member index files as “subkeys.”  With other file types, we call their container a “superfile” and refer to those member files as “subfiles.”  We are going to talk about superkeys here, but almost everything below refers to both superkeys and superfiles (mainly because they are basically the same thing).

Superkeys are pretty neat:

  • They are named with a logical filename, just like a regular index file.
  • A subkey can be a member of more than one superkey.
  • A superkey can be a member of other superkeys, so you can create a hierarchy of data.
  • Part of ECL’s Standard Library is devoted to managing superkeys and their contents.  See the “SuperFiles” section within the Standard Library Reference manual at https://hpccsystems.com/training/documentation.  (Note that the documentation refers to these containers as superfiles almost exclusively, but all of the functions apply to superkeys as well.)
  • The best part is how you read data from superkeys from within ECL.  Specifically, anywhere in code where you would type a logical pathname to refer to an index (for reading, not writing), you can substitute the logical pathname to a superkey instead.  Behind the scenes, ECL will open every index file in the superkey and give you access to all of them as if they were a single index file.  The ECL code is otherwise unchanged.

There is much more about using superkeys and superfiles in the ECL Programmer’s Guide at https://hpccsystems.com/training/documentation.

So why are we discussing superkeys?  Using a superkey, along with one or two functions from the ECL Standard Library, gives us another way of updating data.

First, let’s modify our toy ROXIE query to use a superkey rather than directly reference the index file:

  1. Delete the ROXIE query.
  2. Using the “Files” tab within ECL Watch, add the employee::id_lookup index file to a new superkey named employee::id_lookup_super.  You could do this programmatically with ECL as well, using the Std.File.CreateSuperFile() and Std.File.AddSuperFile() functions from the Standard Library.  Once you do this, you can see the new superkey appear in the file list in ECL Watch, and its contents will include the old index file.
  3. Modify the ROXIE query’s INDEX statement to read from the superkey instead:
// Definition of the actual index file
employeeIdx := INDEX
    (
        {EmployeeRec.employee_id},
        {EmployeeRec},
        '~employee::id_lookup_super'    // Superkey file path
    );

 

  4. Republish the ROXIE query.
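
The programmatic alternative mentioned in step 2 could be sketched as follows.  This is a minimal example rather than a complete job; it assumes the index file already exists and that the ROXIE query has already been deleted.

```ecl
IMPORT Std;

superkeyPath := '~employee::id_lookup_super';
subkeyPath   := '~employee::id_lookup';

SEQUENTIAL
    (
        // Create the empty superkey; the third argument (allowExist = TRUE)
        // avoids an error if the superkey is already there
        Std.File.CreateSuperFile(superkeyPath, FALSE, TRUE),

        // Add the existing index file as a subkey, inside a transaction
        Std.File.StartSuperFileTransaction(),
        Std.File.AddSuperFile(superkeyPath, subkeyPath),
        Std.File.FinishSuperFileTransaction()
    );
```

Wrapping the change in a superfile transaction is a design choice here, not a requirement for a single addition.  It becomes more useful once several changes have to succeed or fail together.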

If you call the query, it will show the same results as before.  So, we did some more work and got the same results.  At least we didn’t break anything….

How does this change affect the file locks?  Well, not necessarily in the most helpful way.  After publishing the query, there will be two file locks:  one on the index file and one on the superkey.  Adding an index file to a superkey creates a lock on that index file to preserve the relationship between the two.

All is not lost, though.  We can leverage some of the features of superkeys to change the order in which we prepare the new data for our query.  Our new steps are:

  1. Create a new index file using the BUILD() command in ECL in a job running on Thor, this time giving it a new, unique name.  We will use employee::id_lookup_v2 as an example.
  2. Delete the ROXIE query.
  3. Remove the original employee::id_lookup index file from the employee::id_lookup_super superkey.  You could do this from ECL using the Std.File.RemoveSuperFile() function.
  4. Add the new employee::id_lookup_v2 index file to the employee::id_lookup_super superkey.   From ECL, you could use the Std.File.AddSuperFile() function instead.
  5. Republish the ROXIE query.  Note that you can republish from the original compiled workunit; there is no need to modify the source code and compile it.

Five steps instead of three.  What does this buy us?

The primary pain point with the original idea was the amount of time it took to rebuild the index.  The rebuild had to take place after the ROXIE query was deleted, in order to release the file lock on the index file, so the query was offline for the entire rebuild time.  Using superkey management, we can create a new index file while the original query is still serving results from the old index.  Only after the new index is built do we have to take the query down — releasing the file locks — and swap out the subkeys used by the superkey.  If these steps were performed in code rather than manually, the query would be offline for only a few seconds at most.
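
Performed in code, the subkey swap in steps 3 and 4 might look like the sketch below.  It assumes the new index has already been built and the ROXIE query has already been deleted (steps 1 and 2); republishing the query (step 5) happens outside of this ECL.

```ecl
IMPORT Std;

superkeyPath := '~employee::id_lookup_super';
oldSubkey    := '~employee::id_lookup';
newSubkey    := '~employee::id_lookup_v2';

// Swap the subkeys as a single unit of work
SEQUENTIAL
    (
        Std.File.StartSuperFileTransaction(),
        Std.File.RemoveSuperFile(superkeyPath, oldSubkey),
        Std.File.AddSuperFile(superkeyPath, newSubkey),
        Std.File.FinishSuperFileTransaction()
    );
```

Because the removal and the addition sit inside one transaction, the superkey should never be observed in an empty, half-updated state.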

A Few Seconds != Realtime

The title of this blog is “Realtime Data Updates In ROXIE” but the best update performance we’ve seen so far is still short of that.  What we really need is a method that leaves the ROXIE query online all the time, but still allows us to update the indexes it uses.2  We may have to replace all of the data at once, or roll up individual subkeys into a single subkey for performance reasons.

Package Maps

Package maps (or “packagemaps”) were created several years ago to help address the task of updating a running ROXIE cluster with new data.  Package maps expand on the concept of a superkey, where a reference to a single named container indirectly references one or more member subkeys, and then ties all that to a ROXIE query.  The documentation for package maps can be found in the ROXIE: The Rapid Data Delivery Engine user manual at https://hpccsystems.com/training/documentation.

Earlier, we created a superkey named employee::id_lookup_super and used it in our toy ROXIE query.  That superkey is a “physical” superkey, in the sense that you can see and manipulate it in ECL Watch.  Dali maintains the metadata describing that superkey and its member subkeys, just like every other logical file in the cluster.

A package map gives you the ability to describe a superkey and its member subkeys entirely through a mapping interface.  That mapping is independent of the mapping used by ECL Watch or even the Standard Library functions devoted to superfile management.  The mapping takes the form of an XML document, where superkeys and subkeys are defined with distinct nodes and then grouped in a natural hierarchy.  A deployed package map creates new entries in Dali that ROXIE will use automatically.

Important note:  Package maps do not alter file locking behavior.  Superkeys and subkeys referenced in a package map acquire file locks just like they would if package maps were not used.

Here is a package map that could be used to describe the superkey/subkey relationship we are now using in our toy query:

<RoxiePackages>
    <!-- id = name of Roxie query -->
    <Package id="toy_query">
        <!-- id = data package name -->
        <Base id="toy_query_data_pkg"/>
    </Package>
    <Package id="toy_query_data_pkg">
        <!-- id = superkey referenced in ECL -->
        <SuperFile id="~employee::id_lookup_super">
            <!-- id = actual subkey logical filename -->
            <SubFile value="~employee::id_lookup_v2"/>
        </SuperFile>
    </Package>
</RoxiePackages>

 

If you follow the id declarations and pay attention to the natural hierarchy in the XML document, you can see the mappings between a ROXIE query, a named data package, and the package contents.  The data package contains the superkey/subkey relationship.3

The data package name (toy_query_data_pkg) is a unique name, scoped to this ROXIE package map.  The name can be anything you want, within normal XML attribute naming rules.  Multiple data package nodes are allowed, and every data package can include multiple superkeys.  And, of course, every superkey can reference multiple subkeys.  Finally, you can include all of the mapping information for multiple ROXIE queries in this package map; just interleave the declarations in the same way as shown in the example.  Technically, the order in which everything is declared does not matter, but it’s a good idea to lay it all out in a readable format.

It is important to realize that superkey/subkey relationships defined by a package map do not show up in the Files view in ECL Watch.  In ECL Watch, package maps are viewed and managed in the Published Queries section instead.  The reason for this lies in the fact that package maps define not just superkey/subkey relationships, but also specifically the relationships of that data to ROXIE queries.  The Files tab in ECL Watch shows a global view of the data, while package maps show a ROXIE-specific view.

Related to the previous item, note that there is actually no requirement that a subkey in a package map be a member of a physical superkey (one that appears in ECL Watch).  The superkey/subkey relationship in a package map is defined solely for ROXIE’s benefit.  The subkey can actually be a plain index file, not associated with any physical superkey.

Now that we know what package maps are, we need to talk about managing them.

There are three mechanisms for working with package maps:

  1. ECL Watch.  In the Published Queries tab there is a sub-tab named Package Maps that exposes buttons for managing package maps.  The assumption here is that your package maps are saved as actual XML documents on your local drive, and they would be uploaded through the web interface to be made active.
  2. The ECL command line tool.  This tool has a packagemap subcommand that provides the same management capabilities as offered by ECL Watch.  It also assumes that package maps are stored as physical XML documents on your local drive.
  3. SOAP endpoints.  Like most management functions in an HPCC Systems cluster, there are SOAP endpoints for managing package maps.  In fact, both ECL Watch and the command line tool use the SOAP interfaces to perform their work.  To see the WSDLs and sample input forms, simply tack /WsPackageProcess/ onto the end of your ECL Watch URL (e.g. http://localhost:8010/WsPackageProcess/).

Let’s put all this package map stuff to the test.  Our toy ROXIE query is currently referencing a superkey named employee::id_lookup_super that includes a subkey named employee::id_lookup_v2 as a member.  Let’s assume we use a Thor job to create a new subkey, imaginatively named employee::id_lookup_v3.  Now we’ll create a package map file named toy_query_packagemap.xml on our local drive to represent the new mappings:

<RoxiePackages>
    <!-- id = name of Roxie query -->
    <Package id="toy_query">
        <!-- id = data package name -->
        <Base id="toy_query_data_pkg"/>
    </Package>
    <Package id="toy_query_data_pkg">
        <!-- id = superkey referenced in ECL -->
        <SuperFile id="~employee::id_lookup_super">
            <!-- id = actual subkey logical filename -->
            <SubFile value="~employee::id_lookup_v3"/>
        </SuperFile>
    </Package>
</RoxiePackages>

 

This package map tells ROXIE that the query toy_query uses one data package (toy_query_data_pkg). That data package contains one superkey (employee::id_lookup_super), and that superkey contains one subkey (employee::id_lookup_v3).

We’ll use the ecl command line tool to upload and activate the package map.  Here is the command:

 ecl packagemap add --activate roxie toy_query_packagemap.xml 

 

And… it works!  We’ve updated the toy query with a brand-new index without taking the query offline!  We can also delete the now-unused employee::id_lookup_v2 index file without worry.

This Is Way Too Simplistic

If the programming world were full of toy queries, we’d be finished.  Many tutorials and much programming documentation fall down at this point as well, assuming that once the basic theory is covered, you can figure out the more complicated stuff without a problem.

Let’s talk about real-world complications:

  • “Real” ROXIE queries are often far more complicated, sometimes using multiple indexes.  What happens if I want to update the data for just one index?
  • ROXIE can support hundreds of queries.  What if multiple queries access the same index?  What about some massive N x M combination of query and index dependencies?
  • What if a ROXIE query needs to be changed to use more, or fewer, or just different, indexes?  And that query needs to stay live the whole time?  Just to complicate things further, is it possible to retain the ability to quickly roll back to the old query if the new one doesn’t work out?
  • Superkeys are sometimes periodically updated with delta data, which take the form of additional subkeys.  A superkey can realistically contain only so many subkeys before performance degrades, at which point all of the subkeys need to be rolled up into a single subkey that will then replace all of those individual subkeys.  How do you do that while keeping the ROXIE query (or queries!) live?
  • Thor can also leverage index files for some operations, like JOIN.  How can we make sure it uses the same superkey contents as used by ROXIE when package maps apply only to ROXIE?

Believe it or not, package maps can help address all of those concerns.  But it would be helpful if the package map system were just a bit smarter….

Package Maps Take II: Package Map Parts

What was discussed earlier about package maps is really the original implementation.  It has some important limitations, but the major one was that any change to anything in the package map would require you to rebuild the entire thing and resubmit it.  That means retaining archived copies of package maps and editing them, or otherwise somehow reconstructing everything you needed to know on the fly.  In larger cluster environments, that quickly becomes quite complex and error prone.

To help address these limitations, “package map parts” were introduced in HPCC Systems version 6.0.0.  The executive summary for this new capability is that you can basically break up a monolithic package map into separate chunks (parts) and manage each part completely independently.  What a part actually contains is up to the programmer.  A part can contain only a single relationship mapping for one superkey or all of the mappings needed for all queries in a ROXIE cluster, just like with an original monolithic package map.

For maximum flexibility, we recommend4 that a package map part should define one of two things:

  • A mapping between a single ROXIE query and the data package(s) it uses.
  • A single data package defining one superkey and all of its member subkeys.

Using that recommendation, we should be able to break things down into manageable components:

  • For each superkey used by any ROXIE query, create a package map part that defines a data package.  To make it easier to manage things programmatically, generate a data package name that is based on the superkey’s logical filename (don’t use the exact superkey name, as it just looks confusing).  Only one package map part per superkey needs to be created, even if the superkey is used by multiple queries.
  • For each ROXIE query, create a single package map part that defines the mapping between that query and all of the data packages (superkeys) it needs.  Use the same naming scheme for the data package as used in the previous item.

Going back to our (very simple) toy query example, the monolithic package map would be broken out into two package map parts:  One for mapping the query to a data package, and one describing the contents of the data package.  Note that a part’s XML layout is nearly identical to a package map XML layout.

<!-- Part describing mapping between query and data package -->
<RoxiePackages>
    <!-- id = name of Roxie query -->
    <Package id="toy_query">
        <!-- id = data package name -->
        <Base id="employee_id_lookup_super_data_pkg"/>
    </Package>
</RoxiePackages>

<!-- Part describing data package contents -->
<RoxiePackages>
    <!-- id = data package name -->
    <Package id="employee_id_lookup_super_data_pkg">
        <!-- id = superkey referenced in ECL -->
        <SuperFile id="~employee::id_lookup_super">
            <!-- id = actual subkey logical filename -->
            <SubFile value="~employee::id_lookup_v3"/>
        </SuperFile>
    </Package>
</RoxiePackages>

 

Beyond just breaking the old package map apart, we also changed the name of the data package.  It is now employee_id_lookup_super_data_pkg, which is something that can be derived programmatically from the superkey name (~employee::id_lookup_super).  Remember, the data package name must match in both package map parts, or ROXIE won’t be able to find its data.
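
One way to derive that name in code is sketched below.  The function names (DataPackageName and DataPackagePart) and the naming rule itself are assumptions made for illustration; any scheme works as long as it is applied consistently in both parts.

```ecl
IMPORT Std;

// Hypothetical helper: '~employee::id_lookup_super' becomes
// 'employee_id_lookup_super_data_pkg'
DataPackageName(STRING superkeyPath) := FUNCTION
    noTilde  := Std.Str.FilterOut(superkeyPath, '~');       // drop the leading tilde
    flatName := Std.Str.FindReplace(noTilde, '::', '_');    // flatten the scope separators
    RETURN flatName + '_data_pkg';
END;

// Hypothetical helper: build the XML for a data package part that
// wraps one superkey and one subkey
DataPackagePart(STRING superkeyPath, STRING subkeyPath) :=
    '<RoxiePackages>' +
    '<Package id="' + DataPackageName(superkeyPath) + '">' +
    '<SuperFile id="' + superkeyPath + '">' +
    '<SubFile value="' + subkeyPath + '"/>' +
    '</SuperFile>' +
    '</Package>' +
    '</RoxiePackages>';

OUTPUT(DataPackageName('~employee::id_lookup_super'), NAMED('DataPackageName'));
OUTPUT(DataPackagePart('~employee::id_lookup_super', '~employee::id_lookup_v3'), NAMED('DataPackagePart'));
```

The first result should be employee_id_lookup_super_data_pkg, matching the name used in the parts above.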

Parts have their own set of management functions, separate from the package map management functions we talked about earlier.  You can manage parts with the same three tools — ECL Watch, the ecl command line tool, and SOAP endpoints — but the subcommands and endpoints are slightly different.

Assuming that those two parts are sitting on your local drive as separate XML files, as toy_query_map.xml and employee_lookup_pkg.xml respectively, we can use the following commands to upload and activate them:

ecl packagemap add-part roxie my_pkg toy_query_map.xml
ecl packagemap add-part roxie my_pkg employee_lookup_pkg.xml
ecl packagemap activate roxie my_pkg

 

Package map parts still need to be defined within a package map.  In this example we named that package map my_pkg.  If you inspect the package maps from within ECL Watch after running the above commands, you will find the parts under that name.

So now we have a package map (my_pkg) deployed to the cluster.  That package map serves as the container for any other part we define. We can add new parts, replace an existing part (thereby updating the mappings within it), or delete a part.  All of the original package map functions work as well, at the my_pkg level, so we can disable or delete the entire package map with one action if we choose.

It’s easy to see that, if the package map parts are defined at the right granular level, they become building blocks.  For example, if a new ROXIE query that uses an existing superkey is published, all you need to add is a single part that defines the mapping between that query and the already-existing data package wrapping the superkey.  To update the data used by both queries (without taking either query offline!) you need to update only the single package map part defining the data package.

Do We Have The Final Solution Yet?

A rather cynical rule of thumb is, “if a headline poses a question then the answer is usually no.”  That’s true here as well, thus proving that cynicism sometimes has its uses.

We posed five real-world complications earlier.  Two of them are completely solved just by using package map parts, provided the parts are defined at the recommended level of granularity:

  • “Real” ROXIE queries are often far more complicated, sometimes using multiple indexes.  What happens if I want to update the data for just one index?
  • ROXIE can support hundreds of queries.  What if multiple queries access the same index?  What about some massive N x M combination of query and index dependencies?

Both of these complications can be solved by treating the package map parts like Lego pieces to construct data scaffolding for ROXIE queries.  With the scaffolding in place, any update becomes a relatively simple matter of 1) building the new subkey(s) and then 2) updating a package map part.

The remaining three real-world complications are a little more, well, complicated.  Package map parts do provide the architectural underpinnings for each solution, but a little more knowledge and a little more work is needed.  Let’s review one of the scenarios:

  • What if a ROXIE query needs to be changed to use more, or fewer, or just different, indexes?  And that query needs to stay live the whole time?  Just to complicate things further, is it possible to retain the ability to quickly roll back to the old query if the new one doesn’t work out?

Most of the solution for this scenario falls out of understanding how ROXIE integrates package map information during its file resolution actions.

When a package map or part is updated, ROXIE applies its contents to the current suite of published queries and makes any necessary adjustments (metadata and cache updates, file locking or unlocking, etc).  The same adjustments are applied when a query is published or unpublished, again after consulting any package maps in Dali.

The consequences of having an information mismatch between what ROXIE queries expect and what package maps supply can be detrimental in some cases.  A query may fail to resolve a superkey and may then either deliver empty results5 or return an error.  Package maps can, however, supply more information than what is needed.  For instance, they can include definitions and mappings that don’t apply to any active query, or provide more data package definitions than are strictly needed.  A validation check would show a warning in that case, but it is not an error.

Unrelated to package maps, recall how ROXIE queries are named and published.  One way to create a query is to target ROXIE in your favorite IDE and compile (not submit) the query’s ECL code.  At that point you can publish the compiled workunit and make it active.  If a query with the same name was already active, the old query will be disabled and the new query will immediately be made active.  Both queries still exist in ROXIE, but only one is handling requests.

Now we have the pieces for the solution:

  1. Create the new version of your ROXIE query, reusing superkeys from the old query, dropping superkeys, or referencing new superkeys.  Compile this query, but don’t publish it yet.
  2. Create and submit package map parts representing any new superkeys (data packages).
  3. Modify the package map part that maps the query name to the data packages it uses.  What you want to do here is include every data package used by both the old and new queries.  Submit this part so that it becomes active.  ROXIE will resolve the new part against the current query set, including the original query, and find everything it needs.  The original query will continue handling requests, unchanged.
  4. Publish the new ROXIE query under the same name as the old query.  ROXIE will find everything it needs to resolve the new query’s superkeys in the package map.  There may be more there, like superkeys used in the old query and not the new query, but that’s okay.  Because the new query has the same name, the old query will be disabled and the new one will take over its requests right away.
  5. If the new query turns out to be a dud, you can reenable the old version without recompiling or republishing.  ROXIE will work out the old superkey dependencies based on the package map information and reactivate the old query without question.

You Missed A Couple Of Use Cases

Patience!  We’re getting to the good part.  Here are the last two real-world scenarios:

  • Superkeys are sometimes periodically updated with delta data, which take the form of additional subkeys.  A superkey can realistically wrap only so many subkeys before performance degrades, at which point all of the subkeys need to be rolled up into a single subkey that will then replace all of those delta updates.  How do you do that while keeping the ROXIE query (or queries!) live?
  • Thor can also leverage index files for some operations, like JOIN.  How can we make sure it uses the same superkey contents as used by ROXIE when package maps apply only to ROXIE?

These two are somewhat related, and solving them involves a bit of new information.  First, however, we need to review a couple of things.

Actually creating index files is an activity that takes place in Thor.  If an index is going to be periodically appended with new information, the canonical way to do that is to use a superkey and append subkeys to it. Thor will need access to the superkey, so it needs to be a physical superkey (i.e. one created via Std.File.CreateSuperFile() or ECL Watch).  Thor also needs this access in order to periodically roll up the subkeys for performance reasons,6 or if a Thor job ever needs to use the superkey in a JOIN or some other operation.

Also, keep in mind that file locking is still an ongoing concern.  ROXIE will establish locks on all superkeys (and therefore subkeys) a published query references.  If you stop and think about that for a minute, it would seem that we’re right back to square one.  ROXIE would lock the superkey that Thor is using, which also ties up all of the member subkeys, and that makes it impossible for Thor to update the superkey.  That was the problem from the second paragraph of this blog post.  Actually, it is somewhat worse7 if you consider the real-world complication of having multiple queries accessing multiple, overlapping superkeys.

So here is the new bit of information that is the key to the solution:

Superkeys in package maps don’t have to actually exist.

Remember, only ROXIE sees package maps.  The rest of the cluster has no notion of a package map and, therefore, doesn’t care what’s in it.  Well, Dali is technically the reader and parser of package maps, but it just processes XML.  ROXIE is the process that actually cares about what is in a package map.

When ROXIE resolves file dependencies between queries and package maps, it uses the logical filenames found in each to match things up.  An INDEX() declaration in a query references some logical filename, so ROXIE picks it up and tries to find a corresponding logical filename in a package map.  If it finds one and it’s listed as a superkey then all of the subkey logical filenames are collected as well.  File locks are instantiated on everything collected.

Consider the package map part we last used for our toy query:

<!-- Part describing data package contents -->
<RoxiePackages>
    <!-- id = data package name -->
    <Package id="employee_id_lookup_super_data_pkg">
        <!-- id = superkey referenced in ECL -->
        <SuperFile id="~employee::id_lookup_super">
            <!-- id = actual subkey logical filename -->
            <SubFile value="~employee::id_lookup_v3"/>
        </SuperFile>
    </Package>
</RoxiePackages>

 

When that part is processed by ROXIE, two logical filenames will acquire locks: ~employee::id_lookup_super and ~employee::id_lookup_v3.  In that example, ~employee::id_lookup_super also happens to be a physical superkey, visible in ECL Watch.  Because file locks are based on names, the actual physical superkey will be locked and no Thor job will be able to modify its contents.

If superkeys don’t need to exist, what happens if we use a made-up logical filename instead?

<!-- Part describing data package contents -->
<RoxiePackages>
    <!-- id = data package name -->
    <Package id="employee_id_lookup_super_data_pkg">
        <!-- id = virtual superkey referenced in ECL -->
        <SuperFile id="~virtual::employee::id_lookup_super">
            <!-- id = actual subkey logical filename -->
            <SubFile value="~employee::id_lookup_v3"/>
        </SuperFile>
    </Package>
</RoxiePackages>

 

We would also change the ECL in our toy query:

// Definition of the actual index file
employeeIdx := INDEX
    (
        {EmployeeRec.employee_id},
        {EmployeeRec},
        '~virtual::employee::id_lookup_super'
    );

 

Keep in mind that ~virtual::employee::id_lookup_super exists only in the package map and in the ROXIE query; nowhere else.  We’ll call it a “virtual superkey.”  Here is the new state of the world:

From ROXIE’s point of view:  Some logical filename was used in ECL, and I found the same filename defined as a superkey in the package map.  Collect the subkey logical filenames and lock everything.  We’re good to go!

From Thor’s point of view:  This subkey is locked by ROXIE and by a physical superkey that I know about.  The physical superkey itself isn’t locked, so I’m free to change its contents.

From a global point of view:  One subkey is locked by both a physical superkey and a ROXIE query.  A virtual superkey is also locked by the same ROXIE query.

We now have everything behaving correctly, and we even have an unlocked physical superkey that Thor can use to manage the subkeys.  Successfully updating that superkey, with its contents locked by ROXIE, is straightforward but requires careful coordination of steps to tiptoe around those locks.

Let’s consider a rollup scenario, where all of the subkeys in the superkey need to be combined into one subkey and then the new subkey completely replaces the old contents.  This is all in the context of a running system, where a ROXIE query is handling requests using the superkey’s current contents.  All of the following is performed by a Thor job:

  1. Read the current index data into a dataset, using the physical superkey as the logical filename.
  2. Perform any needed data preparation, then create a single new index file using the ECL BUILD() function.
  3. Create a new package map part that describes the data package for this superkey.  The virtual superkey’s logical filename should be used, and its contents should include only the new subkey just created.
  4. Send the package map part to the system.  This triggers several events behind the scenes:8
    1. ROXIE will release all of its file locks on the old subkeys.
    2. ROXIE will acquire a lock on the new subkey.
    3. The ROXIE query will immediately begin serving data from the new subkey.
  5. At this point, the contents of the physical superkey are not used (or locked) by the query.  Using the Std.File.ClearSuperFile() function, clear the contents of the physical superkey.  This removes the remaining file locks from the old subkeys (and optionally deletes the subkeys as well, depending on the arguments to that function).
  6. Using the Std.File.AddSuperFile() function, add the new subkey to the physical superkey.  This creates a new file lock on the subkey to preserve that relationship.  The new subkey is now accessible through the physical superkey for Thor.

Ta-da!
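
The steps above can be sketched as a single Thor job.  Treat this as an illustration under stated assumptions, not a finished implementation.  The record layout and the ~employee::id_lookup_v4 filename are hypothetical.  The package map update (steps 3 and 4) is represented by a placeholder action, because the mechanism for sending it is not shown in this article.  The 500 ms pause follows note 8.

```ecl
IMPORT Std;

// Hypothetical record layout; the real EmployeeRec belongs to your query code
EmployeeRec := RECORD
    UNSIGNED4   employee_id;
    STRING50    full_name;
END;

physicalSuperkey := '~employee::id_lookup_super';
newSubkey        := '~employee::id_lookup_v4';   // hypothetical name for the new subkey

// Step 1: read the current index data through the physical superkey
currentIdx := INDEX
    (
        {EmployeeRec.employee_id},
        {EmployeeRec},
        physicalSuperkey
    );

// Step 2: data preparation is application-specific; here we only copy the rows
rolledUp := PROJECT(currentIdx, TRANSFORM(EmployeeRec, SELF := LEFT));
newIdx   := INDEX(rolledUp, {employee_id}, {full_name}, newSubkey);

// Steps 3 and 4: placeholder only.  A real job would send a package map part
// that maps the virtual superkey to newSubkey (for example via a SOAP call).
sendPackageMapPart := OUTPUT('package map part would be sent here', NAMED('package_map_update'));

SEQUENTIAL
    (
        BUILD(newIdx),                                       // step 2
        sendPackageMapPart,                                  // steps 3 and 4
        Std.System.Debug.Sleep(500),                         // let ROXIE finish its asynchronous work
        Std.File.ClearSuperFile(physicalSuperkey),           // step 5 (old subkeys are kept, not deleted)
        Std.File.AddSuperFile(physicalSuperkey, newSubkey)   // step 6
    );
```

SEQUENTIAL matters here because each action depends on the one before it having completed.  Depending on how your jobs are organized, you may prefer to split the read-and-build portion and the superkey maintenance into separate workunits.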

Are We There Yet?

No.

Well, actually, we are.

It should be stressed that all of the above is really just one way to (thoroughly) handle the scenario of live ROXIE data updates.  Package map parts play the key role.  The rest of this implementation is just detail, more or less, and learning how to shuffle data around file locks.

File locks are a larger problem when an HPCC Systems cluster contains only one Dali process.  Since Dali is where file locks are maintained, the chance of lock contention is relatively high, as we’ve seen.  Larger clusters tend to use at least two Dali processes, typically one for each execution environment (Thor and ROXIE, for example).  This reduces the chance of lock contention to nearly zero.

Package maps create new possibilities for data management.  One example is generational data management (where you have the concept of “current data” that is easily archived for a fixed number of generations, and then possibly restored if needed).  An ECL bundle has been written around that very concept:

https://github.com/hpcc-systems/DataMgmt

Within the source code, intrepid ECL programmers can find examples of how to manage package map parts strictly from ECL via SOAP calls (see GenIndex.ecl in that repo).

Thanks for reading.


Notes: 

1 You can force ROXIE to not lock referenced files by wrapping the logical filename in a DYNAMIC() declaration.  That really hurts performance, though.  For every query, ROXIE needs to resolve the logical filename, load metadata, initialize caches, and perform file locking.  And then of course undo all of that when the query completes.
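
As a rough illustration of note 1, the index declaration from the toy query could be written with DYNAMIC() as shown here.  The record layout and the employee ID in the filter are hypothetical stand-ins.

```ecl
// Hypothetical record layout; the real EmployeeRec belongs to your query code
EmployeeRec := RECORD
    UNSIGNED4   employee_id;
    STRING50    full_name;
END;

// DYNAMIC() asks ROXIE to resolve the filename when the query runs
// rather than when the query is published
employeeIdx := INDEX
    (
        {EmployeeRec.employee_id},
        {EmployeeRec},
        DYNAMIC('~employee::id_lookup_super')
    );

// Hypothetical lookup of a single employee
OUTPUT(employeeIdx(employee_id = 42));
```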

2 As of HPCC Systems version 5.0.0, it is possible to simply add subkeys to an existing superkey used by a ROXIE query in realtime without worrying about file locks.  If you had only new data to add to the superkey, you could simply build a new index file (with a unique name) and add it to the superkey.  The live query will immediately “see” the new subkey.
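
The append-only case in note 2 might look something like the following sketch.  The record layout, the sample rows, and the ~employee::id_lookup_v4 filename are all hypothetical.  It also assumes the query reads the superkey named in the earlier examples.

```ecl
IMPORT Std;

// Hypothetical record layout; the real EmployeeRec belongs to your query code
EmployeeRec := RECORD
    UNSIGNED4   employee_id;
    STRING50    full_name;
END;

// Hypothetical new records; in practice these come from your import job
newData := DATASET
    (
        [
            {1001, 'Ada Example'},
            {1002, 'Grace Example'}
        ],
        EmployeeRec
    );

newSubkey := '~employee::id_lookup_v4';   // must be a unique logical filename
newIdx    := INDEX(newData, {employee_id}, {full_name}, newSubkey);

SEQUENTIAL
    (
        BUILD(newIdx),                                                   // create the new index file
        Std.File.AddSuperFile('~employee::id_lookup_super', newSubkey)   // append it to the superkey
    );
```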

3 The XSD for a package map requires at least one <SubFile> node within a <SuperFile> tag.  If you find yourself creating a mapping that includes a superkey with no subkeys, be sure to include an empty <SubFile/> tag to satisfy XSD validation.

4 As mentioned earlier, there are many ways to define package map parts.  Other schemes may yield better results, or ease maintenance, depending on specific use cases.  We believe the scheme recommended here is a good compromise between flexibility and maintenance.

5 This can happen if the query’s INDEX() declaration uses the OPT keyword.

6 You don’t strictly need to group subkeys under a superkey in order to roll up the data, but it is the easiest method by far because you need only one ECL statement to gather all of the subkeys’ data.  Other methods would require you to either track the subkeys’ logical filenames yourself or discover them via some other mechanism.  Why do that when superkeys give you that capability automatically?

7 Actually, intractable.

8 Most of the “behind the scenes” activity is actually performed asynchronously.  Unfortunately, subsequent data management steps require that the asynchronous activity be completed first.  If you are writing code to do all this, you will need to insert a short (500 ms) delay after sending the package map update to the cluster to give the asynchronous tasks time to complete.