Refactoring the transactional model in the DFS

HPCC’s distributed file system has the concept of SuperFiles, a
collection of files with the same format, that is used to aggregate
data and automate disk reads.

The operations you can perform on a SuperFile are the usual ones for every
file (create, remove, rename) and every collection (add/remove
children, etc.). And with that, the concept of transactions becomes
very important. If you’re adding three subfiles and the last one
fails, you want the first two removed again. More importantly,
if you are deleting files and one fails, you want them back, so the
user can try again, maybe with a different strategy. All in all, what
you don’t want is to lose data.

How files are handled

In the DFS, files are tree nodes (much like Inodes) with certain
properties, in a Dali server (our central file server). The actual
data is spread over Dali slaves (FileParts) to maximise IO efficiency.

However, Dali controls much more than just files; it controls
information and how we access it. Multiple queries can simultaneously
access a file to read, but once a file has to be written, all other
queries must stop and wait. But because most file usage is temporary,
the requirement is that write-locks can only be given if there is no
other read-lock on that file.

While this is true for Thor (our data-crunching engine), it’s not for
Roxie (our fast-query engine), so some queries that would work in Thor
can fail in Roxie.

Also, when dealing with multiple files at the same time, you end up
locking them all, stopping you from ever getting a write-lock. If
you’re read-locking the same C++ object several times, then you can
change it to a write lock, but if two different objects (on the same
thread, on the same logical operation) have a read-lock, you’re stuck.
Making sure you have the same objects when dealing with the same files
in the same context is no easy task, so problems like these were dealt
with by changing the properties directly when you were sure you could.
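
To make the locking rule concrete, here is a minimal sketch in C++. The
LockTable class and its methods are hypothetical (this is not the Dali
API), and an owner is identified by the address of the object holding the
lock, to mimic "the same C++ object" in the text above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical lock table, for illustration only.
class LockTable {
public:
    // Read-locks are shared, but refused while another owner is writing.
    bool readLock(const std::string &file, const void *owner) {
        assert(owner != nullptr);
        auto w = writers_.find(file);
        if (w != writers_.end() && w->second != owner)
            return false;
        readers_[file].insert(owner); // repeated locks by one owner collapse
        return true;
    }

    // A write-lock is granted only if no *other* owner holds a read-lock.
    // The sole reader may upgrade its own read-lock.
    bool writeLock(const std::string &file, const void *owner) {
        assert(owner != nullptr);
        auto w = writers_.find(file);
        if (w != writers_.end())
            return w->second == owner;
        auto r = readers_.find(file);
        if (r != readers_.end()) {
            for (const void *reader : r->second)
                if (reader != owner)
                    return false; // a different object is still reading
            readers_.erase(r);    // upgrade: read-lock becomes write-lock
        }
        writers_[file] = owner;
        return true;
    }

    // Releases whatever locks this owner holds on the file.
    void unlock(const std::string &file, const void *owner) {
        auto r = readers_.find(file);
        if (r != readers_.end()) {
            r->second.erase(owner);
            if (r->second.empty())
                readers_.erase(r);
        }
        auto w = writers_.find(file);
        if (w != writers_.end() && w->second == owner)
            writers_.erase(w);
    }

private:
    std::map<std::string, std::set<const void *>> readers_;
    std::map<std::string, const void *> writers_;
};
```

With this sketch, one object that read-locks a file twice can still
upgrade to a write-lock, while two different objects read-locking the same
file block each other's upgrade, which is the "stuck" case described above.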

That led to a bloat in code (multiple repetitions of slightly similar
code), and multiple types of locks (lockProperties, lockTransaction)
that would do the same thing, only differently (if that makes any
sense).


If you back-track and analyse what a transaction is, you can see that
it solves most of the problems above. First, a transaction is an
atomic operation, where either all or nothing happens. That was
already guaranteed by the current transactional model (albeit with
some bloating). But a transaction has to be protected from the outside
world and vice-versa. If you create a file within a transaction, the
file must only exist in your transaction. If you delete a file, it can
be physically deleted only if the transaction is successful. So,
creation and deletion of files must also be done within transactions.
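
A minimal sketch of that isolation, assuming an in-memory set of file
names as a stand-in for the physical store. The Transaction class and its
methods below are hypothetical and only illustrate the idea; they are not
the DFS classes.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the physical file store.
using FileStore = std::set<std::string>;

class Transaction {
public:
    explicit Transaction(FileStore &store) : store_(store) {}

    // A created file is visible only inside the transaction until commit.
    void createFile(const std::string &name) {
        assert(!name.empty());
        removed_.erase(name);
        created_.insert(name);
    }

    // A removed file is only marked; the physical delete waits for commit.
    void removeFile(const std::string &name) {
        assert(!name.empty());
        created_.erase(name);
        removed_.insert(name);
    }

    // The view from inside the transaction.
    bool exists(const std::string &name) const {
        if (removed_.count(name) > 0)
            return false;
        return created_.count(name) > 0 || store_.count(name) > 0;
    }

    // Only now does the outside world see the changes.
    void commit() {
        for (const auto &name : created_)
            store_.insert(name);
        for (const auto &name : removed_)
            store_.erase(name);
        rollback(); // nothing left pending
    }

    // Nothing was written yet, so forgetting the pending changes is enough.
    void rollback() {
        created_.clear();
        removed_.clear();
    }

private:
    FileStore &store_;
    std::set<std::string> created_;
    std::set<std::string> removed_;
};
```

Until commit() is called, the store is untouched, so a failure halfway
through loses no data.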

Transactions also provide us with a very clear definition of a
process. A process is whatever happens within a transaction. This is
very simple, but very powerful, because now we can safely say that all
objects referring to the same file in a transaction *must* be the
same, AND all objects referring to the same file on *different*
transactions must be different. And, since transactions have their
file cache already, that’s the cure for rogue read-locks preventing
write-locks.
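
The object-identity rule can be sketched with a per-transaction cache. The
File struct and the lookup() method are hypothetical names used only for
this illustration.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical file object; the real DFS classes differ.
struct File {
    std::string name;
};

class Transaction {
public:
    // Every lookup goes through the transaction's own cache, so the same
    // name always yields the same object within one transaction.
    std::shared_ptr<File> lookup(const std::string &name) {
        assert(!name.empty());
        std::shared_ptr<File> &slot = cache_[name];
        if (!slot) {
            slot = std::make_shared<File>();
            slot->name = name;
        }
        return slot;
    }

private:
    std::map<std::string, std::shared_ptr<File>> cache_;
};
```

Two lookups of "x" in one transaction return the same pointer, while a
second transaction gets its own, separate object for "x".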

Current Work

The work that has been done over the last few months has cleaned up a
lot of the bloat and duplication, and has migrated more file actions into
transactions. It has also removed code that handled properties directly
(rather than using the file API), and so on. But there’s still a lot
to do to get to a base where transactions can become first-class
citizens.

The short-term goal is to make every file action happen as part of
a transaction, but we can’t force all other parts of the system to
use transactions, so we had to add temporary local transactions
to the DFS functions to make up for it. We could have changed the
rest to use transactions, but since not all actions are performed
within transactions yet, that would lead to even more confusion.

So, until we have all file actions within transactions, each action
will have its own local transaction created, if none was provided.
Actions will be created, executed and automatically committed, as if
they were part of a normal transaction. That adds a bit of bloat of
its own, but once all actions are converted, the code for each will be
very simple:

// ...some validation code...
Action *action = createSomeAction(transaction, parameters);

The auto-commit will only commit if the transaction is inactive.
Otherwise, we'll wait until the user calls for "commit" in her code.
Simple as that.
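
As a rough sketch of that behaviour, assuming that "active" means the user
explicitly started the transaction: the Transaction class, removeFile()
and every other name below are hypothetical and not the actual DFS code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical transaction; actions are recorded as plain strings.
class Transaction {
public:
    void start() { active_ = true; } // the user takes control
    bool active() const { return active_; }

    void addAction(const std::string &description) {
        pending_.push_back(description);
    }

    // Applies all pending actions and ends the transaction.
    void commit() {
        committed_.insert(committed_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        active_ = false;
    }

    // Commits only if the user has not started the transaction.
    void autoCommit() {
        if (!active_)
            commit();
    }

    std::size_t pendingCount() const { return pending_.size(); }
    const std::vector<std::string> &committed() const { return committed_; }

private:
    bool active_ = false;
    std::vector<std::string> pending_;
    std::vector<std::string> committed_;
};

// A DFS-style function: it uses the caller's transaction if one is given
// and a temporary local one otherwise.
// Returns true if the action was committed before returning.
bool removeFile(const std::string &name, Transaction *transaction = nullptr) {
    assert(!name.empty());
    Transaction local;
    Transaction *t = transaction ? transaction : &local;
    t->addAction("remove " + name);
    t->autoCommit();
    return t->pendingCount() == 0;
}
```

Calling removeFile("a") on its own commits straight away through the local
transaction; passing a started transaction leaves the action pending until
the caller commits.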

Long-term goals

The long-term goal is to simplify file access to a point where the API
looks as clean as:

DFSAccess dfs; // rollback on destructor
dfs.addSubFile("a", "x");
dfs.addSubFile("a", "y");
dfs.addSubFile("a", "z");

or even simpler:

if (!AutoCommitDFSAccess(user).addSubFile("a", "x")) { // commits on destructor

DFSAccess is an object that knows who you are, where you are and what
you're doing. It uses the "user" object to access restricted objects,
it has an intrinsic transaction that starts on the constructor and
rolls back on the destructor, unless you commit first. Of course, you
can start and stop several transactions within the life-time of the
object, and even keep it safe as member of another class, or as a
global pointer.
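
Since this API does not exist yet, the following is only a sketch of the
constructor/destructor behaviour. It assumes an in-memory map in place of
Dali, leaves out the "user" object, and passes the store to the
constructor; all of these are simplifications made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for SuperFile metadata: superfile -> subfiles.
using SuperFiles = std::map<std::string, std::vector<std::string>>;

class DFSAccess {
public:
    // The transaction starts here: all changes go to a working copy.
    explicit DFSAccess(SuperFiles &store) : store_(store), working_(store) {}
    DFSAccess(const DFSAccess &) = delete;
    DFSAccess &operator=(const DFSAccess &) = delete;

    // Rolls back on destruction, which is a no-op after a commit.
    virtual ~DFSAccess() { rollback(); }

    // Returns false if the subfile is already part of the superfile.
    bool addSubFile(const std::string &super, const std::string &sub) {
        assert(!super.empty() && !sub.empty());
        std::vector<std::string> &subs = working_[super];
        if (std::find(subs.begin(), subs.end(), sub) != subs.end())
            return false;
        subs.push_back(sub);
        return true;
    }

    void commit() { store_ = working_; }   // publish the working copy
    void rollback() { working_ = store_; } // discard uncommitted changes

private:
    SuperFiles &store_;
    SuperFiles working_;
};

// Variant that commits, instead of rolling back, on destruction.
class AutoCommitDFSAccess : public DFSAccess {
public:
    using DFSAccess::DFSAccess;
    ~AutoCommitDFSAccess() override { commit(); }
};
```

A DFSAccess that goes out of scope without commit() leaves the store
untouched, while a temporary AutoCommitDFSAccess publishes its change at
the end of the expression that created it.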

It doesn't matter how you use it, it should do the hard work for you if
you're lazy (or just want a quick access), and provide you with
complete control over the file-system access if you desire. That means
hiding *all* property manipulation, making sure the right logic goes
into the right place, without the need of refactoring the whole
platform, since everyone else will be using the DFSAccess API.

The main ideas to provide a simple, clean API, that makes it clear
what you're doing are:

  • Simplify DFS calls, i.e. move file-system code up the API,
  • Protect file properties and Dali locks from non-DFS code,
  • Hide the concept of transactions from the user, unless they really want it,
  • Enforce the use of transactions on *every* DFS action, even if the
    user doesn't need it,
  • Provide some control over transactions if the user *really* needs it.

URI naming system

Another long-term goal (that is becoming shorter and shorter) is to
allow URI-based file name resolution. There are far too many file
naming styles in HPCC, and most of them can be used interchangeably.
For instance, "regress::mydir::myfile" is a Dali path, but if you're
resolving files locally (i.e. with no Dali connected), it turns
into a local file. This is a powerful feature, I agree, but what
happens if I was expecting to get a Dali file, and there is a local
file that is older than the one in Dali (with the wrong contents)? The
program will not fail, although it should. Debugging problems like these
takes time and produces a lot of grey hair.

The idea is to use URIs to name files. So, if you use a generic name
like "hpcc:///regress/mydir/myfile", it can mean anything, including
local files. If you specify that this needs to be a Dali file like
"hpcc://mydali/regress/mydir/myfile", then it'll fail if Dali is not
connected. Local files can be named "file:///var/local/foo/file",
pretty much the same way other URIs work. Web files can also be opened
(read-only) using normal URLs.
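
A small sketch of how such names could be split and classified. The
FileURI struct, the Source enum and both functions are hypothetical, and
the Dali connection is reduced to a boolean flag for the example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Where a file may be resolved from (hypothetical classification).
enum class Source { Any, Dali, Local, Web };

struct FileURI {
    std::string scheme;
    std::string authority; // e.g. the Dali name; empty in "hpcc:///..."
    std::string path;
};

// Splits "scheme://authority/path"; throws on anything else.
FileURI parseURI(const std::string &uri) {
    const std::string sep = "://";
    const std::string::size_type schemeEnd = uri.find(sep);
    if (schemeEnd == std::string::npos || schemeEnd == 0)
        throw std::invalid_argument("not a URI: " + uri);
    const std::string::size_type authorityStart = schemeEnd + sep.size();
    const std::string::size_type pathStart = uri.find('/', authorityStart);
    if (pathStart == std::string::npos)
        throw std::invalid_argument("URI has no path: " + uri);
    return {uri.substr(0, schemeEnd),
            uri.substr(authorityStart, pathStart - authorityStart),
            uri.substr(pathStart)};
}

// Applies the rules from the text: a named Dali must be reachable, while
// a generic hpcc name may resolve to anything, including local files.
Source resolve(const FileURI &uri, bool daliConnected) {
    assert(!uri.path.empty()); // parseURI always yields a path
    if (uri.scheme == "file")
        return Source::Local;
    if (uri.scheme == "http" || uri.scheme == "https")
        return Source::Web; // read-only
    if (uri.scheme != "hpcc")
        throw std::invalid_argument("unknown scheme: " + uri.scheme);
    if (uri.authority.empty())
        return Source::Any;
    if (!daliConnected)
        throw std::runtime_error("Dali not connected: " + uri.authority);
    return Source::Dali;
}
```

The point of the design is the explicit failure: asking for a file on
"mydali" without a Dali connection throws instead of silently falling back
to a local file.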

We're also thinking of adding more complex logic to the resolution of
files. For instance, as of today, HPCC (master) has the ability to
deal with Git files (files in a local Git repository), and archived
files (zip, tar, etc.). So, we can treat the URI as telling us what
type each file is, if the extension is not obvious enough. Both
"hpcc://mydali/mydir/" and
"file:///home/user/nas/content/live/hpccsystems/.git/dir/whatever/file"
can automatically be recognized as Zip and Git files, and so would
"hpcc://mydali/mydir/myfile?format=zip", for instance.

In order to do that, the file resolution has to be unified under
another API, with logic to orthogonally resolve the different
protocols (hpcc, http, file, etc.) and file types (ecl, zip, git, xml,
csv, etc.). But that is enough material for a separate post in itself.
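
In the meantime, here is a rough sketch of the file-type half of that
logic. The fileFormat() function is hypothetical, and the precedence it
uses (an explicit "format" parameter first, then a ".git" path component,
then the extension) is an assumption made for the example, not a decided
design.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Guesses the file type from a URI. Returns "" when nothing matches.
std::string fileFormat(const std::string &uri) {
    assert(!uri.empty());
    const std::size_t q = uri.find('?');
    const std::string path = uri.substr(0, q);

    // 1. An explicit "format=" query parameter wins.
    if (q != std::string::npos) {
        const std::string key = "format=";
        const std::string query = uri.substr(q + 1);
        std::size_t start = 0;
        while (start < query.size()) {
            std::size_t end = query.find('&', start);
            if (end == std::string::npos)
                end = query.size();
            if (query.compare(start, key.size(), key) == 0)
                return query.substr(start + key.size(),
                                    end - start - key.size());
            start = end + 1;
        }
    }

    // 2. A ".git" directory in the path marks a Git file.
    if (path.find("/.git/") != std::string::npos)
        return "git";

    // 3. Otherwise fall back to the extension of the last path component.
    const std::size_t slash = path.rfind('/');
    const std::size_t dot = path.rfind('.');
    if (dot != std::string::npos &&
        (slash == std::string::npos || dot > slash))
        return path.substr(dot + 1);
    return "";
}
```

Because the protocol is handled separately from the file type, the same
function works for hpcc, file and http names alike.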