Blogs

Step 1: Boot it Up!

So you are energized about HPCC Systems and ready to take the leap and "Jump In!"

There are three things you will need to do.

  1. Install the server components and boot up your own supercomputer.
  2. Load up some data.
  3. Install the ECL programming interface and "do something!"

This post only deals with Step 1: Boot it Up.

The obvious questions are...

"From a standing start, how easy is it to install and boot up the HPCC Systems platform?"
"Do I need to be a Linux expert?"
"How long does it take to be up and running?"

All good questions. The answers in order are "Easy", "No" and "Less than 5 mins" (for a one node).

When you want to scale up to a multinode environment there is some additional configuration but for the purposes of "Jumping in" you don't need to care. This is because of two key reasons:

  1. A suitable one node with enough memory and processors is almost always sufficient for initial data science projects.
  2. Any ECL you hack together will be the same regardless of whether it is a 1 node or a 1000 nodes!

So a one node is a great place to start small; scaling up from there to a massive multi-node supercomputer is easy enough.

For the purpose of booting up a single node HPCC System, you can download and boot up the HPCC Systems VM or if you have spare 64bit hardware burning a hole on your back shelf somewhere (you know you do!).

I will install it by hand on an Ubuntu 12.04 desktop so you can see for yourself, without smoke and mirrors. It is so easy even a 10 year old can install it, so my youngest daughter has kindly agreed to narrate it for me to spice things up for fun!

There are 5 simple steps to install the platform:

  1. Download the platform install from the
    HPCC Systems website.
  2. Install the platform: sudo dpkg -i hpcc...
  3. Resolve dependancies: sudo apt-get install -f
  4. Start the platform: sudo service hpcc-init start
  5. Open ECLWatch web admin tool.

There you have it! Your own supercompute platform in 5 mins or less!

Formal installation instructions can be found at Installing and running the HPCC Systems platform.

So what next? Step 2: Load some data! Stay tuned!

The shortest path to "doing big data" is HPCC Systems

For those who don't know me, here is a brief intro in the context of this blog post.

Some people say I talk a lot!? Just so you know, I am not all talk. I explore, I hack and I get it done; more often than not, I am on my own end to end on the getting it done side of things.

Einstein said, "Curiosity has its own reason for existing". I am intensely curious and I am always looking for new insights in big data. I am probably most interested as a closet sociologist; I am curious about the behaviour of people and social groups and the faint footprints in data that are left when they coordinate activities and events (good and bad events!).

I am fortunate to be surrounded by some of the brightest big data technology innovators and data scientists who were doing unbelievable big data hacks to do everything from helping find serial killers and snipers a decade ago to combating massive fraud rings right now. We are in the business of doing, getting results and ultimately helping generate revenue!

If you are big data curious and you are looking at the emerging technologies, of which there are many, it probably seems like a steep learning curve and a lot of effort just to get to the point of being able to "do something".

Do you need to learn Python, R and/or Java? What technologies do you need to install and configure to spin up a data intensive supercomputer?

All these questions are important, but they can distract from the business of doing.

The best way to get started in data science is to DO data science!

If you have big data, you are curious and you want to get stuck in, try HPCC Systems. If you don't have big data, I will show you where to download some and get stuck in.

Three reasons:

  1. One ecosystem that seamlessly gives you all the functionality of a bunch of alternative technologies.
  2. One programming language purpose built for data intensive computing, perfect for data science. ECL
  3. Only 1 person required, YOU!

There are many reasons why the shortest path to "doing big data" is HPCC Systems but that discussion is for another post and another day.

Today is about you, big data and taking the first step! Join our 'working smart' community! Regardless of whether you are a student, startup engineer or you work in an existing enterprise, HPCC Systems is a great platform to scale up from zero to hero!

Start small, download the VM and JUMP IN!

HPCC Systems platform Roadmap and Happy Holidays!

As the end of the 2012 calendar year approaches, at least for a good chunk of the world (and may come to an end on 12/21 for a crazy bunch), some people start celebrating holidays in different cultures and countries. I consider this season a good time to go over things to come in the HPCC Systems platform arena.

Our 3.8 release is out (3.8.6-4 can be downloaded from here) and, while there may still be minor bug fixes (no, there are never bugs, just "under-appreciated features" that we may want to get rid of), 3.10 is well in the works. 3.10, which should be released in the next few weeks, changes the open source license to Apache 2.0 and brings a number of enhancements to the platform in different areas, as you can see from the commit history in our GitHub source code repository.

4.0, the next major release, is already in the plans and will bring a number of exciting new features, including improvements to our ECL Playground (if you haven’t played with it, I strongly recommend it), significant improvements to ECL watch, support for very fast linear algebra through PB-BLAS (not to be confused with PBLAS) and some other interesting developments, such as more thorough integration with R and reporting tools through external connectors, improved SQL/JDBC connectivity on dynamic Roxie queries for interactive reporting tools, and improvements around documentation and usage examples.

In the next few weeks, also expect to see improvements to our Portal, with the addition of a Wiki for collaborative documentation and a description of our general HPCC Systems roadmap and ongoing projects, to help community members decide if they’d like to join any of these efforts. As part of this move, we are planning to include specific projects that could be good starters for some community members, so please let us know if you would like to tackle any of those.

During 2013, we will be actively working to continue raising awareness for the HPCC Systems platform, and will be specifically focused on community building activities (details coming later). And, from the Exciting Training Department, we are currently working on creating a significant amount of materials for our upcoming MOOC (Massive Open Online Courses), which will help you learn everything that you ever wanted to know about ECL, but were afraid to ask.

And now the shameless plug: Trish and I are tasked with the organization of the first Big Data track as part of the 2013 Symposium on Collaborative Technologies, in cooperation with ACM and IEEE, to happen in May, in San Diego, so please feel free, and a little bit compelled? :) to submit papers, present posters and let us know of any other way that you may want to help.

Now go and enjoy with your families and have a great holiday season! Happy Hacking!

HPCC Systems & Hadoop – a contrast of paradigms

I often get asked about comparing the HPCC Systems platform and Hadoop. As many of you probably know already, there are a number of substantial differences between them, and several of these differences are described here.

In a few words, HPCC and Hadoop are both open source projects released under an Apache 2.0 license, and are free to use, with both leveraging commodity hardware and local storage interconnected through IP networks, allowing for parallel data processing and/or querying across this architecture. But this is where most of the similarities end.

From a timeline perspective, HPCC was originally designed and developed about 12 years ago (1999-2000); our first patent around HPCC technology was even filed back in 2002, and HPCC was in production across our systems back in 2002. To put things in perspective, it wasn’t until December 2004 that the two researchers from Google described the distributed computing model based on Map and Reduce. The Hadoop project didn’t start until 2005, if I remember correctly, and it was around 2006 when it split from Nutch to become its own top level project.

This doesn’t necessarily mean that you couldn’t say that certain HPCC operations don’t use an scatter and gather model (equivalent to Map and Reduce), as applicable, but HPCC was designed under a different paradigm to provide for a comprehensive and consistent high-level and concise declarative dataflow oriented programming model, represented by the ECL language used throughout it. What this really means, is that you can express data workflows and data queries in a very high level manner, avoiding the complexities of the underlying architecture of the system. While Hadoop has two scripting languages which allow for some abstractions (Pig and Hive), they don’t compare with the formal aspects, sophistication and maturity of the ECL language which provides for a number of benefits such as data and code encapsulation, the absence of side effects, the flexibility and extensibility through macros, functional macros and functions, and the libraries of production ready high level algorithms available.

One of the significant limitations of the strict MapReduce model utilized by Hadoop, is the fact that internode communication is left to the Shuffle phase, which makes certain iterative algorithms that require frequent internode data exchange hard to code and slow to execute (as they need to go through multiple phases of Map, Shuffle and Reduce, each one of these representing a barrier operation that forces the serialization of the long tails of execution). In contrast, the HPCC Systems platform provide for direct inter-node communication at all times, which is leveraged by many of the high level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds for overhead from the JVM; in contrast, HPCC and ECL are compiled into C++, which executes natively on top of the Operating System, lending to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware).

The HPCC Systems platform, as you probably saw, has two components: a back-end batch oriented data workflow processing and analytics system called Thor (equivalent to Hadoop MapReduce), and a front-end real-time data querying and analytics system called Roxie (which has no equivalent in the Hadoop world). Roxie allows for real-time delivery and analytics of data through parameterized ECL queries (think of them as equivalent to store procedures in your traditional RDBMS). The closest to Roxie that you have with Hadoop is Hbase, which is a strict key/value store and, thus, provides only for very rudimentary retrieval of values by exact or partial key matching. Roxie, on the other hand, allows for compound keys, dynamic indices, smart stepping of these indices, aggregation and filtering, and complex calculations and processing.

But above all, the HPCC Systems platform presents the users with a homogeneous platform which is production ready and has been proven for many years in our own data services, from a company which has been in the Big Data Analytics business even before Big Data was called Big Data.

Compiler Tutorial, Step 4: Inlining

See the Introduction for this article here.

Previous chapter: Step 3: The Optimisation, and More Tests.

This step is based on the following commit:

Git commit: Inline dataset_from_transform
https://github.com/hpcc-systems/HPCC-Platform/pull/3323

Now that we have our DATASET running distributed on Thor, and we're quite happy with its behaviour, we can tackle some of the problems we created during the process.

The main issue is that, because some NORMALIZE were transformed into a DATASET(count), the compiler ended up joining them into one sub-expression (via CSE), and it is now a work-unit in itself. Normally, common code is a good thing, but in this case, the cost of spawning a new work-unit might not be worth the benefit of the DISTRIBUTED (when the inline table is too small).

In cases like these, we want to inline code.

The Inliner

The main inlining decision is taken on 'ecl/hqlcpp/hqlinline.cpp', when defining the inline flags. The main function that calculates if an activity has to be inlined or not is 'calcInlineFlags()'. In there, you need to put heuristics to check when can the expression be inlined.

In our case, we can't inline distributed datasets (or we'd have to replicate the engines' logic to know the ranges for every activity our dataset could be inlined, which doesn't make sense). Inlining a transform that has a SKIP could be done, but we have chosen not to, since the benefits would probably not outweigh the costs of supporting it in the long run.

case no_dataset_from_transform:
{
if (expr->hasProperty(distributedAtom))
return 0;
if (transformContainsSkip(expr->queryChild(1)))
return 0;
return RETassign;
}

RETassign is a flag telling we can inline, if we're using the dataset when assigning to a new variable without having to create a new temporary.

The function that will know how to inline, if the flag is non-zero, that is, is the 'buildDatasetAssign()' in 'ecl/hqlcpp/hqlcppds.cpp' (HQL CPP DataSet). As most of the other high-level functions in the compiler, it's just a dispatcher to more specialized functions, in this case, a new one we'll create.

case no_dataset_from_transform:
buildDatasetAssignDatasetFromTransform(subctx, target, expr);
return;

Coding the Inlined expression

On most compilers, inlining happens to already lowered expressions, as an optimisation pass on intermediate code. On our compiler, the inlining is, as you could see above, a step during the lowering of the expression.

It means you have the context in memory (selectors, sources, sinks), instead of having to calculate it afterwards, but it also mean you have to duplicate your generation procedures, producing very different code from each other.

Our new function will have to cope with all that an activity would, so it's not just a matter of overriding the specific members of the base activity class. We have to create the whole thing inside another activity.

We'll need to allocate memory for the rows, search the source dataset for a viable row, iterate through the counter and produce rows in a local sink dataset that will be used by the activity. It also means we have to account for range problems that the engines would.

Preamble

For instance, if the count is known to be zero or negative, nothing needs to be generated:

if (isZero(count) || isNegative(count))
return;

If the count is known to be a positive number, we just iterate. But if we're not sure (say, for a stored - run time - variable), we have to add a run time check.

if (couldBeNegative(count))
{
OwnedHqlExpr zero = createConstant(0, count->getType());
OwnedHqlExpr ifTest = createValue(no_gt,
makeBoolType(),
boundCount.getTranslatedExpr(),
LINK(zero));
buildFilter(subctx, ifTest);
}

Filter is the name of an IF statement, for historical reasons. The function 'buildFilter()' will update the context 'subctx', so that if it ends up building the IF block, it'll move the insertion point inside it. This means every new code that will be added (including the loop) below, will be added inside the loop.

If we wanted to have code outside of it we should have worried about keeping a reference to it, or nesting it with a function call, but we don't, so we just conditionally update the context.

Both functions 'isNegative()' and 'couldBeNegative()' were introduced in this commit, so it's good if you could examine it and familiarise yourself with the type system for the expression graph.

Note that we're creating constants of the same type of the counter, or the comparison could produce unexpected results (regarding sign).

It's not hard to see that this code is creating a boolean value with the result of a 'GreaterThen' comparison between a zero constant and our count expression. However, it's important to notice that we're using the 'boundCount', and not the original 'count' from the expression graph.

If we were to use the counter here and in the iteration just after that, the compiler would generate two function calls to get the counter value, which would be inefficient. So, we 'bind' the counter to a variable:

CHqlBoundExpr boundCount;
buildSimpleExpr(ctx, count, boundCount);

which will be translated to some temporary in the code, like 'v123', and use it for both the if and the for loop later on.

The Main Loop

The loop code is also simple and straightforward. First, we create the induction variable:

OwnedHqlExpr loopVar = subctx.getTempDeclare(counterType, NULL);
OwnedHqlExpr one = getSizetConstant(1);
buildAssignToTemp(subctx, loopVar, one);

Second, we compare it to the bound counter:

OwnedHqlExpr loopTest = createValue(no_le,
makeBoolType(),
LINK(loopVar),
LINK(boundCount.expr));

Finally, we increment the induction variable:

OwnedHqlExpr inc = createValue(no_postinc, loopVar->getType(), LINK(loopVar));

And create the loop:

subctx.addLoop(loopTest, inc, false);

Here, again, 'addLoop()' will change the insertion point, from now on, to inside the loop. Since we know now that we need to iterate anyway, and there's nothing for us to do outside the loop, we, again, will not bother with the lexical block outside out IF/FOR.

Finally, we need to create each row, from the selector of the first row of the source dataset into 'count' rows in the sink dataset.

First, we get the 'value' of the row, that will be generated by the given transform:

OwnedHqlExpr rowValue = createRow(no_createrow, LINK(transform));

Then, we create a physical row, and attach that value to it:

BoundRow * targetRow = target->buildCreateRow(subctx);
Owned targetRef =
buildActiveRow(subctx,
targetRow->querySelector());
buildRowAssign(subctx, targetRef, rowValue);

Finally, we finish the row, making sure memory is handled correctly:

target->finishRow(subctx, targetRow);

Note that the 'buildRowAssign' will take care of the code in the transform, and make sure that any complex transformations that could come from it, including child queries and complex filters, will be coded before actually assigning that value to the row. This is an important feature, if we want to support every possible combination that datasets have available as activities, when inlined.

Side Effects

So far, so good, the code looks to be inlined for the cases we were generating new work-units before. But there is a new issue when we run the compiler regression tests: one activity is inlining too much.

Because we do inlining as we export the code, the compiler failed to notice that some code could be joined up. This was detected and a new function was added to try to reach a balance when it was safe to hoist a dataset and when it was better to inline, called 'isEfficientToHoistDataset()'.

Because the problem was spotted on a resource, so 'ecl/hqlcpp/hqlresource.cpp' was changed on the spot of figuring out whether to hoist or not. The new function added could be in a more generic file, if this problem is about to happen on other places, but for now, it should suffice.

Testing

After applying the patch and running the standard set of regressions, we see that only the changes we were expecting (ie. the ones that improve code) have occurred.

This does not mean others won't happen in the wild, but do minimise the problems we might find in production code.

The added test cases, in both compiler and engines regressions, have accounted for the expected cases of negative numbers (constants or not), zero and variable counts. It also demonstrates, on the compiler regression, that the DISTRIBUTED DATASET does not inline, even having every property to do so otherwise.

The execution tests not only will prevent optimisations to destroy our inliner, but also show to the history what were the intentions when we coded it. It clearly shows that we're expecting no rows for zero and negative (and not an error, for example).

And with that, we finish our compiler tutorial.

Conclusion

Adding features, optimisations and fixing bugs in the ECL compiler is a bit of black magic, as any compiler, and help on the mailing list or the pull request reviews is much appreciated.

Having tutorials like this will help you start changes and ask much better questions, which saves time for you and other developers, but they won't tell you what to do in your case, nor help you with specific questions. Learning that is part of the process and best done if interacting with other developers.

Also, have in mind that the tests require deep knowledge of the ECL syntax and the syntax that is valid (as opposed to what is actually described in the docs), so you will be asking a lot of questions based on the differences you see in the test suites.

For further reference, please check the ECL reference manual and examples.

Compiler Tutorial, Step 3: The Optimisation, and More Tests

See the Introduction for this article here.

Previous chapter: Step 2: The Distributed Flag, and Execution Tests.

This step is based on the following commit:

Git commit: Optimise NORMALIZE to DATASET
https://github.com/hpcc-systems/HPCC-Platform/pull/2314/files

Recapitulating, we have created a new syntax that can run distributed in any number of nodes of the cluster. This is good, but not many people will be using it straight away, and there's a lot of old code that will still be performing sub-optimally on similar tasks that we can help speed up. Such is the case of NORMALIZE.

So, we start by coding in ECL all the things that we'd like to see transformed into the new syntax and all the things that we know we can't (for any reason) and put it into a test case ('ecl/regress/normalize-dataset-opt.ecl').

As you can see, we actually created two files, one for the compiler regression [1] and one for the engine regression suite[2], which should cover the basics on both fronts. The results key file was created after all changes were done and verified to be correct, and executed once to create the output file.

Where to Change?

As mentioned before, there are two main files that handle optimisation in the ECL compiler: 'hqlfold.cpp', which mainly works with simple in-place transformation to fold complex expressions into simpler ones, and 'hqlopt.cpp', which does more complex graph analysis to modify code based on ECL semantics. Our change is simple, and should go into 'hqlfold.cpp'.

The entry point for the folding is 'foldHqlExpression()', which creates an 'CExprFolderTransformer' and call 'transformRoot', which will walk the graph and find all possible ways of folding the child expressions. There are numerous possible paths for the code, depending on which type of expression it is and in which order and format they appear, but ultimately, all of them will call one common method, 'CExprFolderTransformer::doFoldTransformed'.

This method contains a big switch statement with all possible types of nodes with an implementation for each one of them. If we add our node here, wherever they appear in the ECL graph, the code will get here and hopefully folded.

The Change

Since we're looking to fold 'no_normalize' into 'no_dataset_from_transform', we have to introduce a case for no_normalize. In this case, we need to think of all possibilities, which to avoid, and how to fold the rest.

One thing we want to avoid is folding NORMALIZE that uses datasets with more than one row as inputs. That's because our new syntax doesn't account for that, so we'd have to come up with an implementation semantically equivalent, and that would complicate things (and invalidate our previous changes).

Another change we don't want to think of (at least for now), is the kind of NORMALIZE where the counter of the transform refers to the dataset's LEFT. That would make it harder to predict (at compile time) what kind of dependencies are there between the source and the sink.

if (!hasSingleRow(ds) || exprReferencesDataset(count, left))
break;

Now, if the transform in the NORMALIZE has no mention of a dataset (LEFT), our task is ludicrously simple:

From:
NORMALIZE(,10,transform(COUNTER))
To:
DATASET(10,transform(COUNTER))

We achieve that by removing the first child of the NORMALIZE and transforming the node into a DATASET:

First, we put the children into an array, starting from the second one (1):

HqlExprArray args;
unwindChildren(args, expr, 1); // (count, trans)

Then, we return a dataset of type 'no_dataset_from_transform':

return createDataset(no_dataset_from_transform, args);

Complications

If the transform does reference the dataset, we need to be careful. Since we're sure the dataset is not bigger than one line, and we're sure that the dataset is not too complex (the early aborts above), we can replace the LEFT references in the transform by the fields in the dataset. So, 'LEFT.foo' becomes 'ds[1].foo', or the value from the first row.

Remember that LEFT references are like COUNTER references, in so much as they point to a source. COUNTERs point to number sources while LEFT (and RIGHT) point to a data source in a remote dataset or transform.

So all we need to do is to update the reference to point directly to the source instead of its pointer. It's much like collapsing a pointer-to-pointer to a direct pointer to the object.

This will tell you if the transform references the dataset's selector:

IHqlExpression * transform = expr->queryChild(2);
OwnedHqlExpr left = createSelector(no_left, ds, querySelSeq(expr));
if (exprReferencesDataset(transform, left))

Now you need some ECL knowledge to know what to expect. (Remember, you can always ask on the list or rely on the code review process). Some examples of datasets that will trigger this transformation are:

DATASET(ROW(transform))
DATASET([transform()])
DATASET([value],{ myfield })

The first one is a 'no_datasetfromrow' and the two last are 'no_inlinetable'.

A dataset from row, as expected, has the row directly linked to the dataset as the first child. So, simply adding that row as the source for the new transform selector is enough:

case no_datasetfromrow:
IHqlExpression * row = ds->queryChild(0);
if (row->getOperator() == no_createrow)
newRow.set(row);
break;
}

The second syntax accepts a list of transforms, so we need to be careful. And once we have the list, we need to create a row from which to pick the selector:

case no_inlinetable:
{
IHqlExpression * transformList = ds->queryChild(0);
newRow.setown(createRow(no_createrow, LINK(transformList->queryChild(0))));
break;
}

The LINK here is a Link-Counted smart pointer. You need to be careful with those but we won't cover its usage in this tutorial.

Finally, we may have (or not) a new row in hand. If we do, we change the selector from the original transform for this new row and discard the original dataset and transform:

if (!newRow)
break;
OwnedHqlExpr replacementRow = createRow(no_newrow, LINK(newRow));
newTransform.setown(replaceSelector(transform, left, replacementRow));

In essence, we have created a new transform that does not reference the original dataset, but its own selector is the selector of the dataset itself. Now, we just need to replace the old transform by this one when creating the new dataset syntax.

if (newTransform)
args.replace(*newTransform.getClear(), 1);

Note that there are lots of 'break's in the code and we check if we actually created new constructs all the time. These checks will abort on the slight chance, and make sure we don't optimise something that we weren't expecting.

In the end, the new expression will only be returned if we're sure it's correct. If not, it'll just return the same expression, so the caller can compare if it's the same (ie. not optimised) and continue.

Side Effects

When changing the compiler, and trying to run the regression suite, you'll see that sometimes the compiler fails on some files, which is easy to fix, but sometimes it'll produce slightly different results, which can be very hard to fix. It's often the case that one small change triggers another bigger, that itself triggers another and so on.

Luckily, in our case, there wasn't a big change, but we found that some PROJECTs weren't cleaning unused fields in a DATASET when that node type was 'no_dataset_from_transform'.

A quick look on the compiler files, and some questions on the list, gets us to 'hqliproj.cpp', which does the project transformations. The core function is 'ImplicitProjectTransformer::createTransformed'.

If you remember that our dataset is a source (not a transformation), we need to look for nodes similar to ours. Most of the 'activityKind's in the main switch are related to transformations, or sinks or records. There is one, however that is called 'CreateRecordSourceActivity', deals with 'no_inlinetable' and has the following comment:

//Always reduce things that create a new record so they only project the fields they need to

Bingo! But looking at the code, you see that the inline table has a list of transforms, while our dataset has only a transform. So we need to do the same thing with no_dataset_from_transform, but getting a transform directly instead.

Also, we need to make sure nodes of type 'no_dataset_from_transform' get marked correctly as 'CreateRecordSourceActivity', and for that, we find where it is set for 'no_inlinetable', which is 'ImplicitProjectTransformer::getProjectExprKind'.

Final testing

With these changes, the un-optimised PROJECTs end up again optimised, and all looks good. Run the compiler test in Thor or Roxie and see the output. If it makes sense, and you got all the corner cases you could think of, record the output as a key file and add it to the regression test (both ECL and XML files), and it's time to submit a pull request.

References

See reference [1] and [2] here.

Follow-up

The next step on the tutorial is Step 4: Inlining.

Compiler Tutorial, Step 2: The Distributed Flag, and Execution Tests

See the Introduction for this article here.

Previous chapter: Step 1: The Parser, The Expression Tree and the Activity.

The Compiler Part

Git commit: Add DISTRIBUTED flag to DATASET(N, trans(COUNTER))
https://github.com/hpcc-systems/HPCC-Platform/pull/1338/files

Now that we have a working version of the new syntax, we need to make it worth the effort. Our original intent was to make it distributed (instead of running only on the master node), so let's do this.

Remember, that the DISTRIBUTED flag will only work in Thor, since Roxie will only execute the 'inlinetable' on a single node by design.

The Parser

First, we need to add the flag to the Yacc file. We can just add it directly, but we can also prepare for the future, when you could add more flags to this syntax that are also supported by other similar constructs.

We add three new items: 'optDatasetFlags', which can contain a number of 'datasetFlags' which builds a "comma" list of all 'datasetFlag' items. The final 'datasetFlag' can be one of many options, but for now, we just have the one "DISTRIBUTED", which will just create a new attributed bound to the atom 'distributedAtom'. See the commit above on 'hqlgram.y' for more information.

Again, if you remember from previous steps, this is just a way to name a token, so later we can retrieve this information and change the back-end accordingly. On your original line, you just need to add 'optDatasetFlags' at the end of the list (before the final parenthesis).

| DATASET '(' thorFilenameOrList ','
beginCounterScope transform endCounterScope
optDatasetFlags ')'

and now you have an attribute that might (or might not) be on the DATASET() declaration.

The Helper

An attribute is not always treated on the same place. Some attributes are used for compilation purposes, others for execution purposes. Since we don't want to create two activities, one distributed and one not (exponential growth when you add more attributes), we need the activity to know what this means.

One way to do that would be to change the 'CThorTempTableArg' interface to pass this parameter to the activity, but here we reach a conundrum: The old temporary table helper didn't have any flag other than 'isConstant'. We could add another method 'isDistributed', but since we prepared ourselves for the future above (by allowing a list of flags to be passed), this would be at least short-sighted.

Other helpers use a much more flexible 'getFlags' functionality, where you only need to change the enum that it returns (by adding new nodes, deprecating old ones, like we did before in the AST nodes) and the underlying activity doesn't have to change.

This is important for old compiled code that would take ages to compile again on a new platform, since it'd pull out a long list of other dependencies and possibly disrupt production, etc. We tend to require a new compilation only when it's extremely necessary, or when there is a new major version[3].

Git commit: Replacing TAKtemptable for TAKinlinetable
https://github.com/hpcc-systems/HPCC-Platform/pull/1987/files

For this reason, and to make it future proof, we decided to create a new helper interface called 'IHThorTempTableExtraArg' (which was later changed to 'CThorInlineTableArg', see the commit above), that will account for 'getFlags' and allow further expansion without new changes in the interface.

So, we merge the current flag (constant) with the new one (distributed):

enum
{
TTFnoconstant = 0x0001, // default flags is zero
TTFdistributed = 0x0002,
};

This deserves some explanation. The default flag result is zero, as you can see on every other helper and the default behaviour of that specific helper was that the dataset was constant. So, to match the behaviours, we introduced a non-constant flag to tell otherwise. 'dynamic' is not the correct description.

Lots of comments were added to indicate that the old behaviour is deprecated and that on future versions of the platform (4.0 in this case), those helpers should be removed.

The Back-End

Finally, we need to generate the function in our 'BuildActivity' function. This is not too complex, though, since we just need to 'OR' all flags into an unsigned int and build an unsigned function using a helper. However, since other helpers change (with the new interface), we need to change them as well.

Old versions of those helpers (compiled before this version goes out), will contain only the 'isConstant' function, and all new ones will contain both, with the 'isConstant' calling 'getFlags' for portability.

So, for portability, we refactor the 'getFlags' into a new function, called 'doBuildTempTableFlags', which simply join the flags on 'doBuildTempTableFlags':

StringBuffer flags;
if (expr->hasProperty(distributedAtom))
flags.append("|TTFdistributed");
if (!isConstant)
flags.append("|TTFnoconstant");
if (flags.length())
doBuildUnsignedFunction(ctx, "getFlags", flags.str()+1);

Which will only override 'getFlags' if there is any flag that need setting other than the default behaviour. Again, this setting of flags is done elsewhere in the code and could be more automated, but that's for a different tutorial.

Now, we need to make 'isConstant' use 'getFlags', in 'CThorTempTableArg':

virtual bool isConstant() { return (getFlags() & TTFnoconstant) == 0; }

So, old code that relies on this information can still work.

Finally, we need to change all other helpers to export the 'getFlags' instead of 'isConstant'. In 'hqlhtcpp.cpp' you need to search for 'isConstant' and make sure they relate to TempTables. One example:

From:
if (!isConstantTransform(transform))
doBuildBoolFunction(instance->startctx, "isConstant", false);
To:
doBuildTempTableFlags(instance->startctx,
expr,
isConstantTransform(transform));

Tests, Tests, Tests

The previous test case in the compiler had a DISTRIBUTED case added, and its generated code examined and thought to be correct. Note that in this commit we did add an execution test, though it's still not executing distributed, because the engines have no knowledge of out new flag.

The Engines Part

Git commit: Implement DISTRIBUTED DATASET(COUNT) in Thor
https://github.com/hpcc-systems/HPCC-Platform/pull/1965/files

To add the new functionality on the engines, it's best if you add them all at once, which makes for a big change. If you skim through the pull request, you'll see that our three engines (Roxie, Thor and HThor) have very similar inner-workings. So, although big, this change is more mechanical than the others.

If you're still paying attention, you'll notice that we started by saying we could use an existing activity (TempTable), but ended up creating a new one (InlineTable), so the engines haven't got a clue what to do with that code. And since we want independent code form the previous form (so we can deprecate it more easily later), we need to create three different activities, one for each cluster. This shouldn't be strictly necessary, but the engines are designed radically different, which might change in the future.

And, for the engines to acknowledge the existence of such activities, like in the compiler, we need to create the nodes on the global switches:

In HThor ('ecl/eclagent/eclgraph.cpp'):

case TAKinlinetable:
return createInlineTableActivity(agent,
activityId,
subgraphId,
(IHThorInlineTableArg &)arg,
kind);

In Roxie ('roxie/ccd/ccdquery.cpp'):

case TAKinlinetable:
return createRoxieServerInlineTableActivityFactory(id,
subgraphId,
*this,
helperFactory,
kind);

In Thor ('thorlcr/slave/slave.cpp'):

case TAKinlinetable:
ret = createInlineTableSlave(this);
break;

Again, the same issue of finding the other places to put a case in the compiler apply here. See discussion in step 1 and the pull request above for some hints.

Each of this functions will simply create a new Roxie/Thor/HThor activity instance and return it. They're used to create the class structure that implement the activities in the engines.

Non-Distributed Activity

Getting the HThor's example ('ecl/hthor/hthor.cpp'), since it's the simplest of all, the two main functions you need to implement are 'ready' and 'nextInGroup'. The former will prepare all data prior to execution and the latter will give you the next item in the list, like an iterator.

As you can see in the commit, the 'start' prepares the 'numRows'. That happens because the construction of the activity can be at a time when we still don't have full information on the activity yet, so the real initialisation is done when we're actually ready to start.

Also, if the DATASET is in a child query, there can be multiple calls to 'onStart()', but in any case, there will be only one call to 'onCreate()'.

The 'nextInGroup' is an iterator that will provide the stream of input through the activities' code. If you follow the same style as other table activities, you'll see that a simple iteration will solve this problem:

while (curRow < numRows)
{
RtlDynamicRowBuilder rowBuilder(rowAllocator);
size32_t size = helper.getRow(rowBuilder, curRow++);
if (size)
{
processed++;
return rowBuilder.finalizeRowClear(size);
}
}

A 'RowBuilder' will allocate enough space for the row and if the the row has any content, it'll increment the counter and proceed.

The size had a specific purpose before this change, which was to let the helper control the counter by returning zero when there was no more rows to process. However, that would call the helper's 'getRow' more times that it needed, since the activity had no way to know it had stopped, or was just a temporary fluke.

Distributed Activity

Unlike HThor (that is just a test engine), Thor and Roxie actually execute jobs in parallel. In this particular commit, the Roxie activity is still executing in serial, but the Thor activity is parallel, so let's have a closer look at it.

In the serial algorithm, the start/ready phase sets the number of rows, but in a parallel version, we need to make sure we set these boundaries to our block of the problem. In our case, it's not too hard to do, since we always have a range of 1 to N elements in our DATASET.

So, our parallel 'start' has to account for the different blocks to process. We do that by simply taking the node ID (which logical node it's running on) and calculate the range:

startRow = (nodeid * numRows) / nodes;
maxRow = ((nodeid + 1) * numRows) / nodes;

If the DATASET is not distributed, than only the 'firstNode()' compute (maxRows set to max), while others wait (maxRows set to 0). The 'nextInGroup' implementation is very similar to the parallel version (another benefit of this approach).

References

[3] See VERSION at the root of the git tree

Follow-up

The next step on the tutorial is Step 3: The Optimisation, and More Tests.

Compiler Tutorial, Step 1: The Parser, The Expression Tree and the Activity

See the Introduction for this article here.

This part of the tutorial refers to the commit bellow:

Git commit: DATASET (N, transform(COUNTER))
https://github.com/hpcc-systems/HPCC-Platform/pull/1285/files

This step has many sub-steps, but since we must keep consistency, we can't just add features in one place (say, the parser) without adding it to the rest. So, this pull request changes all places needed to add a new syntax, although a syntax that doesn't add new activities (so, no changes in the engines).

Feel free to have a peak at the code now and realise that there are not many changes to do. The problem, however, is to know where and how to change. At hindsight, it always looks simple.

The Parser

Finding the Place

The parser file, as any Yacc file, is very hard to read, understand the connections between tokens and even harder to make changes without breaking it. Anyone that has even tried to change a Yacc file will tell you the same, so don't worry if the first 10 attempts to change it fails miserably.

The first step is to add it to the Parser, so new ECL can be transformed into an expression tree inside the compiler. To do so, we must open the 'hqlgram.y' file and look for similar syntax.

Since ECL syntax, like in most languages, uses parenthesis to aggregate arguments, you have to look for "DATASET '('" pattern to see where the other DATASET constructs are declared. Remember, we're trying to find similar syntax to be able to reuse or copy code from other constructs to our own, so the closer we get to what we want, the better.

The first matches (from the top of the file) won't give you the right place. Since DATASET() has many uses, we need to skim over the false-positives. Ex:

DATASET '(' recordDef ')' only declares a potential dataset from a record
DATASET '(' dataSet ')' same, but using another dataset's record
VIRTUAL DATASET '(' recordDef ')' nothing in there...
DATASET '(' recordDef childDatasetOptions ')' only adding a field...

Then, around line (8170), you get to:

DATASET '(' thorFilenameOrList ',' dsRecordDef ...

That's big, ugly and doing a lot of things, lets see what it is.

thorFilenameOrList is an expression, that's promising...

Initially, an 'expression' was used, but 'thorFilenameOrList' (maybe not the best name) can cope other features (for example, initialization with lists of rows, file names and expressions), that removes the ambiguity of the parser.

So that's a good place to start looking at the code inside and see what it does. If you look at each implementation, you'll see how they manipulate data and how they convert the ECL code into the abstract syntax tree (AST). Example:

origin = $3.getExpr();
attrs = createComma(createAttribute(_origin_Atom, origin), $8.getExpr());

The third token gets into the 'origin' expression, 'attrs' is a "comma" (ie. a list of expressions) containing an "origin" Atom referencing 'origin' and the 8th token in the list.

Atoms are identifiers, or known properties. That code is saying that it knows a thing called "origin" and the associated expression is 'origin'. This is used by the tree builder and optimiser later on to identify features, filter and common up expressions based on those attributes.

Adding a new DATASET

Now it's time to think what we need to do. We need to create a dataset, from a transform, using a counter. Since you've got a counter to handle, we need to think that are the consequences of having one. Go to the language reference doc and look for other constructs that have counters, and you'll find many, for example, NORMALIZE. Searching for "NORMALIZE '('" in the parser file, you'll find a curious token:

beginCounterScope / endCounterScope

Scopes of counters are necessary when you have nested transforms and each one has its own counter. You can't mix them, or you'll get wrong results (ie. an internal counter being updated by an external increment). You'll also see that the "counter" object is bound to the end of scope token.

A counter is not a number, but a reference to a number source. In our case, the "count" is the number source, which can be anything from a constant to an expression to a reference to an external object. Following the NORMALIZE case, you get to something like this:

DATASET '(' thorFilenameOrList ',' beginCounterScope transform endCounterScope

That represents a DATASET(expr, TRANSFORM(COUNTER)). Which can be implemented as:

IHqlExpression * counter = $7.getExpr(); // endCounterScope
if (counter)
counter = createAttribute(_countProject_Atom, counter);

As you have seen in the NORMALIZE case, you must name it a "counter", by creating an attribute with a count Atom to it. The name "countProject" is maybe not the best, though.

$$.setExpr(
createDataset(
no_dataset_from_transform,
$3.getExpr(),
createComma(
$6.getExpr(),
counter)));

With this you create a dataset, naming the operation "no_dataset_from_transform" (that needs to be added to the list of operations, you'll see later), with the counter source expression ($3) and a "comma" with the transform ($6) using a counter. The comma is necessary because the createDataset() accepts a list of operands.

If you follow the NORMALIZE and other DATASET examples, you'll see that you also need to normalise the numeric input, to make sure it's a valid integer:

parser->normalizeExpression($3, type_int, false);

and update the parser's context position to the current expression:

$$.setPosition($1);

However, just adding that code created ambiguities on the other DATASET constructs. This is because of the counter object conflicting with other declarations. One way to fix it (yacc style) was to add the counter object to all conflicting constructs and introduce an error message. So, we're moving the fact that counters are not meaningful on those other constructs, from a syntax error to a semantic error. Check the pull request on 'hqlgram.y' for more information on that.

The AST

Now that we have a node in our AST being built by the parser, we need to support it. The first thing you do is to the node operator's enum, in 'hqlexpr.hpp'. Find the first 'unusedN' item and replace it with your own:

no_some_other,
no_dataset_from_transform,
unused10,

Unused items were once real nodes that got removed from the language. To keep backward compatibility with old compiled code, we keep the same order and id of most of our enums. Adding a new node at the end would also work, but would leave a huge trail of unused ones in the middle.

You can now try to compile your code and execute the compiler regression[1] to make sure you haven't introduced any new bug. If all is green, create an example of what you expect to see in ECL:

r t(unsigned value) := transform
SELF.i := value;
end;
ds := DATASET(10, t(COUNTER));
OUTPUT(ds);

This should create a dataset with items from 1 to 10. If you run this example through your code, you'll see that it'll fail in multiple places. That's because the compiler has no knowledge of what to do with your node. One way to do that is to keep running that code through your compiler and fixing the failures (ie. adding no_dataset_from_transform cases to switches with appropriate code).

What is appropriate code? Well, that varies. If you see in the pull request, there are many places where the new node was added, and each had a different case. Examples:

@ bool definesColumnList(IHqlExpression * dataset)
case no_dataset_from_transform: return true;
@ const char *getOpString(node_operator op)
case no_dataset_from_transform: return "DATASET";
@ childDatasetType getChildDatasetType(IHqlExpression * expr)
case no_dataset_from_transform: return childdataset_none;

How do you know these things? Normally, it's either obvious or very hard to answer. Putting the wrong value, in this case, won't do much for the other constructs (since we're restricting to no_dataset_from_transform), but it may give you the wrong sensation of success. It might work for the cases you have predicted, but it might fail for others, or simply be wrong.

Our suggestion is to use a mix of trying for yourself and asking on the list, but rest assured that, if you ever get right results with bad code, we should be able to spot it on pull request reviews. Adding comments to the code, to make that task easier is a work in progress.

One quick example of ECL's AST folding (more to come on next step), is to fold null datasets. In the pull request above, check the file 'hqlfold.cpp'. It's checking for a zero sized count (via 'isZero', which checks if the expression evaluates to zero, not if the value is a constant zero), and replaces it with a null expression (via 'replaceWithNull').

The Activity

The AST will be translated into activities. These activities are coded in C++ by the compiler's back-end, collected into a driver that will execute the graphs in order and compiled again to a shared object. Each shared object is a workunit, that are executed in the HPCC engines (Roxie, Thor).

Each activity has a helper, and that's the class you'll have to implement from the AST node, so the activities in the engines can use it to execute your code. You need to find out what existing activity (if there is one) maps to the same functionality as your node. In our case, we're very lucky that there is one activity perfect for this job, but the question is, how to find it?

All activities implement the IHThorActivity interface, so you can consult the 'ecl/hthor/hthor.hpp' file and list all base classes of that interface (aka. pure virtual struct). I assume your IDE does that for you, if not, 'hthor.cpp' and 'hthor.ipp' will give you plenty of material to read.

If you look thoroughly enough, or ask an experienced developer, you'll find out that 'CHThorTempTableActivity' is the activity you're looking for. The reason being that this activity builds a new dataset from scratch (ie. it's a source) and it does so in a simple way. This activity is generated by the compiler and also the syntax:

DATASET(my-set, { single-fielded-record }).

If there is no activity that could possibly be mapped to this new AST node, you will have to create one and make all engines implement it. Choosing which one to derive from follows the same logic described above, though.

Now that we know what activity we're aiming towards, we'll try to export our AST node to it, so it'll automatically be executed by all engines. To do so, you must add a new 'BuildActivity' in the ECL-to-CPP translator. The translator is implemented by the class 'HqlCppTranslator' in 'ecl/hqlcpp/hqlhtcpp.cpp'.

You need to add a hook in 'buildActivity' to call your function when the case is a no_dataset_from_transform. Following the convention in the rest of the class, we called it 'doBuildActivityCountTransform'. Your new function will probably accept the context and the expression node.

Building an Exporter

An exporter function will generate a helper class, that derives from your base object implemented from the interface named above. Our activity's helper is an 'IHThorTempTableArg' which is partially implemented by 'CThorTempTableArg'.

Your class will have to base 'CThorTempTableArg' to re-use its generic methods and to have the link-counted logic that all classes do. The main methods that 'CThorTempTableArg' haven't implemented from 'IThorTempTableArg' are:

size32_t getRow(ARowBuilder & rowBuilder, unsigned row);
unsigned numRows();

'numRows' will return the number of rows, so the activity can stop at the right time or allocate the right amount of memory beforehand. 'getRow' will return the next row in line, and in our case, update the internal counter.

So, first we need to create a "TempTable" activity instance:

Owned instance =
new ActivityInstance(*this, ctx, TAKtemptable, expr,"TempTable");
buildActivityFramework(instance);
buildInstancePrefix(instance);

and override the functions we need. See that we're passing the context (where in the resulting C++ file we're writing to) to the activity, so our instance has its own context. From now on, we have to re-use the context of the activity to write any code in, otherwise it'll be written on the outside of the class and we would get a compilation error.

A simple function is to check if the activity is constant. Normally it is, for temp tables, so the default behaviour is true. But in our case, the transform might not be constant (depend on external factors which we can't predict), so we need to change it accordingly:

// bool isConstant() - default is true
if (!isConstantTransform(transform))
doBuildBoolFunction(instance->startctx, "isConstant", false);

See that we're using a 'doBuildBoolFunction' function, which is a wrapper to create a new function that returns a boolean from the result of the expression, in our case, "false".

This means, we only override the function on those activities that really need and that we know at compile time. If we didn't know it, we would have to pass an expression to be evaluated, and override that function every time. Of course, compile-time evaluation is always better, so in this case, we use it.

Overriding 'numRows' is also simple, just by passing the count (whether an expression or a constant), and the wrapper will take care of the rest:

// unsigned numRows() - count is guaranteed by lexer
doBuildUnsignedFunction(instance->startctx, "numRows", count);

However, 'getRow' is somewhat more complicated. We have to build an expression to account for the counter's increment, but also make sure we keep it attached to the counter object (that, if you remember, is just a reference to a number), so the transform can use it. We have to do the function building ourselves.

Start a new context (inside the activity's context):

BuildCtx funcctx(instance->startctx);

Add a function declaration (this could be more automated):

funcctx.addQuotedCompound("virtual size32_t getRow(ARowBuilder & crSelf, unsigned row)");
ensureRowAllocated(funcctx, "crSelf");

And bind the cursor's selector to the counter object ('row' is the current id):

BoundRow * selfCursor = bindSelf(funcctx, instance->dataset, "crSelf");
IHqlExpression * self = selfCursor->querySelector();
associateCounter(funcctx, counter, "row");

And finally, build a transform's body using the bound counter (self):

buildTransformBody(funcctx, transform, NULL, NULL, instance->dataset, self);

With this, your class will be exported as overriding a CHThorTempTableActivity's helper (CThorTempTableArg), so whenever that node of the graph is executed in any engine, the workunit (shared object calling graphs to execute) will pass it to the engine, which will use your methods to build a dataset.

The Test

Now that we have the basic functionality working, we need to make sure we will be able to handle all new cases we're considering (and hopefully make sure we fail where we should fail). To do that, add a test case with the possible syntax you'll expect to work, and one with the things you expect will fail (note that the pull request mentioned doesn't have this!).

If all of them compile, you're on the right track. But you have to investigate the generated C++ code to make sure it's doing what you expect it will. Since we are only re-using an activity, the activity you generate has to be similar to other implementations of the same activity. To check that, check the other files that implement the 'CHThorInlineTableActivity' and compare.

You need to be careful with the exact code. We want 'numRows' to reflect the uncertainty passed via ECL, on stored variables, for example. We want the 'getRow' to make sure it's updating the counter every time it returns a row and that it does so at the right time. See:

Git commit: Dataset count range starts at 1
https://github.com/hpcc-systems/HPCC-Platform/pull/2165/files

The pull request above fixes a bug where the original code assumed the rows started at zero, when in ECL they actually start at one.

If you are lucky enough to not have to add activities with your changes, you can also add tests to the regression suite[2] to make sure the output of your new class is in sync to what you expected in the first place, possibly using the same (or a similar) test as you used in the compiler regression.

Once you're happy with the output, no other test fails and the failures on your new construct are being caught on your negative test, you're ready to submit this pull request.

References

[1] The compiler regression suite is a diff-based comparison, where you run it with a clean top-of-tree version (of the branch you're targeting to) and run again with your changes and diff the results (logs, XML of intermediate code and resulting C++ files).

A tutorial on how to run the regressions and how to interpret the differences (which can be daunting, sometimes) is on the way of being created. Please, refer to 'ecl/regress' directory and the regression scripts 'regress.sh' or 'regress.bat' in it for more information in the meantime.

[2] The regression suite is a set of tests that compiler and execute code on all three Engines (Thor, Roxie and HThor). Please refer to 'testing/ecl' directory for more information.

A tutorial on how to run the regression suite (not the same as the compiler regressions above) should take a while, since the underlying technology is changing. Ask on the mailing list for more info.

Follow-up

The next step on the tutorial is Step 2: The Distributed Flag, and Execution Tests.

Compiler Tutorial, Introduction

This tutorial will walk you though adding a new feature in the compiler, making sure it executes correctly in the engines, and performing some basic optimisations such as replacing and inlining expressions.

When adding features to the compiler, there are two main places where you have to add code: the compiler itself, including the parser, the expression builder and exporter, and the engines (Roxie, Thor and HThor), including the common graph node representation.

You need to make sure all possible variations of your new construct will work, not only by itself, but in conjunction with other features of ECL, by creating exhaustive tests on both compiler and regression suites.

Finally, we'll see how to add flags, optimise another query into your optimal new construct and allow them to be exported inlined.

The aim to this text is to appear as a PDF document to guide people changing the ECL compiler, but I have decided to post it in full on the blog, as a request for comments as well as providing early access to it.

The Feature

This walk-through is based on the implementation of:

DATASET(count, TRANSFORM(..., COUNTER, ...))

This DATASET syntax will execute the TRANSFORM 'count' times, passing it as a parameter to it, where numerical fields are expected, to build incremental datasets. This feature is useful for creating test tables, where the data is used to test other features, or accessory tables, when joined with other tables could help you organise them.

There was another syntax that is used to achieve the same functionality, when the dataset had only one ROW:

NORMALIZE(dataset, count, TRANSFORM(..., COUNTER, ...))

This syntax is not clear to what is its intentions and sometimes required the creation of a dummy dataset, which made code less readable. We also wanted to make that operation distributed across the nodes, and to do so on a syntax that is already known and complex (like NORMALIZE, with so many other uses) was more complex than on a new syntax.

So, it was clearer (and easier) to add a new simple (and meaningful) syntax, get NORMALIZE to optimise to it on certain conditions, and distribute the DATASET.

We'll follow the commits in Github as a real-world annotated walk-through on how to implement new features in the compiler, new activities in the engine and provide a way to test it. It might not be the optimum path, but it is a real one and will help you understand the kind of problems we try to solve and how we do it in the wild.

Each step will be referenced by its pull request in GitHub, so you can refer to them as a complement of this tutorial.

The Files

All compiler files are within the directory 'ecl/hql' in the source tree, including the parser, tree builder, optimisers and exporters. You'll add your new feature on those files, and you'll need some tests under 'ecl/regress' to make sure the compilation part of the process is sane.

We use bison to generate our parser from Yacc files. The main file holding the whole grammar is 'hqlgram.y'. This file contains all definitions, reserved keywords and general structure of the language. 'hqlexpr.cpp' is the core of the tree builder, while 'hqlopt.cpp' and 'hqlfold.cpp' are the main optimisers, the former for general optimisations and the latter mostly for folding expressions.

Roxie activity files under `roxie/ccd`, Thor's under `thorlcr/activities' and HThor's under 'ecl/hthor'. Those files need to be changed if you're adding not only a new syntax (ie. a different way of performing the same activity), but also a new activity, or at least, changing the way the activity is executed.

The contents of this tutorial are expanded into the next four posts:

Step 1: The Parser, The Expression Tree and the Activity.
Step 2: The Distributed Flag, and Execution Tests.
Step 3: The Optimisation, and More Tests.
Step 4: Inlining and Conclusion.

The importance of ETL for Machine Learning on Big Data

As I was preparing the Keynote that I delivered at World-Comp'12, about Machine Learning on the HPCC Systems platform, it occurred to me that it was important to remark that when dealing with big data and machine learning, most of the time and effort is usually spent on the data ETL (Extraction, Transformation and Loading) and feature extraction process, and not on the specific learning algorithm applied. The main reason is that while, for example, selecting a particular classifier over another could raise your F score by a few percentage points, not selecting the correct features, or failing to cleanse and normalize the data properly can decrease the overall effectiveness and increase the learning error dramatically.

This process can be especially challenging when the data used to train the model, in the case of supervised learning, or that needs to be subject to the clustering algorithm, in the case of, for example, a segmentation problem, is large. Profiling, parsing, cleansing, normalizing, standardizing and extracting features from large datasets can be extremely time consuming without the right tools. To make things worse, it can be very inefficient to move data during the process, just because the ETL portion is performed on a system different to the one executing the machine learning algorithms.

While all these operations can be parallelized across entire datasets to reduce the execution time, there don't seem to be many cohesive options available to the open source community. Most (or all) open source solutions tend to focus on one aspect of the process, and there are entire segments of it, such as data profiling, where there seem to be no options at all.

Fortunately, the HPCC Systems platform includes all these capabilities, together with a comprehensive data workflow management system. Dirty data ingested on Thor can be profiled, parsed, cleansed, normalized and standardized in place, using either ECL, or some of the higher level tools available, such as SALT (see this earlier post) and Pentaho Kettle (see this page). And the same tools provide for distributed feature extraction and several distributed machine learning algorithms, making the HPCC Systems platform the open source one stop shop for all your big data analytics needs.

If you want to know more, head over to our HPCC Systems Machine Learning page and take a look for yourself.

Flavio Villanustre

Contact Us

email us   Email us
Toll-free   US: 1.877.316.9669
International   Intl: 1.678.694.2200

Sign up to get updates through
our social media channels:

facebook  twitter  LinkedIn  Google+  Meetup  rss  Mailing Lists

Get Started