GSoC final pencils down date is finally here!

Today at 19:00 UTC is the final pencils down date and also marks the opening of the GSoC final evaluation period.

All coding, unit tests and supporting documentation must now be complete. Both our GSoC students are on target to make this deadline. So how have the projects gone?

Expand the HPCC Systems® Visualizations Framework

Anmol Jagetia’s work involved adding unit tests and linting, as well as adding new visualization widgets and enhancing existing ones. He used his existing experience to enhance our build quality infrastructure and has also added a range of new features to the existing framework, including a time-lapse capability and a number of features that enable bar charts to be used as Gantt charts. The work he has done will significantly improve the user experience.

Add Statistics to the Linear and Logistic Regression Modules

Sarthak Jain has completed the work required for this project and the new statistics will be added to the HPCC Systems® Machine Learning Library.

The statistics he added provide metrics that indicate the ‘goodness’ of the model created. He completed the tasks associated with these statistics in very good time. So when one of our modelling groups asked for some additional statistics to be added, Sarthak agreed to add those too.

He added 3 stepwise functions to the same modules, which search for the best model by adding or removing independent variables. A ‘goodness’ metric was also added to select which independent variables are added to or removed from the model. The 3 functions he added were Forward, Backward and Bidirectional.

Sarthak has certainly made a valuable contribution to our Machine Learning Library. It is of direct benefit to one of our own teams, and it also provides everyone who uses the Linear and Logistic Regression Modules with a solid set of statistics that give much better insight into the models created.

Integrating these features into HPCC Systems®
So when can you get your hands on the new features and enhancements added by our GSoC students and interns who have been working with us this summer?

Most will be made available as part of the HPCC Systems® 6.0.0 release targeted for the beginning of 2016. Some may be available sooner depending on how tied they are to new platform code added as part of that release.

We’ve asked all students who have worked with us this summer to create a short YouTube video or presentation about their project and experience working with us. I’ll repost with links to these and confirm details about the availability of the new code later.

We'll also be showcasing the work all our students have completed at the HPCC Systems® Engineering Summit at the end of September.

What about next year?
Finally, well done and thanks to all our students and mentors who have worked so hard this summer. It's been a great experience to be part of GSoC. We'll definitely apply again next year, so make sure you keep an eye on the HPCC Systems® GSoC Wiki and Forum, and keep checking here for more posts about the student programs and projects available.

We also plan to run the HPCC Systems® Summer Intern Program again next year too. If you are interested in this program, email Lorraine Chapman or Molly O'Neal for more details.

Notes:

1. Read Anmol Jagetia's blog to find out more about his visualizations project.

2. We will be updating our ideas list and making other changes to the HPCC Systems® GSoC Wiki later in the year.

3. To find out more about the HPCC Systems® Machine Learning Library see the Machine Learning Library Reference.

Results of the HPCC Systems® Summer Intern Projects 2015

Here we are, well into August already. Our two HPCC Systems® interns are putting the finishing touches to their projects, completing documentation and submitting evaluations about their time working with us.

Implement the CONCORD algorithm into the HPCC Systems® Machine Learning Library

Syed Rahman’s project is now complete. The CONCORD algorithm is a method for estimating the true population co-variance matrix. The co-variance matrix is a summary of the relationship between every pair of fields in the data. Values close to zero indicate that the fields have no relationship; when the values are normalised as correlations, values close to 1 indicate a positive relationship and values close to –1 indicate an inverse relationship.

In classical statistics there are many more observations than fields. In that case, the sample co-variance matrix is a good estimate of the true co-variance matrix.

Unfortunately, in big data there are many cases where the number of fields exceeds, or comes close to, the number of observations. In those cases the sample co-variance matrix is a very poor estimate of the true co-variance matrix.
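For reference, the sample co-variance matrix discussed above is the standard textbook estimator (nothing here is specific to CONCORD):

    S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T

where n is the number of observations, each x_i is one observation (a row of field values) and \bar{x} is the mean of those rows. When the number of fields approaches or exceeds n, S becomes ill-conditioned or singular, which is precisely the situation that motivates an estimator like CONCORD.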

It’s clear that Syed’s addition to our Machine Learning Library is an important improvement, providing a way to get more reliable results in this area.

Syed is speaking at the HPCC Systems® Engineering Summit at the end of September this year. His presentation demonstrates how this algorithm works and why it is a better method of getting the true population of a co-variance matrix. I’ll post a link to the recording of his presentation as soon as it is available.

Improving Child Query Processing Project
Anshu Ranjan has also just completed his time as an intern with us. He’s been working on this project with his mentors Gavin Halliday and Jamie Noss.

While this project has had limited success, there are good reasons why that should be the case. The code generator is pretty much the engine room of the system; as a result, working on it requires not only an eye for detail but also a good overview of the entire system and how each component interacts with the others. Gavin Halliday is our resident expert in this area and, through years of experience, knows it inside out. So we know that it is a complex and challenging area.

One of the things we hoped to get out of this project was some feedback on how to improve the internal documentation for developers so that others in the future can contribute to the codebase. Having a student working in this area has certainly helped us to highlight some specific improvements we can make to our internal documentation and we have already made some as a result of Anshu joining the team.

And finally...

Our thanks go to Syed and Anshu for contributing to HPCC Systems®. It's been great having them work with us and we appreciate the work they have done.

We also wish them well while they complete their studies and decide on a future career. Perhaps our paths will cross again sometime in the future!

Notes:

1. Read Syed Rahman's blog to find out more about the CONCORD algorithm.

2. To find out more about the HPCC Systems® Machine Learning Library see the Machine Learning Library Reference.

3. Read Anshu Ranjan's blog to find out more about the Improving Child Query Processing project.

Interns working on HIPIE plugin and visualizations

I want to share some additional information about 2 other students who joined our intern program this summer.

Michael Tierney and Evan Sheridan are graduates from Trinity College Dublin. Michael's undergraduate degree was in Physics and Astrophysics, and he is moving on to start an MSc in High Performance Computing. Evan has just completed his degree in Theoretical Physics and will be starting his MSc in Theoretical Modelling at King's College London in September.

Both students applied for a summer internship with ICHEC (Irish Centre for High End Computing). ICHEC provides an interface between academic/research institutions and industry to improve business productivity and international competitiveness for Ireland. We are happy and excited to collaborate with ICHEC by sponsoring 2 students to work on HPCC Systems® projects as part of their summer scholarship program.

Evan has been getting familiar with the HPCC Systems® Visualization Framework as an end user, which has put him in a good position to improve the existing documentation from a user’s perspective. Evan’s project is a perfect example of how the internship offered a real opportunity to learn something completely new and then move on to produce something worthwhile. The ultimate goal was to integrate the visualizations into the Eclipse plugin, which, despite the steep learning curve, he managed to do. As originally planned, Evan has now teamed up with Michael Tierney, who has been getting up to speed with the Eclipse architecture and how plugins work in preparation for adding support to Eclipse for the HIPIE language (HPCC Integrated Plug In Engine). Michael has already produced an impressive working demo which has got a few people around here very excited!

The first screenshot shows an example of the HIPIE language within Eclipse (Michael’s work) and the second shows a live preview of the generated visualizations (Evan’s contribution). Both are displayed alongside each other on the screen, but for the purposes of this blog I have split them so you can clearly see what is shown:

New tools and documentation have become available since we created our ECL plugin for Eclipse, and the mentor of this project, Gordon Smith, decided that it would be interesting to see what approach Michael would take to develop this project from scratch, based on the new tools available to him. He chose a very different approach which has worked really well, and he is on target to produce an exciting and useful tool for HIPIE users.

Work is continuing on this project until the end of August.

Notes:
1. For more information about ICHEC, visit their website: https://www.ichec.ie/

2. Read Michael Tierney's and Evan Sheridan's blogs to find out more details about their projects and experience.

3. Read more about HIPIE here: http://hpccsystems.com/blog/hipie-101-anatomy-of-a-dude-file

Quantile 2 - Test cases

When adding new features to the system, or changing the code generator, the first step is often to write some ECL test cases. They have proved very useful for several reasons:

  • Developing the test cases can help clarify issues and other details that the implementation needs to take into account (e.g., what happens if the input dataset is empty?).
  • They provide something concrete to aim towards when implementing the feature.
  • They provide a set of milestones to show progress.
  • They can be used to check the implementation on the different engines.

As part of the design discussion we also started to create a list of useful test cases (they follow below in the order they were discussed). The tests serve different purposes. Some check that the core functionality works correctly, while others cover unusual situations and strange boundary cases. The tests are not exhaustive, but they are a good starting point and new tests can be added as the implementation progresses.

The following is the list of tests that should be created as part of implementing this activity:

  1. Compare with values extracted from a SORT.
    Useful to check the implementation, but also to ensure we clearly define which results we are expecting.
  2. QUANTILE with number-of-ranges set to 1, 0, and a very large number. Should also test that the number of ranges can be dynamic as well as constant.
  3. Empty dataset as input.
  4. All input entries are duplicates.
  5. Dataset smaller than number of ranges.
  6. Input sorted and reverse sorted.
  7. Normal data with small number of entries.
  8. Duplicates in the input dataset that cause empty ranges.
  9. Random distribution of numbers without duplicates.
  10. Local and grouped cases.
  11. SKEW that fails.
  12. Test scoring functions.
  13. Testing different skews that work on the same dataset.
  14. An example that uses all the keywords.
  15. Examples that do and do not have extra fields not included in the sort order. (Check that the unstable flag is correctly deduced.)
  16. Globally partitioned already (e.g., globally sorted). All partition points on a single node.
  17. Apply quantile to a dataset, and also to the same dataset that has been reordered/distributed. Check the resulting quantiles are the same.
  18. Calculate just the 5 and 95 centiles from a dataset.
  19. Check a non constant number of splits (and also in a child query where it depends on the parent row).
  20. A transform that does something interesting to the sort order. (Check any order is tracked correctly.)
  21. Check the counts are correct for grouped and local operations.
  22. Call in a child query with options that depend on the parent row (e.g., num partitions).
  23. Split points that fall in the middle of two items.
  24. No input rows and DEDUP attribute specified.

Ideally any test cases for features should be included in the runtime regression suite, which is found in the testing/regress directory in the github repository. Tests that check invalid syntax should go in the compiler regression suite (ecl/regress). Commit https://github.com/ghalliday/HPCC-Platform/commit/d75e6b40e3503f85126567... contains the test cases so far. Note, the test examples in that commit do not yet cover all the cases above. Before the final pull request for the feature is merged the list above should be revisited and the test suite extended to include any missing tests.

In practice it may be easier to write the test cases in parallel with implementing the parser - since that allows you to check their syntax. Some of the examples in the commit were created before work was started on the parser, others during, and some while implementing the feature itself.
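To give a flavour of what these test cases look like, here is a minimal sketch along the lines of case 1 above, using the QUANTILE syntax from the design discussion. The record layout and values are invented purely for illustration, and the exact syntax may still change as the feature evolves:

    rec := RECORD
        UNSIGNED4 id;
    END;

    ds := DATASET([{10}, {3}, {7}, {1}, {9}, {4}, {6}], rec);

    // Splitting into 2 ranges should return a single boundary row - the median.
    medianViaQuantile := QUANTILE(ds, 2, {id});

    // The same value selected explicitly from a SORT, for comparison.
    sorted := SORT(ds, id);
    medianViaSort := sorted[(COUNT(ds) + 1) DIV 2];

    OUTPUT(medianViaQuantile);
    OUTPUT(medianViaSort);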

Celebrating the 4th anniversary of the HPCC Systems® Open Source Project

A lot has happened in the last year which gives us good reason to celebrate this anniversary with great enthusiasm. Here are some highlights, from the many to be proud of, that I want to share with you…

In the last year, we have built on the 5.x.x series of releases, completing many new features and improvements, including extending the list of supported third-party languages, plugins and libraries to include Cassandra, Redis and memcached. In addition, the reading, writing and spraying of JSON files is now supported. We’ve also integrated monitoring and alerting capabilities, so you can keep a check on the health of your HPCC System and increase operational efficiency by pre-empting hardware issues, and the ECL Watch facelift is almost complete. We have continued to extend and improve performance across our Roxie and Thor clusters.

In April this year, we launched a range of badges for use by our community members who leverage HPCC Systems®. High resolution versions of the badge most suited for your use are available for approved users. For more information, please contact media@hpccsystems.com.

Collaborations, presentations and testimonials

HPCC Systems® team members have spoken at a number of industry events in the last year. Bob Foreman presented a tutorial on ‘Big Data Processing with Less Work and Less Code’ at the Big Data TechCon in Boston where we were also a sponsor in the Exhibit Hall and Jesse Shaw spoke at the Big Data and Business Analytics Symposium at Wayne State University.

Presentations at meetups and other events are being delivered with increasing frequency in a variety of locations including Atlanta, Florida, Silicon Valley, Oklahoma, New York and more, while we also presented at the Alfresco Webinar with Forrester Research.

We are now collaborating with more academic institutions than ever including Kennesaw State University, North Carolina State University and Georgia Tech.

We are extremely pleased and excited to become an accepted organisation for Google Summer of Code this year.

It feels great that HPCC Systems® can rightfully take its place alongside some of the biggest names in the open source world. Our 2 GSoC students are working away at the projects they selected and you can read more about that here.

Our internship program is also thriving. You can read about this year’s students and their projects here. In total, we have opened our doors to 6 students from around the world working on HPCC Systems® related projects over this summer.

So what does all this mean for us in terms of our market presence? Well, it can’t be a coincidence that after all this dedicated effort by the team, our social media presence is steadily growing, the number of registered users of our website has increased by one third, and more and more people are taking our training classes, both instructor-led and online. The number of downloads of the Virtual Machine has quadrupled compared with last year, which means that more people than ever are trying out HPCC Systems®. Users have also shared with us their experiences of how they have used HPCC Systems® in their research or to solve their Big Data problems. Find out what they have to say by listening to their testimonials.

Roll on to our 5th anniversary and the successes and achievements to come...

This really has been a great year worth celebrating! So what’s next?

HPCC Systems® is a vibrant, growing project that has its sights firmly set on the future so there is still much to do.

We are almost ready to graduate to the next generation of HPCC Systems® 6.0. But before we do, we have one last round of updates for you, so look out for the release of HPCC Systems® 5.4.0 over the summer. This release will include some code generator optimisations, some Thor performance improvements in the file services/child query handling area, Nagios integration into ECL Watch, and some init system improvements, as well as many fixes for issues and suggestions raised by you, our users. By late autumn, we will be ready to move on to HPCC Systems® 6.0, and we already have ideas waiting in the wings for HPCC Systems® 7.0.

So congratulations to the HPCC Systems® team and thank you to all our Partners, contributors and users. We couldn’t have done it without you!

Welcome to the HPCC Systems® Summer Interns 2015!

While this is the first year we have run a GSoC (Google Summer of Code) program, it is not the first year we have run an intern program. We have a number of affiliations with universities in the US, including Florida State, FAU, North Carolina State, Clemson, Georgia Tech and more, as well as University College London in the UK. Students have successfully completed projects for us, particularly in the Machine Learning area, including the coding of decision trees and random forests and the porting of logistic regression to PB-BLAS. We currently have a student working with us from Florida Atlantic University, who is developing multi-layer perceptrons, back propagation and deep learning.

Working with interns has been a good experience for us and is something we will continue to do. We are therefore pleased to be mentoring 2 students who will be working on HPCC Systems® projects this summer. Both projects originated as GSoC 2015 proposals, but since we did not have enough slots to accept them, we have included them in our summer intern program.

Machine Learning - CONCORD Algorithm
Syed Rahman is working on this Machine Learning project. Syed’s GSoC proposal was particularly interesting to us because it was an idea that he had developed himself to Implement High Dimensional Covariance Estimate Algorithms in ECL. Syed is studying for a PhD in Statistics at the University of Florida. The mentor for this project is John Holt who is one of the founders of the HPCC Systems Machine Learning Library. The CONCORD algorithm Syed has suggested will be a noteworthy addition to our ML Library adding real value. Correlations are extremely useful in the task of data analysis and working efficiently with high dimensional data is critical in many ML applications.

Syed has been preparing the way for successfully implementing this project by getting to grips with running the HPCC Systems® platform, learning ECL, as well as refining his development plan.

Code Generator - Child Queries
Anshu Ranjan will be working on the HPCC Systems® platform project Improve Child Query Processing. This project involves delving into the code generator, which is a highly specific and complex area. The mentor for this project is Gavin Halliday, who is the ‘keeper of the keys’ to the code generator, so Anshu will have access to the best guidance and knowledge possible. Anshu is studying for a PhD in Computer Engineering at the University of Florida.

This is an important project addressing some long-standing issues. It will help us to improve the speed and reduce the generated code size for complex queries that perform significant processing on child datasets. Anshu has been preparing for the coding period by improving his understanding of the platform and working on some of our online training courses.

Evaluations will be due for interns according to the same schedule as GSoC so look out for an update on progress and milestones achieved sometime in July.

Project ideas and contributions are welcome
Project ideas that didn’t make it either for GSoC or the summer intern program this year will be reviewed and may stay on the list for 2016. Other interesting new projects will also be added later this year. We are, of course, open to suggestions and requests via the HPCC Systems® Community Forums, or students may contact one of our mentors by email using the details supplied on our GSoC Wiki here: https://wiki.hpccsystems.com/display/hpcc/Mentor+list+and+testimonials.

As a result of both student programs, we hope to complete a few more projects of value to our open source community this year. Students are also potential new, young developers to add to the HPCC Systems® team in the future. We want to encourage them to stay in touch once they have completed their program with us. Mentors will also want to keep in touch with students from time to time keeping the communication links open, finding out how they are progressing with their studies and checking on their availability for further contributions.

HPCC Systems® is an open source project after all so we want to encourage contributions from outside our team. In all honesty, what can be better than attracting new, upcoming talent from the best universities and colleges!

Notes:
1. The HPCC Systems® Summer Internship runs for 10 weeks beginning at the start of June and ending the first week of August. For more information contact Molly O'Neal who administers the program.

2. For more information about contributing to the HPCC Systems® code base, go to the Contributions area on this website: http://hpccsystems.com/community/contributions.

3. If you want to dive right in and resolve an outstanding issue, go to the Community Issue Tracker (JIRA): https://track.hpccsystems.com/secure/Dashboard.jspa. Create yourself an account and search for issues with the Assignee field set to 'Available for Anyone' to get some contribution ideas. Either post your interest in the Comments section and a developer will get back to you, or email Lorraine Chapman.

Google Summer of Code 2015 (GSoC) – Let the coding begin!

We have been through the various preliminary stages of the GSoC process and now find ourselves at what is the most exciting part. The Community Bonding Period is over and now the real work has started. During the last month, the students have been getting to know their mentor and making sure they have everything they need to start coding. They have also had end of year examinations, so it has been a busy time all round.

As a first-time organisation, we were allocated 2 slots. We had a successful student proposal period, receiving 50 proposals across many of the projects on our GSoC Ideas List. As you can imagine, we had many more good proposals than slots, and the machine learning projects in particular were very popular. We certainly need more ideas in this area for next year!

Participating in GSoC provides the perfect opportunity to appeal to a large number of motivated students who are interested in coding and working on a team where their contribution matters not only to themselves but also to our project. We will certainly apply to be an accepted organisation again in 2016.

So how did we choose?

GSoC is viewed by Google as primarily a student learning experience, where students can work alongside real developers on a real project, learning good working practices in preparation for employment in the field after their studies. Obviously, everyone involved wants the projects to be successful. We want students to enjoy the experience while learning a lot. Students gain great confidence working on a successful project while also seeing their work integrated into a platform that is actively used in a business environment. So while looking at proposals, mentors considered the potential for success, including communication skills, the ability to listen and reasoning ability, as well as knowledge and experience. It was a truly collaborative effort by the HPCC Systems® platform team.

Our first decision was to allocate one slot to a machine learning project and the other to an HPCC Systems® platform project. Next we rated proposals, bearing in mind the factors I just mentioned as well as responses to comments and suggestions, which helped us to gain some idea of the level of interest and commitment. At this stage, the list had shortened considerably and we only had to make the difficult choice between the impressive proposals that were left. This makes it sound easy, but when you have a number of great proposals and 2 slots, it really isn’t easy at all! Nevertheless, the decision had to be made and here are the results.

Introducing the HPCC Systems GSoC 2015 Projects and Students

The first slot was allocated to the machine learning project Add new statistics to the Linear and Logistic Regression Module. Tim Humphrey is not only the mentor for this project, he also suggested it. He’s the custodian of the HPCC Systems® Machine Learning Library and is interested in extending the current capabilities of the Logistic and Linear Regression Module. By adding some performance statistics, users will be able to measure the efficiency of their model, making this a valuable contribution to the existing module.

Since Machine Learning is a complex area requiring in depth statistical knowledge and analysis, we needed someone with some experience who had also done their homework using the resources we had supplied to get to know our ML Library and the HPCC Systems® platform. The algorithms need to be written in ECL so the student would need to familiarise themselves with and understand the ECL language using our online learning material.

We accepted an excellent proposal from Sarthak Jain, who is studying for a Bachelor of Technology in Computer Engineering at Delhi Technological University. Sarthak has started working on this project and has already completed the statistics for the Sparse Linear Regression part of the work required. This is a great start!

We allocated the second slot to the Expand the HPCC Systems® Visualization Framework project. The mentor for this project is Gordon Smith who is the manager of the HPCC Systems® supercomputer clients and the principal developer of our ECL related tools. He has been working on the visualization framework for some time now alongside others and is well placed to guide and support Anmol Jagetia, who we accepted to complete this project. Anmol is studying for a Bachelor of Technology in Information Technology at the Indian Institute of Information Technology in Allahabad. As well as Anmol’s technical skills, we were particularly impressed by his eye for aesthetics which is vital to a project producing visual representations of data to users.

He has already made good progress learning about the code base. He moved quickly on to resolving an outstanding issue suggested by Gordon and has since started work on one of his long-term goals, which is to add Gantt charts to the framework. He has started a blog journal which provides an interesting account of his experience and work tasks; you can find it here: http://blog.anmoljagetia.me/gsoc-journal/.

Both GSoC students have hit the ground running and are off to a good start. In late June/early July, mentors and students must submit mid-term evaluations to Google. By this stage, the projects will be well underway and there’ll be more news to pass on via this blog.

We could really have done with more slots than we were allocated for GSoC, and we hope that if we are accepted as a returning organisation next year, we will be in a position to get the number of slots we need. There were a number of excellent proposals that we would have liked to accept, so while we had to turn them down for GSoC, we decided to convert 2 of them into projects suitable for our summer intern program. More on this to come….

Notes:
1. GSoC is run by Google and proposals can only be accepted via the Google Melange interface during the designated period indicated on the GSoC website for the year you are applying: https://www.google-melange.com/gsoc/homepage/google/gsoc2015

2. You can find the HPCC Systems GSoC Wiki and Ideas List here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Wiki

What does it take to add a new activity?

This series of blog posts started life as a set of walk-throughs and brainstorming sessions at a team offsite. The series will look at adding a new activity to the system. The idea is to give a walk-through of the work involved, to highlight the different areas that need changing, and hopefully to encourage others to add their own activities. In parallel with the description in this blog there is a series of commits to the github repository that correspond to the different stages in adding the activity. Once the blog is completed, the text will also be checked into the source control tree for future reference.

The new activity is going to be a QUANTILE activity, which can be used to find the records that split a dataset into equal sized blocks. Two common uses are to find the median of a set of data (split into 2) or percentiles (split into 100). It can also be used to split a dataset for distribution across the nodes in a system. One hope is that the classes used to implement quantile in Thor can also be used to improve the performance of the global sort operation.
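To make that concrete: splitting a sorted list of 8 values into 4 equal ranges would return the rows at (roughly) positions 2, 4 and 6 as the three internal boundary rows, while splitting into 2 returns just the single middle row – the median. The exact treatment of rounding, ties and short inputs is part of the design discussion that follows.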

It may seem fatuous, but the first task in adding any activity to the system is to work out what that activity is going to do! You can approach this in an iterative manner - starting with a minimal set of functionality and adding options as you think of them - or start with a more complete initial design. We have used both approaches in the past to add capabilities to the HPCC system, but on this occasion we will be starting from a more complete design - the conclusion of our initial design discussion:

"What are the inputs, options and capabilities that might be useful in a QUANTILE activity?"

The discussion produced the following items:

  • Which dataset is being processed?
    This is always required and should be the first argument to the activity.
  • How many parts to split the dataset into?
    This is always required, so it should be the next argument to the activity.
  • Which fields are being used to order (and split) the dataset?
    Again this is always required, so the list of fields should follow the number of partitions.
  • Which fields are returned?
    Normally the input row, but often it would be useful for the output to include details of which quantile a row corresponds to. To allow this an optional transform could be passed the input row as LEFT and the quantile number as COUNTER.
  • How about first and last rows in the dataset?
    Sometimes it is also useful to know the first and last rows. Add flags to allow them to be optionally returned.
  • How do you cope with too few input rows (including an empty input)?
    After some discussion we decided that QUANTILE should always return the number of parts requested. If there were fewer items in the input they would be duplicated as appropriate. We should provide a DEDUP flag for the situations when that is not desired. If there is an empty dataset as input then the default (blank) row will be created.
  • Should all rows have the same weighting?
    Generally you want the same weighting for each row. However, if you are using QUANTILE to split your dataset, and the cost of the next operation depends on some feature of the row (e.g., the frequency of the firstname), then you may want to weight the rows differently.
  • What if we are only interested in the 5th and 95th centiles?
    We could optionally allow a set of values to be selected from the results.

There were also some implementation details concluded from the discussions:

  • How accurate should the results be?
    The simplest implementation of QUANTILE (sort and then select the correct rows) will always produce accurate results. However, there may be some implementations that can produce an approximate answer more quickly. Therefore we could add a SKEW attribute to allow early termination.
  • Does the implementation need to be stable?
    In other words, if there are rows with identical values for the ordering fields, but other fields not part of the ordering with different values, does it matter which of those rows are returned? Does the relative order within those matching rows matter?
    The general principle in the HPCC system is that sort operations should be stable, and that where possible activities return consistent, reproducible results. However, that often has a cost - either in performance or memory consumption. The design discussion highlighted the fact that if all the fields from the row are included in the sort order then the relative order does not matter because the duplicate rows will be indistinguishable. (This is also true for sorts, and following the discussion an optimization was added to 5.2 to take advantage of this.) For the QUANTILE activity we will add an ECL flag, but the code generator should also aim to spot this automatically.
  • Returning counts of the numbers in each quantile might be interesting.
    This has little value when the results are exact, but may be more useful when a SKEW is specified to allow an approximate answer, or if a dataset might have a vast number of duplicates. It is possibly something to add to a future version of the activity. For an approximate answer, calculating the counts is likely to add an additional cost to the implementation, so the target engine should be informed if this is required.
  • Is the output always sorted by the partition fields?
    If this naturally falls out of the implementations then it would be worth including it in the specification. Initially we will assume not, but will revisit after it has been implemented.

After all the discussions we arrived at the following syntax:


QUANTILE(<dataset>, <number-of-ranges>, { sort-order } [, <transform>(LEFT, COUNTER)]
[,FIRST][,LAST][,SKEW(<n>)][,UNSTABLE][,SCORE(<score>)][,RANGE(set)][,DEDUP][,LOCAL])

FIRST - Match the first row in the input dataset (as quantile 0)
LAST - Match the last row in the input dataset (as quantile <number-of-ranges>)
SKEW - The maximum deviation from the correct results allowed. Defaults to 0.
UNSTABLE - Is the order of the original input values unimportant?
SCORE - What weighting should be applied for each row. Defaults to 1.
RANGE - Which quantiles should actually be returned. (Defaults to ALL).
DEDUP - Avoid returning a match for an input row more than once.
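
To make the syntax concrete, here is a hedged sketch of how a call might eventually look. The record layout, logical file name and option values are invented for illustration, and the final syntax may differ in detail:

    namesRec := RECORD
        STRING20  surname;
        UNSIGNED4 age;
    END;

    people := DATASET('~demo::people', namesRec, THOR);

    // Output record that also carries the quantile number supplied via COUNTER.
    partRec := RECORD(namesRec)
        UNSIGNED4 part;
    END;

    partRec addPart(namesRec L, UNSIGNED C) := TRANSFORM
        SELF.part := C;
        SELF := L;
    END;

    // Just the 5th and 95th centiles by age, allowing a small amount of skew.
    centiles := QUANTILE(people, 100, {age}, addPart(LEFT, COUNTER),
                         RANGE([5, 95]), SKEW(0.01));
    OUTPUT(centiles);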

We also summarised a few implementation details:

  • The activity needs to be available in GLOBAL, LOCAL and GROUPED variants.
  • The code generator should derive UNSTABLE if no non-sort fields are returned.
  • Flags to indicate if a score/range is required.
  • Flag to indicate if a transform is required.

Finally, deciding on the name of the activity took almost as long as designing it!

The end result of this process was summarised in a JIRA issue: https://track.hpccsystems.com/browse/HPCC-12267, which contains details of the desired syntax and semantics. It also contains some details of the next blog topic - test cases.

Incidentally, a question that arose from the design discussion was "What ECL can we use if we want to annotate a dataset with partition points?". Ideally the user needs a join activity which walks through a table of rows and matches against the first row that contains key values less than or equal to the values in the search row. There are other situations where that operation would also be useful. Our conclusion was that the system does not have a simple way to achieve that, and that it was a deficiency in the current system, so another JIRA was created (see https://track.hpccsystems.com/browse/HPCC-13016). This is often how the design discussions proceed, with discussions in one area leading to new ideas in another. Similarly, we concluded it would be useful to distribute rows in a dataset based on a partition (see https://track.hpccsystems.com/browse/HPCC-13260).

HIPIE 101 – Anatomy of a DUDE file

Note: This entry pertains to HIPIE which is an upcoming paid module for HPCC Systems Enterprise Edition Version.

As someone who was once introduced at an international programmers’ conference as a “nerd’s nerd”, it was very interesting to leave a conference with the business people looking happy and the technical folks looking confused. Yet this was the feat we achieved at the recent LexisNexis Visual Analytics Symposium. We were able to show that we have transformed a kick-ass big data platform into a high productivity visual analytics platform; but for those who know how HPCC Systems works, it was not at all clear how we had done it. The purpose of this blog series is to address some of that gap.

The hidden secret behind all of the magical looking visual tools is HIPIE – the HPCC Integrated Plug In Engine. At the most fundamental level, HIPIE exists to allow non-programmers to assemble a series of ECL macros into a working program. Critically, it requires the writer of the macro to describe the macro behavior in a detailed and specific manner. This description is referred to as a ‘contract’. The contract is written in DUDE. ‘Dude’ is not an acronym; dude is ‘hippie-speak’.

The rest of this blog entry is a primer on the structure of a DUDE file. Information on some of the individual pieces will come later.

The first part of a DUDE file is the meta-information: what the plugin is called, what it does, and who can do what to it. The plugin writer has total control over what they expose – and total responsibility for exposing it.

Next comes the INPUTS section:

The INPUTS section gets translated into an HTML form which will be used to extract information from the user. There are lots of opportunities for prompting, validation etc. The ‘FIELD’ verb above specifies that after the user has selected a DATASET – one of the data fields should be selected too. The INPUTS section will typically be used to gather the input parameters to the macro being called.

After INPUTS come OUTPUTS – and this is where the magic starts:

This particular plugin provides THREE OUTPUTS. The first of these (dsOutput) is the ‘real’ output and a number of things need to be observed:

  1. dsOutput(dsInput) means that the output is going to be the input file + all of the following fields
  2. ,APPEND says the macro appends columns to the file, but does not delete any rows or any columns and does not change any of them. HIPIE verifies this is true and errors if it is not
  3. PREFIX(INPUTS.Prefix) allows the user to specify an EXTRA prefix before parse_email. This allows a plugin to be used multiple times on the same underlying file.

The other two have the ‘: SIDE’ indicator. A major part of HIPIEism is the notion that even a process that is wrapped inside a contract ought to give LOTS of aggregative information out to show HOW well the black box performed. SIDE outputs can be thought of as ‘diagnostic tables’.

Next comes the piece that has everyone most excited:

Any output (although usually it will be a side effect) can have a visualization defined. A single visualization correlates to a page of a dashboard. Each line of the VISUALIZE corresponds to one widget on the screen. The definition defines the side effect being visualized and how the visualization should look in the dashboard. The same definition also shows how the dashboard should interact (see the SELECTS option).

Finally comes the GENERATES section – this may be a little intimidating – although really it is mainly ECL:

The way to think of this is:

  1. It all eventually has to be ECL
  2. %blarg% means that ‘blarg’ was a variable used in the input section and whatever was filled in there is placed into the %blarg% before the ECL is produced.
  3. %^ means ‘HIPIE is going to do something a little weird with this label’. In the case of ^e it generates an ‘EXPORT’ but also ensures that the label provided is unique between plugins

In summary – a HIPIE plugin is defined by a DUDE file. The DUDE file has five sections:

  • META DATA – what does the plugin do / who can use it
  • INPUTS – what the plugin user must tell me to enable me to execute
  • OUTPUTS – information about the data I put out (including side-effects)
  • VISUALIZE – what is a good way to view my side effects
  • GENERATES – a template to generate the ECL that constitutes the ‘guts’ of the plugin

In the next blog entry, we will answer the question: how do you drive the interactive dashboard (VISUALIZE)?

Definition dependencies

As your body of ECL code grows it gets harder to track the dependencies between the different ECL definitions (or source files). Providing more information about the dependencies between those definitions makes it easier to understand the structure of the ECL code, and also gives you a better understanding of what queries would be affected by changing a particular definition. (i.e., If I change this, what am I going to break?)

Version 5.2 has a new option to make that information available for each query that is run. When the option is enabled, a new entry will appear on the Helpers tab for the workunit – a link to an XML file containing all the dependencies. (Note that the dependencies are gathered at parse time – so they will include any definition that is processed in order to parse the ECL, even if code from that definition is not actually included in the generated query.)

To generate the new information, set the debug option 'exportDependencies' in the debug options for the workunit. To enable this feature for all workunits (and gain dependencies for all of them), you can add it to the eclserver default options. (A #option is not sufficient because the option needs to be processed before parsing commences.)
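For example, assuming the usual -f mechanism for passing debug options through to a workunit from the command line tools (a sketch to adapt to however you normally submit queries):

    ecl run thor myquery.ecl -fexportDependencies=1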

This information gives us the option of adding dependency graphs, and searches for all workunits that use a particular attribute, to future versions of ECL Watch. Of course, the same information could also be used now by a user tool or other third-party extension…
