Quantile 2 - Test cases

When adding new features to the system, or changing the code generator, the first step is often to write some ECL test cases. They have proved very useful for several reasons:

  • Developing the test cases can help clarify issues and other details that the implementation needs to take into account. (E.g., what happens if the input dataset is empty?)
  • They provide something concrete to aim towards when implementing the feature.
  • They provide a set of milestones to show progress.
  • They can be used to check the implementation on the different engines.

As part of the design discussion we also started to create a list of useful test cases (they follow below in the order they were discussed). The tests perform varying functions. Some of the tests are checking that the core functionality works correctly, while others check unusual situations and that strange boundary cases are covered. The tests are not exhaustive, but they are a good starting point and new tests can be added as the implementation progresses.

The following is the list of tests that should be created as part of implementing this activity:

  1. Compare with values extracted from a SORT.
    Useful to check the implementation, but also to ensure we clearly define which results we are expecting. (A sketch of this test appears immediately after this list.)
  2. QUANTILE with a number-of-ranges = 1, 0, and a very large number. Should also test that the number of ranges can be dynamic as well as constant.
  3. Empty dataset as input.
  4. All input entries are duplicates.
  5. Dataset smaller than number of ranges.
  6. Input sorted and reverse sorted.
  7. Normal data with small number of entries.
  8. Duplicates in the input dataset that cause empty ranges.
  9. Random distribution of numbers without duplicates.
  10. Local and grouped cases.
  11. SKEW that fails.
  12. Test scoring functions.
  13. Testing different skews that work on the same dataset.
  14. An example that uses all the keywords.
  15. Examples that do and do not have extra fields not included in the sort order. (Check that the unstable flag is correctly deduced.)
  16. Globally partitioned already (e.g., globally sorted). All partition points on a single node.
  17. Apply quantile to a dataset, and also to the same dataset that has been reordered/distributed. Check the resulting quantiles are the same.
  18. Calculate just the 5 and 95 centiles from a dataset.
  19. Check a non-constant number of splits (and also in a child query where it depends on the parent row).
  20. A transform that does something interesting to the sort order. (Check any order is tracked correctly.)
  21. Check the counts are correct for grouped and local operations.
  22. Call in a child query with options that depend on the parent row (e.g., num partitions).
  23. Split points that fall in the middle of two items.
  24. No input rows and DEDUP attribute specified.
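
To make the first test in the list concrete, here is a minimal sketch of the kind of comparison intended, written against the proposed QUANTILE syntax. The record layout and values are invented for illustration, and the real tests live in testing/regress:

r := { UNSIGNED1 id };
ds := DATASET([{9},{1},{7},{3},{5},{2},{8},{4},{6},{10}], r);

// The expected median, extracted explicitly from a SORT.
// This assumes the split point for two ranges is the row at position
// COUNT(ds) DIV 2 of the sorted input - pinning down exactly which row is
// expected is part of the point of this test.
inOrder := SORT(ds, id);
expectedMedian := inOrder[COUNT(ds) DIV 2];

// The same boundary row produced by QUANTILE with two ranges.
actualMedian := QUANTILE(ds, 2, { id });

OUTPUT(expectedMedian);
OUTPUT(actualMedian);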

Ideally any test cases for features should be included in the runtime regression suite, which is found in the testing/regress directory in the github repository. Tests that check invalid syntax should go in the compiler regression suite (ecl/regress). Commit https://github.com/ghalliday/HPCC-Platform/commit/d75e6b40e3503f85126567... contains the test cases so far. Note, the test examples in that commit do not yet cover all the cases above. Before the final pull request for the feature is merged the list above should be revisited and the test suite extended to include any missing tests.

In practice it may be easier to write the test cases in parallel with implementing the parser - since that allows you to check their syntax. Some of the examples in the commit were created before work was started on the parser, others during, and some while implementing the feature itself.

Celebrating the 4th anniversary of the HPCC Systems® Open Source Project

A lot has happened in the last year which gives us good reason to celebrate this anniversary with great enthusiasm. Here are some highlights, from the many to be proud of, that I want to share with you…

In the last year, we have built on the 5.x.x series of releases, completing many new features and improvements, including extending the list of supported third party languages, plugins and libraries to include Cassandra, redis and memcached. In addition to this, the reading, writing and spraying of JSON files is now also supported. We’ve integrated monitoring and alerting capabilities so you can keep a check on the health of your HPCC System and increase operational efficiency by pre-empting hardware issues, and the ECL Watch facelift is almost complete. We have also continued to extend and improve performance across our Roxie and Thor clusters.

In April this year, we launched a range of badges for use by our community members who leverage HPCC Systems®. High resolution versions of the badge most suited for your use are available for approved users. For more information, please contact media@hpccsystems.com.

Collaborations, presentations and testimonials

HPCC Systems® team members have spoken at a number of industry events in the last year. Bob Foreman presented a tutorial on ‘Big Data Processing with Less Work and Less Code’ at the Big Data TechCon in Boston, where we were also a sponsor in the Exhibit Hall, and Jesse Shaw spoke at the Big Data and Business Analytics Symposium at Wayne State University.

Presentations at meetups and other events are being delivered with increasing frequency in a variety of locations including Atlanta, Florida, Silicon Valley, Oklahoma, New York and more, while we also presented at the Alfresco Webinar with Forrester Research.

We are now collaborating with more academic institutions than ever including Kennesaw State University, North Carolina State University and Georgia Tech.

We are extremely pleased and excited to become an accepted organisation for Google Summer of Code this year.

It feels great that HPCC Systems® can rightfully take its place alongside some of the biggest names in the open source world. Our 2 GSoC students are working away at the projects they selected and you can read more about that here.

Our internship program is also thriving. You can read about this year’s students and their projects here. In total, we have opened our doors to 6 students from around the world working on HPCC Systems® related projects over this summer.

So what does all this mean for us in terms of our market presence? Well, it can’t be a coincidence that after all this dedicated effort by the team, our social media presence is steadily growing, the number of registered users of our website has increased by one third and we are seeing more and more people taking our training classes both instructor led and online. The number of downloads of the Virtual Machine this year in comparison with last year has quadrupled which means that more people than ever are trying out HPCC Systems®. Users have also shared with us their experiences of how they have used HPCC Systems® in their research or to solve their Big Data problems. Find out what they have to say by listening to their testimonials.

Roll on to our 5th anniversary and the successes and achievements to come...

This really has been a great year worth celebrating! So what’s next?

HPCC Systems® is a vibrant, growing project that has its sights firmly set on the future so there is still much to do.

We are almost ready to graduate to the next generation of HPCC Systems® 6.0. But before we do, we have one last round of updates for you, so look out for the release of HPCC Systems® 5.4.0 over the summer. This release will include some code generator optimisations, some Thor performance improvements in the files services/child query handling area, Nagios integration into ECL Watch, some init system improvements as well as many fixes for issues and suggestions raised by you, our users. By late autumn, we will be ready to move on to HPCC Systems® 6.0 and we already have ideas waiting in the wings for HPCC Systems® 7.0.

So congratulations to the HPCC Systems® team and thank you to all our Partners, contributors and users. We couldn’t have done it without you!

Welcome to the HPCC Systems® Summer Interns 2015!

While this is the first year we have run a GSoC (Google Summer of Code) program, it is not the first year we have run an intern program. We have a number of affiliations with universities in the US, including Florida State, FAU, North Carolina State, Clemson, Georgia Tech and more, as well as University College London in the UK. Students have successfully completed projects for us, particularly in the Machine Learning area, including the coding of decision trees and random forests and the porting of logistic regression to PB-BLAS. We currently have a student working with us from Florida Atlantic University, who is developing multi-layer perceptrons, back propagation and deep learning.

Working with interns has been a good experience for us and is something we will continue to do. We are therefore pleased to be mentoring 2 students who will be working on HPCC Systems® projects this summer. Both projects originated as GSoC 2015 proposals but since we did not have enough slots to accept them, we have included them in our summer intern program.

Machine Learning - CONCORD Algorithm
Syed Rahman is working on this Machine Learning project. Syed’s GSoC proposal was particularly interesting to us because it was an idea that he had developed himself to Implement High Dimensional Covariance Estimate Algorithms in ECL. Syed is studying for a PhD in Statistics at the University of Florida. The mentor for this project is John Holt who is one of the founders of the HPCC Systems Machine Learning Library. The CONCORD algorithm Syed has suggested will be a noteworthy addition to our ML Library adding real value. Correlations are extremely useful in the task of data analysis and working efficiently with high dimensional data is critical in many ML applications.

Syed has been preparing the way for successfully implementing this project by getting to grips with running the HPCC Systems® platform, learning ECL, as well as refining his development plan.

Code Generator - Child Queries
Anshu Ranjan will be working on the HPCC Systems® platform project Improve Child Query Processing. This project involves delving into the code generator which is a highly specific and complex area. The mentor for this project is Gavin Halliday who is the ‘keeper of the keys’ to the code generator, so Anshu will have access to the best guidance and knowledge possible. Anshu is studying for a PhD in Computing Engineering at the University of Florida.

This is an important project addressing some long standing issues that will help us to improve the speed and reduce the generated code size for complex queries that perform significant processing on child datasets. Anshu has been preparing for the coding period by improving his understanding of the platform and working on some of our online training courses.

Evaluations will be due for interns according to the same schedule as GSoC so look out for an update on progress and milestones achieved sometime in July.

Project ideas and contributions are welcome
Project ideas that didn’t make it either for GSoC or the summer intern program this year will be reviewed and may stay on the list for 2016. Other new interesting projects will also be added later this year. We are, of course, open to suggestions and requests via the HPCC Systems® Community Forums or students may contact one of our mentors by email using the details supplied on our GSoC Wiki here: https://wiki.hpccsystems.com/display/hpcc/Mentor+list+and+testimonials.

As a result of both student programs, we hope to complete a few more projects of value to our open source community this year. Students are also potential new, young developers to add to the HPCC Systems® team in the future. We want to encourage them to stay in touch once they have completed their program with us. Mentors will also want to keep in touch with students from time to time keeping the communication links open, finding out how they are progressing with their studies and checking on their availability for further contributions.

HPCC Systems® is an open source project after all so we want to encourage contributions from outside our team. In all honesty, what can be better than attracting new, upcoming talent from the best universities and colleges!

Note:
1. The HPCC Systems® Summer Internship runs for 10 weeks beginning at the start of June and ending the first week of August. For more information contact Molly O'Neal who administers the program.

2. For more information about contributing to the HPCC Systems® code base, go to the Contributions area on this website: http://hpccsystems.com/community/contributions.

3. If you want to dive right in and resolve an outstanding issue, go to the Community Issue Tracker (JIRA): https://track.hpccsystems.com/secure/Dashboard.jspa. Create yourself an account and search for issues with the Assignee field set to Available for Anyone to get some contribution ideas. Either post your interest in the Comments section and a developer will get back to you, or email Lorraine Chapman.

Google Summer of Code 2015 (GSoC) – Let the coding begin!

We have been through the various preliminary stages of the GSoC process and now find ourselves at what is the most exciting part. The Community Bonding Period is over and now the real work has started. During the last month, the students have been getting to know their mentor and making sure they have everything they need to start coding. They have also had end of year examinations, so it has been a busy time all round.

As a first time organisation, we were allocated 2 slots. We had a successful student proposal period receiving 50 proposals across many of the projects on our GSoC Ideas List. As you can imagine, we had many more good proposals than slots and in particular, the machine learning projects were very popular. We certainly need more ideas in this area for next year!

Participating in GSoC provides the perfect opportunity to appeal to a large number of motivated students who are interested in coding and working on a team where their contribution matters not only to themselves but also to our project. We will certainly apply to be an accepted organisation again in 2016.

So how did we choose?

GSoC is viewed by Google as primarily a student learning experience, where they can work alongside real developers on a real project, learning good working practices in preparation for employment in the field post study. Obviously, everyone involved wants the projects to be successful. We want students to enjoy the experience while learning a lot. Students will gain great confidence working on a successful project while also seeing their work integrated into a platform that is actively used in a business environment. So while looking at proposals mentors considered the potential for success including communication skills, ability to listen, reasoning ability as well as knowledge and experience. It was a truly collaborative effort by the HPCC Systems® platform team.

Our first decision was to allocate one slot to a machine learning project and the other to an HPCC Systems® platform project. Next we rated proposals bearing in mind the factors I just mentioned as well as responses to comments and suggestions, which helped us to gain some idea of the level of interest and commitment. At this stage, the list had shortened considerably and we only had to make the difficult choice between the impressive proposals that were left. This makes it sound easy, but when you have a number of great proposals and 2 slots, it really isn’t easy at all! Nevertheless, the decision had to be made and here are the results.

Introducing the HPCC Systems GSoC 2015 Projects and Students

The first slot was allocated to the machine learning project Add new statistics to the Linear and Logistic Regression Module. Tim Humphrey is not only the mentor for this project, he also suggested it. He’s the custodian of the HPCC Systems® Machine Learning Library and is interested in extending the current capabilities of the Logistic and Linear Regression Module. By adding some performance statistics, users will be able to measure the efficiency of their model, making this a valuable contribution to the existing module.

Since Machine Learning is a complex area requiring in depth statistical knowledge and analysis, we needed someone with some experience who had also done their homework using the resources we had supplied to get to know our ML Library and the HPCC Systems® platform. The algorithms need to be written in ECL so the student would need to familiarise themselves with and understand the ECL language using our online learning material.

We accepted an excellent proposal from Sarthak Jain who is studying for a Bachelor of Technology in Computing Engineering at the Delhi Technological University. Sarthak has started working on this project and has already completed the statistics for the Sparse Linear Regression part of the work required. This is a great start!

We allocated the second slot to the Expand the HPCC Systems® Visualization Framework project. The mentor for this project is Gordon Smith who is the manager of the HPCC Systems® supercomputer clients and the principal developer of our ECL related tools. He has been working on the visualization framework for some time now alongside others and is well placed to guide and support Anmol Jagetia, who we accepted to complete this project. Anmol is studying for a Bachelor of Technology in Information Technology at the Indian Institute of Information Technology in Allahabad. As well as Anmol’s technical skills, we were particularly impressed by his eye for aesthetics which is vital to a project producing visual representations of data to users.

He has already made good progress learning about the code base. He moved quickly on to resolving an outstanding issue suggested by Gordon and has since started work on one of his long term goals, which is to add Gantt charts into the framework. He has started a blog journal which provides an interesting account of his experience and work tasks; you can find it here: http://blog.anmoljagetia.me/gsoc-journal/.

Both GSoC students have hit the ground running and are off to a good start. In late June/early July, mentors and students must submit mid-term evaluations to Google. By this stage, the projects will be well underway and there’ll be more news to pass on via this blog.

We could really have done with more slots than we were allocated for GSoC and we hope that if we are accepted as a returning organisation next year, we will be in a position to get the number of slots we need. There were a number of excellent proposals that we would have liked to accept so while we had to reject them for GSoC we decided to convert 2 of them into projects suitable for our summer intern program. More on this to come….

Notes:
1. GSoC is run by Google and proposals can only be accepted via the Google Melange interface during the designated period indicated on the GSoC website for the year you are applying: https://www.google-melange.com/gsoc/homepage/google/gsoc2015

2. You can find the HPCC Systems GSoC Wiki and Ideas List here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Wiki

What does it take to add a new activity?

This series of blog posts started life as a series of walk-throughs and brainstorming sessions at a team offsite. This series will look at adding a new activity to the system. The idea is to give a walk-through of the work involved, to highlight the different areas that need changing, and hopefully encourage others to add their own activities. In parallel with the description in this blog there is a series of commits to the github repository that correspond to the different stages in adding the activity. Once the blog is completed, the text will also be checked into the source control tree for future reference.

The new activity is going to be a QUANTILE activity, which can be used to find the records that split a dataset into equal sized blocks. Two common uses are to find the median of a set of data (split into 2) or percentiles (split into 100). It can also be used to split a dataset for distribution across the nodes in a system. One hope is that the classes used to implement quantile in Thor can also be used to improve the performance of the global sort operation.

It may seem fatuous, but the first task in adding any activity to the system is to work out what that activity is going to do! You can approach this in an iterative manner - starting with a minimal set of functionality and adding options as you think of them - or start with a more complete initial design. We have used both approaches in the past to add capabilities to the HPCC system, but on this occasion we will be starting from a more complete design - the conclusion of our initial design discussion:

"What are the inputs, options and capabilities that might be useful in a QUANTILE activity?"

The discussion produced the following items:

  • Which dataset is being processed?
    This is always required and should be the first argument to the activity.
  • How many parts to split the dataset into?
    This is always required, so it should be the next argument to the activity.
  • Which fields are being used to order (and split) the dataset?
    Again this is always required, so the list of fields should follow the number of partitions.
  • Which fields are returned?
    Normally the input row, but often it would be useful for the output to include details of which quantile a row corresponds to. To allow this an optional transform could be passed the input row as LEFT and the quantile number as COUNTER.
  • How about first and last rows in the dataset?
    Sometimes it is also useful to know the first and last rows. Add flags to allow them to be optionally returned.
  • How do you cope with too few input rows (including an empty input)?
    After some discussion we decided that QUANTILE should always return the number of parts requested. If there were fewer items in the input they would be duplicated as appropriate. We should provide a DEDUP flag for the situations when that is not desired. If there is an empty dataset as input then the default (blank) row will be created.
  • Should all rows have the same weighting?
    Generally you want the same weighting for each row. However, if you are using QUANTILE to split your dataset, and the cost of the next operation depends on some feature of the row (e.g., the frequency of the firstname), then you may want to weight the rows differently.
  • What if we are only interested in the 5th and 95th centiles?
    We could optionally allow a set of values to be selected from the results.

There were also some implementation details concluded from the discussions:

  • How accurate should the results be?
    The simplest implementation of QUANTILE (sort and then select the correct rows) will always produce accurate results. However, there may be some implementations that can produce an approximate answer more quickly. Therefore we could add a SKEW attribute to allow early termination.
  • Does the implementation need to be stable?
    In other words, if there are rows with identical values for the ordering fields, but other fields not part of the ordering with different values, does it matter which of those rows are returned? Does the relative order within those matching rows matter?
    The general principle in the HPCC system is that sort operations should be stable, and that where possible activities return consistent, reproducible results. However, that often has a cost - either in performance or memory consumption. The design discussion highlighted the fact that if all the fields from the row are included in the sort order then the relative order does not matter because the duplicate rows will be indistinguishable. (This is also true for sorts, and following the discussion an optimization was added to 5.2 to take advantage of this.) For the QUANTILE activity we will add an ECL flag, but the code generator should also aim to spot this automatically.
  • Returning counts of the numbers in each quantile might be interesting.
    This has little value when the results are exact, but may be more useful when a SKEW is specified to allow an approximate answer, or if a dataset might have a vast number of duplicates. It is possibly something to add to a future version of the activity. For an approximate answer, calculating the counts is likely to add an additional cost to the implementation, so the target engine should be informed if this is required.
  • Is the output always sorted by the partition fields?
    If this naturally falls out of the implementations then it would be worth including it in the specification. Initially we will assume not, but will revisit after it has been implemented.

After all the discussions we arrived at the following syntax:


QUANTILE(<dataset>, <number-of-ranges>, { sort-order } [, <transform>(LEFT, COUNTER)]
[,FIRST][,LAST][,SKEW(<n>)][,UNSTABLE][,SCORE(<score>)][,RANGE(<set>)][,DEDUP][,LOCAL])

FIRST - Match the first row in the input dataset (as quantile 0)
LAST - Match the last row in the input dataset (as quantile <number-of-ranges>)
SKEW - The maximum deviation from the correct results allowed. Defaults to 0.
UNSTABLE - Is the order of the original input values unimportant?
SCORE - What weighting should be applied for each row. Defaults to 1.
RANGE - Which quantiles should actually be returned. (Defaults to ALL).
DEDUP - Avoid returning a match for an input row more than once.
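
To give a feel for how these options are expected to combine, here is a hedged sketch based on the proposed syntax above; the dataset, field names and option values are invented for illustration, and the exact behaviour is subject to the implementation:

ageRec := { UNSIGNED2 age };
ages := DATASET([{23},{34},{45},{18},{67},{52},{29},{41},{38},{60}], ageRec);

// The median: the boundary row when the dataset is split into two ranges.
median := QUANTILE(ages, 2, { age });

// Quartile boundaries, annotated with the quantile number via a transform.
quartRec := RECORD
    UNSIGNED2 age;
    UNSIGNED4 quartile;
END;
quartRec markQuartile(ageRec l, UNSIGNED c) := TRANSFORM
    SELF.quartile := c;
    SELF := l;
END;
quartiles := QUANTILE(ages, 4, { age }, markQuartile(LEFT, COUNTER));

// Just the 5th and 95th centiles, tolerating a small amount of skew.
tails := QUANTILE(ages, 100, { age }, SKEW(0.01), RANGE([5, 95]));

OUTPUT(median);
OUTPUT(quartiles);
OUTPUT(tails);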

We also summarised a few implementation details:

  • The activity needs to be available in GLOBAL, LOCAL and GROUPED variants.
  • The code generator should derive UNSTABLE if no non-sort fields are returned.
  • Flags to indicate if a score/range is required.
  • Flag to indicate if a transform is required.

Finally, deciding on the name of the activity took almost as long as designing it!

The end result of this process was summarised in a JIRA issue: https://track.hpccsystems.com/browse/HPCC-12267, which contains details of the desired syntax and semantics. It also contains some details of the next blog topic - test cases.

Incidentally, a question that arose from the design discussion was "What ECL can we use if we want to annotate a dataset with partition points?". Ideally the user needs a join activity which walks through a table of rows, and matches against the first row that contains key values less than or equal to the values in the search row. There are other situations where that operation would also be useful. Our conclusion was that the system does not have a simple way to achieve that, and that it was a deficiency in the current system, so another JIRA was created (see https://track.hpccsystems.com/browse/HPCC-13016). This is often how the design discussions proceed, with discussions in one area leading to new ideas in another. Similarly we concluded it would be useful to distribute rows in a dataset based on a partition (see https://track.hpccsystems.com/browse/HPCC-13260).

HIPIE 101 – Anatomy of a DUDE file

Note: This entry pertains to HIPIE which is an upcoming paid module for HPCC Systems Enterprise Edition Version.

As someone who was once introduced at an international programmers conference as a “nerd’s nerd”, it was very interesting to leave a conference with the business people looking happy and the technical folks looking confused. Yet this was the feat that we achieved at the recent LexisNexis Visual Analytics Symposium. We were able to show that we have transformed a kick-ass big data platform into a high productivity visual analytics platform; but for those that know how HPCC Systems works, it was not at all clear how we had done it. The purpose of this blog series is to address some of that gap.

The hidden secret behind all of the magical looking visual tools is HIPIE – the HPCC Integrated Plug In Engine. At a most fundamental level, HIPIE exists to allow non-programmers to assemble a series of ECL Macros into a working program. Critical to this is that it requires the writer of the macro to describe the macro's behavior in a detailed and specific manner. This description is referred to as a ‘contract’. This contract is written in DUDE. ‘Dude’ is not an acronym; dude is ‘hippie-speak’.

The rest of this blog entry is a primer on the structure of a DUDE file. Information on some of the individual pieces will come later.

The first part of a DUDE file is the meta-information; what the plugin is called, what it does and who can do what to it. The plugin writer has total control over what they expose – and total responsibility for exposing it.

Next comes the INPUTS section:

The INPUTS section gets translated into an HTML form which will be used to extract information from the user. There are lots of opportunities for prompting, validation etc. The ‘FIELD’ verb above specifies that after the user has selected a DATASET – one of the data fields should be selected too. The INPUTS section will typically be used to gather the input parameters to the macro being called.

After INPUTS come OUTPUTS – and this is where the magic starts:

This particular plugin provides THREE OUTPUTS. The first of these (dsOutput) is the ‘real’ output and a number of things need to be observed:

  1. dsOutput(dsInput) means that the output is going to be the input file + all of the following fields
  2. ,APPEND says the macro appends columns to the file, but does not delete any rows or any columns and does not change any of them. HIPIE verifies this is true and errors if it is not
  3. PREFIX(INPUTS.Prefix) allows the user to specify an EXTRA prefix before parse_email. This allows a plugin to be used multiple times on the same underlying file.

The other two have the ‘: SIDE’ indicator. A major part of HIPIEism is the notion that even a process that is wrapped inside a contract ought to give LOTS of aggregative information out to show HOW well the black box performed. SIDE outputs can be thought of as ‘diagnostic tables’.

Next comes the piece that has everyone most excited:

Any output (although usually it will be a side effect) can have a visualization defined. A single visualization correlates to a page of a dashboard. Each line of the VISUALIZE corresponds to one widget on the screen. The definition defines the side effect being visualized and how the visualization should look in the dashboard. The same definition also shows how the dashboard should interact (see the SELECTS option).

Finally comes the GENERATES section – this may be a little intimidating – although really it is mainly ECL:

The way to think of this is:

  1. It all eventually has to be ECL
  2. %blarg% means that ‘blarg’ was a variable used in the input section and whatever was filled in there is placed into the %blarg% before the ECL is produced.
  3. %^ means ‘HIPIE is going to do something a little weird with this label’. In the case of ^e it generates an ‘EXPORT’ but also ensures that the label provided is unique between plugins

In summary – a HIPIE plugin is defined by a DUDE file. The DUDE file has five sections:

  • META DATA – what does the plugin do / who can use it
  • INPUTS – what the plugin user must tell me to enable me to execute
  • OUTPUTS – information about the data I put out (including side-effects)
  • VISUALIZE – what is a good way to view my side effects
  • GENERATES – a template to generate the ECL that constitutes the ‘guts’ of the plugin

In the next blog entry, we will answer the question: how do you drive the interactive dashboard (VISUALIZE)?

Definition dependencies

As your body of ECL code grows it gets harder to track the dependencies between the different ECL definitions (or source files). Providing more information about the dependencies between those definitions makes it easier to understand the structure of the ECL code, and also gives you a better understanding of what queries would be affected by changing a particular definition. (i.e., If I change this, what am I going to break?)

Version 5.2 has a new option to allow that information to be available for each query that is run. When the option is enabled a new entry will appear on the helpers tab for the work unit - a link to an xml file containing all the dependencies. (Note the dependencies are gathered at parse time – so they will include any definition that is processed in order to parse the ECL – even if code from that definition is not actually included in the generated query.)

To generate the new information, set the debug option 'exportDependencies' in the debug options for the workunit. To enable this feature on all workunits (and gain dependencies for all your workunits) you can add it to the eclserver default options. (A #option is not sufficient because it needs to be processed before parsing commences.)

This information opens up the possibility of adding dependency graphs, and searches for all workunits that use a particular attribute, to future versions of ECL Watch. Of course, the same information could also be used now by a user tool or other 3rd party extension…

Google Summer of Code (GSoC) student information

We are delighted to be a mentoring organisation for GSoC 2015!

Welcome to any students who are thinking of taking on an HPCC Systems® project for GSoC.

We have created a Forum especially for GSoC here: http://hpccsystems.com/bb/ where there is a post containing information to help you get started using HPCC Systems®. If you have any general questions or comments about GSoC, post on the GSoC Forum page or email the organisation mentors:

Trish McCall - trish.mccall@lexisnexis.com
Lorraine Chapman - Lorraine.Chapman@lexisnexis.com

Check out the HPCC Systems® GSoC wiki here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Wiki where you will find information about other ways to contact us or find out information about HPCC Systems® including resources accessible from this website, details of how to connect with us via social media channels, as well as some general guidance about GSoC.

Our GSoC Ideas List for 2015 is also located in the wiki here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Ideas+List. Each project has a description including specifications for deliverables and an email address for the mentor associated with the project.

Remember, we cannot accept proposals directly via email. You must enter your proposal using Google's Melange interface here: https://www.google-melange.com/gsoc/homepage/google/gsoc2015.

The application process opens on Monday 16th March and closes on Friday 27th March. So you have some time now to select a project and get answers to your questions from the mentor.

We look forward to receiving your proposal. Good luck!

What's coming in HPCC Systems® 5.2

We're almost ready to release a gold version of HPCC Systems® 5.2 in March. It's still undergoing testing at the moment, however, there is a release candidate (HPCC Systems® 5.2.0-rc3) available on the website now: http://hpccsystems.com/download/free-community-edition/server-platform/beta.

You can, of course, build your own from the sources here: https://github.com/hpcc-systems/HPCC-Platform.

Here is a sneak preview of what you can expect to see in terms of improvements and new features:

ECL Watch improvements including graph viewer and better graph stats

New ECL Watch continues to extend its coverage of features. The facelift has been extended to include the Topology area which has been simplified into a single user interface with a consistent approach to accessing the configuration information and log files. Viewing the HPCC Systems® components and the machines on which they are running will be easier and more accessible using the tree view approach that has been implemented.

You can now view your target clusters and all their supporting details much more easily:

Viewing the configuration file for a service is also easier allowing you to navigate between the different configuration files for all services from a single page:

You can also view the entire list of machines in your environment, finding out which processes are running on a node simply by clicking on it to display the processes in a tree view while retaining access to the rest of the machines in your environment:

The new topology area is a technical preview for 5.2 and we welcome any feedback from users that will help us to continue these improvements in HPCC Systems® 6.0 later this year.

Meanwhile, behind the scenes, some internal restructuring and refactoring has been done to enable all workunit statistics and timings to be gathered more efficiently and consistently while providing a SOAP interface to allow access to them. As a result of this work, we will be able to improve on the workunit statistics we already gather and provide tools and features to analyse these statistics and spot issues in ECL queries, in particular when using graphs. ECL Watch and the new graph viewer will provide the interface to display this information.

This leads us nicely into discussing the graph control. In previous versions of HPCC Systems®, the Graph Control relied on NPAPI support which is being phased out. The effect of this is that most modern browsers will eventually be unable to view graphs in ECL Watch. As a result, we are introducing a technical preview of our new 100% Javascript Graph Viewer.

There will be no separate installation process for the new Graph Viewer, it will simply run automatically within your browser. This also has security benefits since it will run without the need to activate or install any controls or plugins from potentially unknown sources. It will also be supported by most modern browsers removing the restrictions imposed by the withdrawal of NPAPI support. The existing ActiveX/NPAPI version will continue to be the default version with the Javascript variant being used if either there is no GraphControl installed or if a browser has dropped support for it (specifically NPAPI and Chrome on Linux). The goal is to switch completely to the new Javascript graph viewer for HPCC Systems® 6.0 (targeted for later this year).

The modernization of the graph viewer also provides us with the opportunity to improve the quality of information we can supply to you about your job via the graph. The improvements will make it possible to drill down to some of the more detailed and interesting areas of a graph such as, where time was spent, where the memory usage was the largest, significant skews across nodes including minimum and maximum levels, the ability to see the progress of disk reading activities that consume rows over time without output and many other useful statistics and indicators designed to help you interpret a graph more effectively and efficiently. You can expect to see a separate blog post about this later in the year as more stats become available and viewable from within ECL Watch.

Monitoring and Alerting - Ganglia and Nagios

HPCC Systems® 5.2 includes features which mean that Ganglia monitoring is now even easier to use. In addition to the ability to monitor Roxie metrics (available as of 5.0), Ganglia graphs are now viewable directly in ECL Watch through the Ganglia service plugin that is included as part of the HPCC Systems® Ganglia Monitoring 5.2 release. With scores of metrics being reported by Roxie (over 120 possible metrics are available), the ECL Watch Ganglia plugin provides some predefined HPCC centric graphs that are likely to be of most interest to operators and administrators.

This plugin provides seamless integration between ECL Watch and Ganglia monitoring including the ability to pull up additional graphs which may be customized based on individual needs. Configuration and customization is simple, which means that existing infrastructures already using Ganglia monitoring can quickly integrate with HPCC Systems®, and new environments can incorporate Ganglia Monitoring with minimal time and effort. So for example, you can set it up to report on various aspects of system health including disk problems, CPU usage, memory or network issues etc.

Once the Ganglia plugin is installed, the ECL Watch framework automatically places an additional icon in the toolbar providing easy access to the Ganglia monitoring metrics you have configured.

Moreover, with minimal effort, the JSON configuration file can be modified to display the graphs needed within your configuration.

We have also made progress integrating Nagios into HPCC Systems®. In 5.2, Nagios alerting may be used with HPCC Systems® to check not only that a component is running but also that it is working. You can generate Nagios configurations directly from the HPCC Systems® configuration file to ensure that your checks are specific to your HPCC Systems® environment.

The following image shows notifications for all hosts in an HPCC Systems® environment via Nagios:

This image shows the Service Status Details for all hosts in an HPCC Systems® environment:

While this facility is available from the command line in 5.2, by HPCC Systems® release 5.4 we expect to complete the process by integrating the alerts into ECL Watch so that they are visible via the UI in a similar way to that used to display the Ganglia monitoring information already mentioned.

Security improvement - dafilesrv authentication and encryption on transport

We know that security is a major issue for anyone using, manipulating and transporting data, so we have added an enhanced security measure for you to implement to make sure your HPCC Systems® environment is as secure as possible.

Processes read files on other machines in an HPCC Systems® environment using dafilesrv which is installed on all machines in an HPCC Systems® environment. The security around dafilesrv has been enhanced by providing a new DALI configuration option which takes a user ID and password and provides encryption in transport.

Additional embedded language - Cassandra

Carrying on from the new features in HPCC Systems® 5.0 which provided the ability to embed external database calls in ECL, we have added an additional database plugin in HPCC Systems® 5.2, this time to embed queries to Cassandra.

Just as with the MySQL plugin, it is possible to read and write data from an external Cassandra datastore using CQL (the Cassandra Query Language) embedded inside your ECL code, complete with streaming of data into and out of the Cassandra datastore. There is a blog about the Uses, abuses and internals of the EMBED feature here: http://hpccsystems.com/bb/viewtopic.php?f=41&t=1509&sid=10f8c3890e1dfe90...) which contains usage examples and information for currently supported embedded languages including Cassandra.
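
As a flavour of what embedded CQL looks like, here is a minimal sketch; the server, keyspace, table and column names are all invented, so check the plugin documentation and the blog linked above for the exact options supported:

IMPORT cassandra;

childRec := RECORD
    STRING name;
    INTEGER4 value;
END;

// Stream rows from a Cassandra table into an ECL dataset.
// The server and keyspace values here are placeholders.
DATASET(childRec) readValues() := EMBED(cassandra : server('127.0.0.1'), keyspace('test'))
    SELECT name, value FROM tbl1;
ENDEMBED;

OUTPUT(readValues());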

New plugins - memcached and redis

We have also provided plugins for accessing the key value stores memcached and redis. These are not really embedded languages (there is no “query language” as such) but values can be set and retrieved by key simply by making calls to functions in the plugin.

We are also hoping to add support for the “publish/subscribe” option in redis to support a paradigm where the first query that needs a value can calculate it and store it to redis, while other queries just use the previously calculated value. Details of this are still being finalized, but the expectation is that we can use “pub/sub” to ensure that any queries that ask for the value WHILE the first query is calculating it will block until the value is available rather than repeating the calculation. This is particularly interesting when used as a cache in front of an expensive operation, for example when looking up a value via an external gateway.

New library - dmetaphone

We are pleased to be able to make the dmetaphone library widely available for the first time. This library allows strings to be converted to a form that allows them to be compared on the basis of what they sound like. It was previously only included in the HPCC Systems® Enterprise Edition because of licensing restrictions on the original source code. However, now that the source code has been placed in the public domain by its original author, we are able to include it in the Standard Library of the HPCC Systems® Community Edition. It’s very useful for performing “fuzzy” string compares, for example when looking up names where there are multiple alternate spellings. Documentation on the usage of this library has already been added into the HPCC Systems® 5.2 version of the Standard Library Reference.
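
Assuming the library surfaces through the Standard Library in the usual way (the exact module and function names are documented in the 5.2 Standard Library Reference), usage looks roughly like this:

IMPORT Std;

// Names that are spelled differently but sound alike produce matching codes.
OUTPUT(Std.Metaphone.Primary('Smith'));
OUTPUT(Std.Metaphone.Primary('Smythe'));
// Double returns both the primary and secondary encodings.
OUTPUT(Std.Metaphone.Double('Catherine'));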

ECL language features - JSON and web services

There are some new ECL Language features that may be of interest to you. Firstly, users who have JSON data or a service that requires data in JSON format may now use HPCC Systems® to process this data. The ECL language now supports the reading, writing and spraying of JSON files and can now also translate individual records to and from JSON format.
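
For example, translating individual records to and from JSON and writing a dataset out as a JSON file looks roughly like this (the record layout, values and file name are invented for illustration):

namesRec := RECORD
    STRING20 fname;
    STRING20 lname;
    UNSIGNED2 age;
END;

people := DATASET([{'Fred', 'Flintstone', 45},
                   {'Wilma', 'Flintstone', 43}], namesRec);

// Individual records to and from JSON text.
jsonText := TOJSON(people[1]);
backAgain := FROMJSON(namesRec, '{"fname": "Barney", "lname": "Rubble", "age": 44}');

// Writing a dataset out as a JSON file.
OUTPUT(people, , '~examples::people.json', JSON);

OUTPUT(jsonText);
OUTPUT(backAgain);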

Secondly, there are some new additions to ECL for web services which will control how a query looks as a published web service. For example, we have added keyword parameters to STORED. The form, WSDL and XSD are affected and the new format parameters allow several sub parameters to be set to control field width, height and the relative position of the stored field as follows:

string s1 := 'how now brown cow' : stored('s1',
    format(fieldwidth(40), fieldheight(10), sequence(20)));

fieldwidth – controls the width of the form field for this stored input.
fieldheight – controls the height of the form field for this stored input.
sequence – controls the relative position of the stored field in the form, WSDL and schema.

The new # function, #webservice, provides additional web service wide features, including parameters which allow you to explicitly list and order the STOREDs and provide help text and a description for the form as follows:

fields – allows you to explicitly list and order the storeds to be used for input. The default is all storeds, using “sequence” to determine order.
help – provides help text for the web service form.
description – provides descriptive text for the web service form.

string descrText := 'only the listed fields should show in form, xsd, and wsdl and in the given order';
string helpText := 'Enter some values and hit submit';
#webservice(fields('u1', 'i1', 'u2', 'i2', 's1'), help(helpText), description(descrText));


Dynamic ESDL (Gold Release)

This product is designed to help you create robust ECL based web services with well-defined interfaces that are extensible over time. It uses the ESDL language to define web service interfaces, including support for versioning changes as the interface evolves. The initial release gave a snapshot of what this service can do in that it provides ECL developers with a contract for implementing the web services in the form of generated ECL code. Features have now been extended to include access to services via open standards like SOAP, REST and JSON, and stateless operation allows linear scaling of services.

Dynamic ESDL has been available to HPCC Systems® Enterprise Edition users for some time; however, we are pleased to announce that from 5.2 it will also be available to HPCC Systems® Community Edition users for the first time. The gold release includes an executable that allows users to dynamically configure and bind ESPs to ESDL interfaces.

JAPIs
The HPCC JAPIs project provides a set of JAVA based APIs which facilitate interaction with HPCC Web Services and C++ based tools. There’s no longer a need to set up your own Java networking logic in order to communicate with the HPCC Web Services, or concern yourself with the intricacies of a SOAP Envelope in order to submit an ECL workunit. Actuating HPCC WS methods is now as easy as instantiating a Platform object with the appropriate parameters, querying the specific HPCC WS client, and calling the corresponding method API with the appropriate values. Local ECL compilation is also possible by interfacing with a local installation of the eclcc executable, and if you’re working with RDF data, you can use the API to easily ingest your data into HPCC. The Java headers are available for download from GitHub: https://github.com/hpcc-systems/HPCC-JAPIs.

Enterprise Logging Service (Gold Release)

The Enterprise Logging Service provides a fault tolerant and scalable way of logging accounting and other transaction information with Dynamic ESDL and/or other custom HPCC front end services. It supports MySQL out of the box, but can be adapted to integrate any database on the backend. Persistent queues allow reliable storage of transaction information even when a component is not available.

The technical preview of this service was released alongside HPCC Systems® 5.0. For the gold version of this release, the main focus has been stabilisation and improvement of the internal functionality.

A new interface has been added to support passing log data in groups to the logging service. This new interface makes it easier to access the logging service from Dynamic ESDL, as well as other ESP style applications. Several deployment scripts have been added which support multiple logging agents, which can be added using the HPCC ConfigMgr. The performance of the logging service has been improved by several changes, such as the new filter function, the new GUID function and more.

Wrapping up...

So, as you can see, there are a lot of improvements and new features to enhance your HPCC Systems® experience with more to look forward to later in the year in HPCC Systems® 6.0.

Remember to check out the HPCC Systems® Red Book (available here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Red+Book) for important information that will help you to make a smooth transition as you upgrade.

If you are upgrading to HPCC Systems® 5.2 from a 4.x version, you will notice that the new improved ECL Watch is displayed by default. To help you find your way around, use the ECL Watch Transition Guide (https://wiki.hpccsystems.com/display/hpcc/HPCC+ECL+Watch+5.0+Transition+...). This wiki also includes a guide for users upgrading from an HPCC Systems® 4.x version (https://wiki.hpccsystems.com/display/hpcc/Quick+guide+for+users+upgradin...) designed to help you find the new location of features quickly and easily.

To report issues, use our community issue tracker found here: https://track.hpccsystems.com/secure/Dashboard.jspa.

Adventures in GraphLand VI (Coprophagy)

The other evening my family was watching some of our younger tortoises acclimatize to a new enclosure. One of them fell off of a log and face-planted into some recently dropped fecal matter. Far from being perturbed she immediately opened her mouth and started eating. My teenage son was entirely grossed and exclaimed: “Ew; she’s eating poop!” My wife, looking somewhat perplexed responded: “Yes, and that’s not even the good stuff!”

For those not familiar with reptile coprophagy; young tortoises will often eat droppings from older tortoises as it provides useful stomach fauna and a range of partially digested nutrients that the babies might not otherwise have access to. The older (and more senior) the tortoise; the higher the quality of the scat. In the world of baby tortoises: senior tortoise poop is definitely “the good stuff”.

The reason for this memorable if unpalatable introduction is to assert the notion that sometimes we need to ingest into Graphland data that contains useful information even if the current form is not immediately appealing. “The good stuff” often comes in a form suitable for re-digestion; not in the form suitable for knowledge engineering.

In case you are wondering if this is a reprise of ‘GraphLand V’ (dealing with dirty data), it isn’t. Even if this data is ‘the good stuff’ it may still take some significant work and planning to hammer it into a nice Entity Model.

Probably the commonest, and certainly most important case, is where the incoming data is in the form of multi-entity transactions. As a simplified but essentially real world example: this is how vehicle and person data usually turn up:

Each record represents a transaction. Two people are selling a single vehicle to two people. Each transaction therefore provides partial information regarding five different entities. There are also a number of implicit associations I might choose to derive: bought, sold and potentially ‘co-owns’ to represent that two people appeared on the same side of a transaction. The question is how do we take this rather messy looking data and bring it into KEL?

The first rule is that you do NOT construct your entity model based upon the data. You base your model upon how you wish you had the data. In fact we already have this entity model; so I’ll replicate it here with a couple of tweaks.

Hopefully the above will make sense to you; but just in case:

  1. We have a person entity, two properties – Name and Age. MODEL(*) forces every property to be single valued.
  2. We have a vehicle entity, two properties – Make and Colour
  3. We have a ‘CoTransacts’ association between two people
  4. We have an association from People to Vehicles for which the TYPE is defined in the association.

The question then becomes how do we extract the data from our big, wide transaction record into our entity model? We will start off by extracting the people. We have four to extract from one record. We will do this using the USE statement. You have seen the ‘USE’ statement already – but this one is going to be a little scarier:

  • First note that there are 4 Person sections to the USE clause. That is one for each person we are extracting.
  • The first of the Person clauses uses the field label override syntax to extract a particular did, name and age from the record.
  • The remaining three do exactly the same thing; but they use a piece of syntactic sugar to make life easier. If you say owner2_* then KEL will look for every default label in the entity prepended by owner2_.

Dumping the person entity we now get:

Note that all eight entities from the transaction are in the data (David and Kelly have a _recordcount of 2).

Bringing in the Vehicle is also easy; but it illustrates a useful point:

The field override syntax is used for the UID (otherwise it would expect a UID or ‘uid’ without the prefix). The other two fields (make and colour) are in the record with the default names so they do not need to be overridden. If you like typing; you can fill all of the fields in for completeness; but you don’t need to.

With the entities in, it is time to pull in the associations. CoTransacts first:

The override syntax to assign the unique IDs should be fairly comfortable by now. One thing that might surprise you is that I am using TWO associations for one relationship. I don’t have to do this – I can put one relationship in and walk both ways – but sometimes you want to do the above. We will tackle some of the subtler relationship types in a later blog. The above gives:

By now you should immediately spot that the two different instances of a relationship between 1&2 have been represented using __recordcount = 2.

Finally PerVeh:

This is one of those rare cases I am prepared to concede that late-typing an association is useful. We are almost certainly going to want to compare/contrast buy and sell transactions so giving them the same type is useful. So, when registering the relationships from a transaction, I use the ‘constant assignment’ form of the USE syntax to note that there are two buying and two selling relationships being created here. The result:

We have captured everything in the original transaction that is represented in our model. From each transaction record we produce four entity instances and eight association instances. We saw how common consistent naming can produce very succinct KEL (and the work around when the naming is hostile).

In closing I want to present a more complex model that keeps track of transaction dates. I am going to track both the dates over which people Cotransact and also when the buy-sell transactions happen. The association syntax IS quite a bit more exotic than the preceding; I’ll expound upon the details in a later blog.

Notes:

  • Only the ASSOCIATIONs changed
  • The ASSOCIATIONs now have a MODEL.
  • For CoTransacts this says that a given who/whoelse pair will have one (and only one) association of this type, and we keep track of all the transaction dates
  • For PerVeh we have one association for every Per/Veh pair. We then keep a table (called Transaction) detailing the Type and Date of each transaction

With this declaration and the previous data we get CoTransactions:

The two associations with two transactions now carry the date of the transaction. For PerVeh we get:

Many traditional data systems take one of three easy views of data structure: either they work on the data in the format it is in, they assume someone else has massaged the data into shape, or they assume data has no real shape.

Even if some of the details are a little fuzzy, and building a strong Entity Model is a non-trivial task, I hope that I have convinced you that in GraphLand you should not take the easy way out. Knowledge has structure and we need to define that structure (using ENTITY, ASSOCIATION and MODEL). If we have to USE data in a structure that is currently unpalatable; we have a digestive system that is able to do so.

Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)
