Blogs

HIPIE 101 – Anatomy of a DUDE file

Note: This entry pertains to HIPIE, which is an upcoming paid module for the HPCC Systems Enterprise Edition.

As someone who was once introduced at an international programmers conference as a “nerd’s nerd”, it was very interesting to leave a conference with the business people looking happy and the technical folks looking confused. Yet this was the feat that we achieved at the recent LexisNexis Visual Analytics Symposium. We were able to show that we have transformed a kick-ass big data platform into a high-productivity visual analytics platform; but for those that know how HPCC Systems works, it was not at all clear how we had done it. The purpose of this blog series is to address some of that gap.

The hidden secret behind all of the magical looking visual tools is HIPIE – the HPCC Integrated Plug In Engine. At the most fundamental level, HIPIE exists to allow non-programmers to assemble a series of ECL Macros into a working program. Critical to this is that it requires the writer of the macro to describe the macro’s behavior in a detailed and specific manner. This description is referred to as a ‘contract’. This contract is written in DUDE. ‘Dude’ is not an acronym; dude is ‘hippie-speak’.

The rest of this blog entry is a primer on the structure of a DUDE file. Information on some of the individual pieces will come later.

The first part of a DUDE file is the meta-information; what the plugin is called, what it does and who can do what to it. The plugin writer has total control over what they expose – and total responsibility for exposing it.

Next comes the INPUTS section:

The INPUTS section gets translated into an HTML form which will be used to extract information from the user. There are lots of opportunities for prompting, validation etc. The ‘FIELD’ verb above specifies that after the user has selected a DATASET – one of the data fields should be selected too. The INPUTS section will typically be used to gather the input parameters to the macro being called.

After INPUTS come OUTPUTS – and this is where the magic starts:

This particular plugin provides THREE OUTPUTS. The first of these (dsOutput) is the ‘real’ output and a number of things need to be observed:

  1. dsOutput(dsInput) means that the output is going to be the input file + all of the following fields
  2. ,APPEND says the macro appends columns to the file, but does not delete any rows or any columns and does not change any of them. HIPIE verifies this is true and errors if it is not
  3. PREFIX(INPUTS.Prefix) allows the user to specify an EXTRA prefix before parse_email. This allows a plugin to be used multiple times on the same underlying file.

The other two have the ‘: SIDE’ indicator. A major part of HIPIEism is the notion that even a process that is wrapped inside a contract ought to give LOTS of aggregative information out to show HOW well the black box performed. SIDE outputs can be thought of as ‘diagnostic tables’.

Next comes the piece that has everyone most excited:

Any output (although usually it will be a side effect) can have a visualization defined. A single visualization corresponds to a page of a dashboard, and each line of the VISUALIZE corresponds to one widget on the screen. The definition specifies the side effect being visualized and how the visualization should look in the dashboard; it also specifies how the dashboard should interact (see the SELECTS option).

Finally comes the GENERATES section – this may be a little intimidating – although really it is mainly ECL:

The way to think of this is:

  1. It all eventually has to be ECL
  2. %blarg% means that ‘blarg’ was a variable used in the INPUTS section, and whatever was filled in there is substituted for %blarg% before the ECL is produced.
  3. %^ means ‘HIPIE is going to do something a little weird with this label’. In the case of ^e, it generates an ‘EXPORT’ but also ensures that the label provided is unique between plugins (see the sketch below).
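
To make the substitution rules concrete, here is a loose, hypothetical illustration (this is not verbatim DUDE syntax, and the names are invented). Suppose the INPUTS section captured a field choice that ends up in %datafield%, and the GENERATES template contained the line:

%^eSummary% := TABLE(dsInput, {%datafield%, cnt := COUNT(GROUP)}, %datafield%);

If the user picked ‘state’ for that field, HIPIE would substitute the value and uniquify the ^e label, emitting ECL along the lines of:

EXPORT Summary_1 := TABLE(dsInput, {state, cnt := COUNT(GROUP)}, state);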

In summary – a HIPIE plugin is defined by a DUDE file. The DUDE file has five sections:

  • META DATA – what does the plugin do / who can use it
  • INPUTS – what the plugin user must tell me to enable me to execute
  • OUTPUTS – information about the data I put out (including side-effects)
  • VISUALIZE – what is a good way to view my side effects
  • GENERATES – a template to generate the ECL that constitutes the ‘guts’ of the plugin

In the next blog entry, we will answer the question: how do you drive the interactive dashboard (VISUALIZE)?

Definition dependencies

As your body of ECL code grows it gets harder to track the dependencies between the different ECL definitions (or source files). Providing more information about the dependencies between those definitions makes it easier to understand the structure of the ECL code, and also gives you a better understanding of what queries would be affected by changing a particular definition. (i.e., If I change this, what am I going to break?)

Version 5.2 has a new option to allow that information to be available for each query that is run. When the option is enabled, a new entry will appear on the Helpers tab for the workunit: a link to an XML file containing all the dependencies. (Note the dependencies are gathered at parse time – so they will include any definition that is processed in order to parse the ECL – even if code from that definition is not actually included in the generated query.)

To generate the new information, set the debug option 'exportDependencies' in the debug options for the workunit. To enable this feature (and gain dependencies) for all your workunits, you can add it to the eclserver default options. (A #option is not sufficient because the option needs to be processed before parsing commences.)

This information gives us the option of adding dependency graphs, and searches for all workunits that use a particular attribute, to future versions of ECL Watch. Of course, the same information could also be used now by a user tool or other third-party extension…

Google Summer of Code (GSoC) student information

We are delighted to be a mentoring organisation for GSoC 2015!

Welcome to any students who are thinking of taking on an HPCC Systems project for GSoC.

We have created a Forum especially for GSoC here: http://hpccsystems.com/bb/ where there is a post containing information to help you get started using HPCC Systems. If you have any general questions or comments about GSoC, post on the GSoC Forum page or email the organisation mentors:

Trish McCall - trish.mccall@lexisnexis.com
Lorraine Chapman - Lorraine.Chapman@lexisnexis.com

Check out the GSoC wiki here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Wiki where you will find information about other ways to contact us or find out information about HPCC Systems including resources accessible from this website, details of how to connect with us via social media channels, as well as some general guidance about GSoC.

Our GSoC Ideas List for 2015 is also located in the wiki here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+GSoC+2015+Ideas+List. Each project has a description including specifications for deliverables and an email address for the mentor associated with the project.

Remember, we cannot accept proposals directly via email. You must enter your proposal using Google's Melange interface here: https://www.google-melange.com/gsoc/homepage/google/gsoc2015.

The application process opens on Monday 16th March and closes on Friday 27th March. So you have some time now to select a project and get answers to your questions from the mentor.

We look forward to receiving your proposal. Good luck!

What's coming in HPCC Systems 5.2

We're almost ready to release a gold version of HPCC Systems® 5.2 in March. It's still undergoing testing at the moment, however, there is a release candidate (HPCC Systems® 5.2.0-rc3) available on the website now: http://hpccsystems.com/download/free-community-edition/server-platform/beta.

You can, of course, build your own from the sources here: https://github.com/hpcc-systems/HPCC-Platform.

Here is a sneak preview of what you can expect to see in terms of improvements and new features:

ECL Watch improvements including graphviewer and better graph stats

New ECL Watch continues to extend its coverage of features. The facelift has been extended to include the Topology area which has been simplified into a single user interface with a consistent approach to accessing the configuration information and log files. Viewing the HPCC Systems® components and the machines on which they are running will be easier and more accessible using the tree view approach that has been implemented.

You can now view your target clusters and all their supporting details much more easily:

Viewing the configuration file for a service is also easier allowing you to navigate between the different configuration files for all services from a single page:

You can also view the entire list of machines in your environment, finding out which processes are running on a node simply by clicking on it to display the processes in a tree view while retaining access to the rest of the machines in your environment:

The new topology area is a technical preview for 5.2 and we welcome any feedback from users that will help us to continue these improvements in HPCC Systems® 6.0 later this year.

Meanwhile, behind the scenes, some internal restructuring and refactoring has been done to enable all workunit statistics and timings to be gathered more efficiently and consistently while providing a SOAP interface to allow access to them. As a result of this work, we will be able to improve on the workunit statistics we already gather and provide tools and features to analyse these statistics and spot issues in ECL queries, in particular when using graphs. ECL Watch and the new graph viewer will provide the interface to display this information.

This leads us nicely into discussing the graph control. In previous versions of HPCC Systems®, the Graph Control relied on NPAPI support which is being phased out. The effect of this is that most modern browsers will eventually be unable to view graphs in ECL Watch. As a result, we are introducing a technical preview of our new 100% Javascript Graph Viewer.

There will be no separate installation process for the new Graph Viewer; it will simply run automatically within your browser. This also has security benefits, since it will run without the need to activate or install any controls or plugins from potentially unknown sources. It will also be supported by most modern browsers, removing the restrictions imposed by the withdrawal of NPAPI support. The existing ActiveX/NPAPI version will continue to be the default, with the Javascript variant being used either if there is no GraphControl installed or if a browser has dropped support for it (specifically, NPAPI support in Chrome on Linux). The goal is to switch completely to the new Javascript graph viewer for HPCC Systems® 6.0 (targeted for later this year).

The modernization of the graph viewer also provides us with the opportunity to improve the quality of information we can supply to you about your job via the graph. The improvements will make it possible to drill down into some of the more detailed and interesting areas of a graph, such as where time was spent, where memory usage was largest, significant skews across nodes (including minimum and maximum levels), and the progress of disk reading activities that consume rows over time without output, along with many other useful statistics and indicators designed to help you interpret a graph more effectively and efficiently. You can expect to see a separate blog post about this later in the year as more stats become available and viewable from within ECL Watch.

Monitoring and Alerting - Ganglia and Nagios

HPCC Systems® 5.2 includes features which mean that Ganglia monitoring is now even easier to use. In addition to the ability to monitor Roxie metrics (available as of 5.0), Ganglia graphs are now viewable directly in ECL Watch through the Ganglia service plugin that is included as part of the HPCC Systems® Ganglia Monitoring 5.2 release. With scores of metrics being reported by Roxie (over 120 possible metrics are available), the ECL Watch Ganglia plugin provides some predefined HPCC centric graphs that are likely to be of most interest to operators and administrators.

This plugin provides seamless integration between ECL Watch and Ganglia monitoring, including the ability to pull up additional graphs which may be customized based on individual needs. Configuration and customization are simple, which means that existing infrastructures already using Ganglia monitoring can quickly integrate with HPCC Systems®, and new environments can incorporate Ganglia monitoring with minimal time and effort. So, for example, you can set it up to report on various aspects of system health including disk problems, CPU usage, memory or network issues, etc.

Once the Ganglia plugin is installed, the ECL Watch framework automatically places an additional icon in the toolbar providing easy access to the Ganglia monitoring metrics you have configured.

Moreover, with minimal effort, the JSON configuration file can be modified to display the graphs needed within your configuration.

We have also made progress integrating Nagios into HPCC Systems®. In 5.2, Nagios alerting may be used with HPCC Systems® to check not only that a component is running but also that it is working. You can generate Nagios configurations directly from the HPCC Systems® configuration file to ensure that your checks are specific to your HPCC Systems® environment.

The following image shows notifications for all hosts in an HPCC Systems® environment via Nagios:

This image shows the Service Status Details for all hosts in an HPCC Systems® environment:

While this facility is available from the command line in 5.2, by HPCC Systems® release 5.4 we expect to complete the process by integrating the alerts into ECL Watch so that they are visible via the UI in a similar way to that used to display the Ganglia monitoring information already mentioned.

Security improvement - dafilesrv authentication and encryption in transport

We know that security is a major issue for anyone using, manipulating and transporting data, so we have added an enhanced security measure for you to implement to make sure your HPCC Systems® environment is as secure as possible.

Processes read files on other machines in an HPCC Systems® environment using dafilesrv which is installed on all machines in an HPCC Systems® environment. The security around dafilesrv has been enhanced by providing a new DALI configuration option which takes a user ID and password and provides encryption in transport.

Additional embedded language - Cassandra

Carrying on from the new features in HPCC Systems® 5.0 which provided the ability to embed external database calls in ECL, we have added an additional database plugin in HPCC Systems® 5.2, this time to embed queries to Cassandra.

Just as with the MySQL plugin, it is possible to read and write data from an external Cassandra datastore using CQL (the Cassandra Query Language) embedded inside your ECL code, complete with streaming of data into and out of the Cassandra datastore. There is a blog about the Uses, abuses and internals of the EMBED feature here: http://hpccsystems.com/bb/viewtopic.php?f=41&t=1509&sid=10f8c3890e1dfe90...) which contains usage examples and information for currently supported embedded languages including Cassandra.
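
As a rough sketch of what an embedded Cassandra read can look like (the server address, keyspace, table and column names below are invented for illustration, and the exact option names supported should be checked against the plugin documentation for your version):

IMPORT cassandra;

valueRec := RECORD
  STRING name;
  INTEGER4 value;
END;

// Stream rows from a Cassandra table straight into an ECL dataset
DATASET(valueRec) readValues() := EMBED(cassandra : server('127.0.0.1'), keyspace('test'))
  SELECT name, value FROM tbl1;
ENDEMBED;

OUTPUT(readValues());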

New plugins - memcached and redis

We have also provided plugins for accessing the key-value stores memcached and redis. These are not really embedded languages (there is no “query language” as such), but values can be set and retrieved by key simply by making calls to functions in the plugin.

We are also hoping to add support for the “publish/subscribe” option in redis to support a paradigm where the first query that needs a value can calculate it and store it to redis, while other queries just use the previously calculated value. Details of this are still being finalized, but the expectation is that we can use “pub/sub” to ensure that any queries that ask for the value WHILE the first query is calculating it will block until the value is available rather than repeating the calculation. This is particularly interesting when used as a cache in front of an expensive operation, for example when looking up a value via an external gateway.

New library - dmetaphone

We are pleased to be able to make the dmetaphone library widely available for the first time. This library allows strings to be converted to a form that allows them to be compared on the basis of what they sound like. It was previously only included in the HPCC Systems® Enterprise Edition because of licensing restrictions on the original source code. However, now that the source code has been placed in the public domain by its original author, we are able to include it in the Standard Library of the HPCC Systems® Community Edition. It’s very useful for performing “fuzzy” string compares, for example when looking up names where there are multiple alternate spellings. Documentation on the usage of this library has already been added into the HPCC Systems® 5.2 version of the Standard Library Reference.
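
Assuming the library surfaces in the Standard Library as Std.Metaphone with a Primary() function (the module and function names here are from memory, so check the 5.2 Standard Library Reference for the exact spelling), a fuzzy name comparison might look something like this:

IMPORT Std;

name1 := 'Stephen';
name2 := 'Steven';

// Two different spellings reduce to the same phonetic key if they sound alike
key1 := Std.Metaphone.Primary(name1);
key2 := Std.Metaphone.Primary(name2);

OUTPUT(key1);
OUTPUT(key2);
OUTPUT(key1 = key2, NAMED('SoundsAlike'));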

ECL language features - JSON and web services

There are some new ECL Language features that may be of interest to you. Firstly, users who have JSON data or a service that requires data in JSON format may now use HPCC Systems® to process this data. The ECL language now supports the reading, writing and spraying of JSON files and can now also translate individual records to and from JSON format.
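
As a minimal sketch (the file path and the row path passed to the JSON() option are illustrative and depend on how your JSON file is structured), individual records can be converted with TOJSON/FROMJSON, and a JSON file can be declared much like any other dataset:

personRec := RECORD
  STRING20 name;
  UNSIGNED4 age;
END;

// Translate a single record to and from JSON text
p1 := ROW({'Jane', 42}, personRec);
jsonText := TOJSON(p1);                                   // e.g. '"name": "Jane", "age": 42'
p2 := FROMJSON(personRec, '{"name": "Jane", "age": 42}');

// Read a sprayed JSON file as a dataset
people := DATASET('~examples::people.json', personRec, JSON('/'));

OUTPUT(jsonText);
OUTPUT(p2);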

Secondly, there are some new additions to ECL for web services which will control how a query looks as a published web service. For example, we have added keyword parameters to STORED. The form, WSDL and XSD are affected, and the new format parameter allows several sub-parameters to be set to control the field width, height and the relative position of the stored field, as follows:

string s1 := 'how now brown cow' : stored('s1',
        format(fieldwidth(40), fieldheight(10), sequence(20)));

  • fieldwidth – controls the width of the form field for this stored input.
  • fieldheight – controls the height of the form field for this stored input.
  • sequence – controls the relative position of the stored field in the form, WSDL, and schema.

The new #webservice directive provides additional web-service-wide features, including parameters which allow you to explicitly list and order the STOREDs and to provide help text and a description for the form, as follows:

  • fields – allows you to explicitly list and order the storeds to be used for input. The default is all storeds, using “sequence” to determine order.
  • help – provides help text for the web service form.
  • description – provides descriptive text for the web service form.

string descrText := 'only the listed fields should show in form, xsd, and wsdl and in the given order';
string helpText := 'Enter some values and hit submit';

#webservice(fields('u1', 'i1', 'u2', 'i2', 's1'), help(helpText), description(descrText));


Dynamic ESDL (Gold Release)

This product is designed to help you create robust ECL-based web services with well-defined interfaces that are extensible over time. It uses the ESDL language to define web service interfaces, including support for versioning changes as the interface evolves. The initial release gave a snapshot of what this service can do, providing ECL developers with a contract for implementing the web services in the form of generated ECL code. Features have now been extended to include access to services via open standards such as SOAP, REST, and JSON, and stateless operation allows linear scaling of services.

Dynamic ESDL has been available to HPCC Systems® Enterprise Edition users for some time; however, we are pleased to announce that from 5.2 it will also be available to HPCC Systems® Community Edition users for the first time. The gold release includes an executable that allows users to dynamically configure and bind ESPs to ESDL interfaces.

JAPIs
The HPCC JAPIs project provides a set of JAVA based APIs which facilitate interaction with HPCC Web Services and C++ based tools. There’s no longer a need to set up your own Java networking logic in order to communicate with the HPCC Web Services, or concern yourself with the intricacies of a SOAP Envelope in order to submit an ECL workunit. Actuating HPCC WS methods is now as easy as instantiating a Platform object with the appropriate parameters, querying the specific HPCC WS client, and calling the corresponding method API with the appropriate values. Local ECL compilation is also possible by interfacing with a local installation of the eclcc executable, and if you’re working with RDF data, you can use the API to easily ingest your data into HPCC. The Java headers are available for download from GitHub: https://github.com/hpcc-systems/HPCC-JAPIs.

Enterprise Logging Service (Gold Release)

The Enterprise Logging Service provides a fault-tolerant and scalable way of logging accounting and other transaction information with Dynamic ESDL and/or other custom HPCC front-end services. It supports MySQL out of the box, but can be adapted to integrate with any database on the back end. Persistent queues allow reliable storage of transaction information even when a component is not available.

The technical preview of this service was released alongside HPCC Systems® 5.0. For the gold version of this release, the main focus has been stabilisation and improvement of the internal functionality.

A new interface has been added to support passing log data in groups to the logging service. This new interface makes it easier to access the logging service from Dynamic ESDL, as well as from other ESP-style applications. Several deployment scripts have been added which support multiple logging agents, which can be added using the HPCC ConfigMgr. The performance of the logging service has been improved by several changes, such as the new filter function, the new GUID function, and more.

Wrapping up...

So, as you can see, there are a lot of improvements and new features to enhance your HPCC Systems® experience with more to look forward to later in the year in HPCC Systems® 6.0.

Remember to check out the HPCC Systems® Red Book (available here: https://wiki.hpccsystems.com/display/hpcc/HPCC+Systems+Red+Book) for important information that will help you to make a smooth transition as you upgrade.

If you are upgrading to HPCC Systems® 5.2 from a 4.x version, you will notice that the new improved ECL Watch is displayed by default. To help you find your way around, use the ECL Watch Transition Guide (https://wiki.hpccsystems.com/display/hpcc/HPCC+ECL+Watch+5.0+Transition+...). This wiki also includes a guide for users upgrading from an HPCC Systems® 4.x version (https://wiki.hpccsystems.com/display/hpcc/Quick+guide+for+users+upgradin...) designed to help you find the new location of features quickly and easily.

To report issues, use our community issue tracker found here: https://track.hpccsystems.com/secure/Dashboard.jspa.

Adventures in GraphLand VI (Coprophagy)

The other evening my family was watching some of our younger tortoises acclimatize to a new enclosure. One of them fell off of a log and face-planted into some recently dropped fecal matter. Far from being perturbed, she immediately opened her mouth and started eating. My teenage son was entirely grossed out and exclaimed: “Ew; she’s eating poop!” My wife, looking somewhat perplexed, responded: “Yes, and that’s not even the good stuff!”

For those not familiar with reptile coprophagy; young tortoises will often eat droppings from older tortoises as it provides useful stomach fauna and a range of partially digested nutrients that the babies might not otherwise have access to. The older (and more senior) the tortoise; the higher the quality of the scat. In the world of baby tortoises: senior tortoise poop is definitely “the good stuff”.

The reason for this memorable if unpalatable introduction is to assert the notion that sometimes we need to ingest into Graphland data that contains useful information even if the current form is not immediately appealing. “The good stuff” often comes in a form suitable for re-digestion; not in the form suitable for knowledge engineering.

In case you are wondering if this is a reprise of ‘GraphLand V’ (dealing with dirty data), it isn’t. Even if this data is ‘the good stuff’ it may still take some significant work and planning to hammer it into a nice Entity Model.

Probably the commonest, and certainly the most important, case is where the incoming data is in the form of multi-entity transactions. As a simplified but essentially real-world example, this is how vehicle and person data usually turn up:

Each record represents a transaction. Two people are selling a single vehicle to two people. Each transaction therefore provides partial information regarding five different entities. There are also a number of implicit associations I might choose to derive: bought, sold, and potentially ‘co-owns’ to represent that two people appeared on the same side of a transaction. The question is: how do we take this rather messy looking data and bring it into KEL?

The first rule is that you do NOT construct your entity model based upon the data. You base your model upon how you wish you had the data. In fact we already have this entity model; so I’ll replicate it here with a couple of tweaks.

Hopefully the above will make sense to you; but just in case:

  1. We have a person entity, two properties – Name and Age. MODEL(*) forces every property to be single valued.
  2. We have a vehicle entity, two properties – Make and Colour
  3. We have a ‘CoTransacts’ association between two people
  4. We have an association from People to Vehicles for which the TYPE is defined in the association.

The question then becomes how do we extract the data from our big, wide transaction record into our entity model? We will start off by extracting the people. We have four to extract from one record. We will do this using the USE statement. You have seen the ‘USE’ statement already – but this one is going to be a little scarier:

  • First note that there are 4 Person sections to the USE clause. That is one for each person we are extracting.
  • The first of the Person clauses uses the field label override syntax to extract a particular did, name and age from the record.
  • The remaining three do exactly the same thing; but they use a piece of syntactic sugar to make life easier. If you say owner2_* then KEL will look for every default label in the entity prepended by owner2_.

Dumping the person entity we now get:

Note that all eight entities from the transaction are in the data (David and Kelly have a _recordcount of 2).

Bringing in the Vehicle is also easy; but it illustrates a useful point:

The field override syntax is used for the UID (otherwise it would expect a UID or ‘uid’ without the prefix). The other two fields (make and colour) are in the record with the default names so they do not need to be overridden. If you like typing; you can fill all of the fields in for completeness; but you don’t need to.

With the entities in, it is time to pull in the associations. CoTransacts first:

The override syntax to assign the unique IDs should be fairly comfortable by now. One thing that might surprise you is that I am using TWO associations for one relationship. I don’t have to do this – I can put one relationship in and walk both ways – but sometimes you want to do the above. We will tackle some of the subtler relationship types in a later blog. The above gives:

By now you should immediately spot that the two different instances of a relationship between 1&2 have been represented using __recordcount = 2.

Finally PerVeh:

This is one of those rare cases I am prepared to concede that late-typing an association is useful. We are almost certainly going to want to compare/contrast buy and sell transactions so giving them the same type is useful. So, when registering the relationships from a transaction, I use the ‘constant assignment’ form of the USE syntax to note that there are two buying and two selling relationships being created here. The result:

We have captured everything in the original transaction that is represented in our model. From each transaction record we produce four entity instances and eight association instances. We saw how common consistent naming can produce very succinct KEL (and the work around when the naming is hostile).

In closing I want to present a more complex model that keeps track of transaction dates. I am going to track both the dates over which people Cotransact and also when the buy-sell transactions happen. The association syntax IS quite a bit more exotic than the preceding; I’ll expound upon the details in a later blog.

Notes:

  • Only the ASSOCIATIONs changed
  • The ASSOCIATIONs now have a MODEL.
  • For CoTransacts this says that a given who/whoelse pair will have one (and only one) association of this type, and we keep track of all the transaction dates
  • For PerVeh we have one association for every Per/Veh pair. We then keep a table (called Transaction) detailing the Type and Date of each transaction

With this declaration and the previous data we get CoTransactions:

The two associations with two transactions now carry the date of the transaction. For PerVeh we get:

Many traditional data systems take one of three easy views of data structure: either they work on the data in the format it is in, or they assume someone else has massaged the data into shape, or they assume data has no real shape.

Even if some of the details are a little fuzzy, and building a strong Entity Model is a non-trivial task, I hope that I have convinced you that in GraphLand you should not take the easy way out. Knowledge has structure and we need to define that structure (using ENTITY, ASSOCIATION and MODEL). If we have to USE data in a structure that is currently unpalatable; we have a digestive system that is able to do so.

Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)

In pursuit of perils : Geo-spatial risk analysis through HPCC Systems

I have always wanted to say I was working on big data projects. In my own time I have dabbled with frameworks such as Hadoop on Amazon EC2, but I could not make the case for merging my day job with the big data aspiration. Now enter LexisNexis® Risk Solutions, who acquired our company in 2013 and who, guess what, excel in big data through HPCC Systems® (http://hpccsystems.com/), their open source big data processing platform that powers their data services business. At last, there was potential to merge the big data aspiration with the day job. Nice.

Our solution is called Map View, which is used by over 8,000 home and commercial property underwriters worldwide. Map View is a geo-spatial risk intelligence solution for the insurance industry, which for the most part consists of correlating existing and potential policy property locations with commercial and open peril data. This peril data informs the insurer what existing and potential risk exposure exists at the provided locations. Peril data consists of models pertaining to such risks as coastal flood, river flood, surface water, earthquake, subsidence, crime, wind and more. The insurer uses this risk assessment to decide what the suitable premium should be based on the severity of the discovered risks. Insurers also use this information to review clusters of risk exposure at various locations.

So that’s what we have been doing, and it has been going well from a technical point of view, but the bottleneck has always been scale. Although we optimised our algorithms to play nicely with the spatial relational database platform we were using, we still did not realise the order-of-magnitude performance improvements we needed.

Now for the crux of what this blog entry is about: leveraging HPCC Systems as our geo-spatial analysis engine and repository of our vector geometry and raster data.

The geo-spatial algorithms and binary geo-reference file handlers that we have been using did not exist in HPCC Systems, but Enterprise Control Language (ECL), the programming language for HPCC Systems, is extendable. It was clear that, to bring our geo-spatial processes into the realm of big data, we needed to extend ECL, and that’s when we reached out to our colleagues in the greater LexisNexis family. We reached out to the HPCC Systems algorithms group within LexisNexis technology, who meet on a regular basis. The group was the perfect sounding board for our first experiments in using bespoke C++ with HPCC Systems. Thanks to these guys we got a C++ plug-in skeleton up and running.

A few of us met in the Dublin office to determine the core spatial capabilities we would need in HPCC Systems. Having established this list, rather than implementing algorithms from scratch, we turned to geo-spatial libraries we were already familiar with that include most, if not all, of what we were looking for; so there was no need to cherry pick, we could simply integrate the entire libraries and expose what we need to ECL.

The main capabilities we are looking for fall into 4 categories:

  1. Spatial filtering of vector geometries
  2. Spatial operations using vector geometries
  3. Spatial reference projection and transformation
  4. Reading of compressed geo-raster files

The libraries we focused on were GDAL/OGR, GEOS, and Proj.4. The GDAL/OGR library, built to include the use of the GEOS library, provides spatial filtering satisfying the use of DE-9IM spatial predicates (e.g. within, touches, intersects, etc.) and also provides various spatial operations such as distance and convex hull. OGR, which is attached at the hip to GDAL, is a simple feature library and accepts vector geometries as Well Known Text (WKT) wrapped as a string literal. The Proj.4 library encapsulates functionality for the projection of spatial references, e.g. WGS84, UTM, Web Mercator, or NAD83; it is also used for the transformation of coordinates from one spatial reference system to another. GDAL provides functionality which allows for the reading of GeoTIFF binary raster files.

Sample use case: augmenting building data with derived peril fields

The following map shows various buildings (labelled 1-7), a fire station, a police station, and a large flood zone based on a 500m buffer either side of a river.

For this sample use case, the business objective is to augment an existing dataset of buildings with peril information by correlating their physical location with available geographic peril data. The ECL executed within HPCC Systems will use the GDAL library to perform the distance and point-in-polygon spatial operations.

Image showing a sample map showing elements used for augmenting building data with derived peril fields

In particular we want to augment the building data with 3 perils:

  1. Fire risk category based on distance away from fire station
  2. Crime risk category based on distance away from police station
  3. Flood risk flag based on being within a flood zone

We start off using this buildings dataset.

An image illustrating the use of an ECL Transform to augment building data with inferred geo-spatial values

The ECL used to produce the above result:

import $.GeometryLite as Geometry; 

BuildingLayout := RECORD
  STRING geom;
  INTEGER4 id;
END;

PerilAugmentedBuildingLayout := RECORD
  BuildingLayout;
  INTEGER4 distanceToFireStation;
  STRING1 fireRiskCategory;
  INTEGER4 distanceToPoliceStation;
  STRING1 crimeRiskCategory;
  BOOLEAN isAtRiskOfFlood;
END;

// A recordset of points in WKT and using WGS84 as the coordinate system
buildings_RSet := DATASET(
  [
    {'POINT(-84.26180076961152565 34.07911885643056848)',	7},
    {'POINT(-84.28096707737662996 34.07386384848369687)',	6},
    {'POINT(-84.28034594703238724 34.06441871524530995)',	5},
    {'POINT(-84.27511070555951278 34.07632602535307598)',	4},
    {'POINT(-84.26401909226950693 34.07132809899221115)',	3},
    {'POINT(-84.2606472418293464 34.06904953470633046)',	2},
    {'POINT(-84.25332677705793571 34.07173235399688593)',	1}
  ],
  BuildingLayout
);

// SRID = Spatial Reference System Identifier, and in this case correlates to the matching EPSG id (http://www.epsg.org/)
// Universal Transverse Mercator (UTM) Zone 16 North... X,Y in meters, good for showing local distances
UTMZ16N_SRID := 32616; 

// World Geodetic System (WGS) ... Longitude,Latitude good for using as the base coordinate system
WGS84_SRID := 4326; 

// The location of the fire station given as WGS84 and also converted to a local UTM point
fire_station_point := 'POINT(-84.27361647195701266 34.07592838651884648)';
fire_station_point_UTMZ16N := Geometry.toSRID(fire_station_point,WGS84_SRID,UTMZ16N_SRID);

// The location of the police station given as WGS84 and also converted to a local UTM point
police_station_point := 'POINT(-84.28388903577211977 34.06841445050786632)';
police_station_point_UTMZ16N := Geometry.toSRID(police_station_point,WGS84_SRID,UTMZ16N_SRID);

// The large flood zone...pre-generated from a line representing a segment of river path and buffered 500m both sides of that line
river_flood_buffer_polygon := 'POLYGON ((-84.275569480814269 34.053149448508812,-84.275502368216152 34.053622020238336,-84.275461577548469 34.053856279819868,-84.275406062884485 34.054088433147875,-84.275335977314512 34.054317839528537,-84.275251514146206 34.054543865844074,-84.275152906372298 34.054765888300402,-84.275040426028681 34.05498329414894,-84.274914383444596 34.055195483377922,-84.274775126387269 34.055401870368861,-84.274623039102778 34.055601885513035,-84.274458541256465 34.055794976783979,-84.274282086775159 34.055980611261425,-84.274094162595034 34.056158276602602,-84.273895287317842 34.056327482456609,-84.273686009780079 34.056487761818282,-84.27346690753815 34.056638672317504,-84.271467966412999 34.057943350956521,-84.271449849097706 34.057955123901209,-84.269457062666064 34.059244375758553,-84.267464263063459 34.060541242500669,-84.267230385493164 34.060685184060439,-84.266987555911015 34.060818462943388,-84.26673647534858 34.060940694367893,-84.266477868667863 34.061051525446047,-84.266212482468021 34.061150636202917,-84.265941082929231 34.06123774050058,-84.265664453599811 34.061312586864624,-84.265383393133249 34.06137495921039,-84.265341784003652 34.061382226124834,-84.265314901719449 34.061585486783336,-84.265310183323066 34.061619943334122,-84.265111905163181 34.063019950080957,-84.26507118648162 34.063254301019093,-84.265015731142569 34.063486548072767,-84.264945692179822 34.063716049839776,-84.264861262907829 34.063942172494826,-84.264762676389054 34.064164291540052,-84.264650204791437 34.064381793530238,-84.264524158637713 34.064594077767168,-84.264384885948843 34.064800557959067,-84.264232771283574 34.065000663840181,-84.264068234677239 34.065193842746154,-84.263891730482143 34.06537956114078,-84.263703746113279 34.065557306090064,-84.26350480070252 34.06572658667919,-84.263295443665044 34.065886935368937,-84.263076253182007 34.066037909287367,-84.262751452284149 34.066249907042753,-84.262614616529277 34.06728718811172,-84.262609735088645 34.067322879116531,-84.262411456952634 34.068722885868624,-84.262370565581591 34.06895807170757,-84.262314831409327 34.069191136361503,-84.262244409341704 34.069421431556577,-84.262159495144019 34.069648316717327,-84.262060324897817 34.06987116074864,-84.26194717434521 34.070089343791523,-84.261820358123074 34.070302258947599,-84.261680228888764 34.070509313967584,-84.261527176339996 34.070709932899128,-84.261361626131588 34.070903557689206,-84.261184038692122 34.071089649736948,-84.260994907943669 34.071267691392244,-84.260794759928146 34.071437187396057,-84.260584151344403 34.071597666258612,-84.260363667999641 34.071748681571322,-84.258364726197328 34.073045730610005,-84.258134929191982 34.073186884602521,-84.257896506679955 34.073317758170241,-84.257650122125995 34.07343798711355,-84.257396461160667 34.073547236854345,-84.257136229671985 34.073645203367583,-84.257122298651311 34.073649727584275,-84.256948086173011 34.073809590460101,-84.256741016186922 34.073980186060624,-84.256523217046393 34.07414126483485,-84.254624040614956 34.075471446270598,-84.254724400109581 34.075597079684485,-84.254869780406139 34.075801719022742,-84.255001988909029 34.076012453715521,-84.255120657537674 34.076228697222078,-84.255225455889686 34.076449847663952,-84.255316092161522 34.076675289500066,-84.255392313961806 34.076904395239524,-84.255453909015031 34.077136527188117,-84.25550070575369 34.077371039223024,-84.255532573797112 34.077607278591138,-84.255549424315745 34.077844587725856,-84.255551210279833 34.078082306077405,-84.255537926591671 
34.078319771951413,-84.255509610101271 34.078556324350842,-84.255466339505119 34.078791304815994,-84.255161247280483 34.080198945384261,-84.255103785139212 34.080429419521792,-84.255031936455268 34.080657082246773,-84.254945897195938 34.08088131225167,-84.254868679322044 34.081051343080176,-84.237240155349298 34.081049738427836,-84.237280208789684 34.080978287916651,-84.237412814623895 34.080770733580913,-84.237558247861628 34.080569215695952,-84.237716112647789 34.080374282655967,-84.237885979306 34.080186464932545,-84.238067385508458 34.080006273631433,-84.238259837534528 34.079834199102123,-84.238462811614468 34.079670709603825,-84.238675755355032 34.07951625003173,-84.238898089242582 34.079371240706635,-84.239129208220191 34.079236076231616,-84.23936848333372 34.079111124418603,-84.239615263443213 34.07899672528783,-84.239868876994265 34.078893190142935,-84.240128633844932 34.07880080072416,-84.240393827143151 34.078719808441974,-84.240663735249441 34.078650433693163,-84.240937623699878 34.078592865261356,-84.241214747203784 34.078547259803486,-84.241494351670951 34.078513741423592,-84.241775676262534 34.078492401335303,-84.242057955460552 34.078483297613602,-84.242276383939895 34.078485739280964,-84.242492265400855 34.078344838751413,-84.242669896687488 34.07823372485575,-84.243425452296762 34.077780983762665,-84.243369611642905 34.077645512126423,-84.243292744497253 34.077422098004902,-84.243229822889347 34.077195683226336,-84.243181014084243 34.0769668700799,-84.243146447802232 34.07673626722972,-84.243126215874952 34.076504488095637,-84.243120372002366 34.076272149221445,-84.24312893161121 34.076039868634815,-84.243151871815115 34.075808264203388,-84.243189131476854 34.075577951991299,-84.243461269006659 34.074171169484238,-84.243661428789082 34.072767093882938,-84.243859572282346 34.071365470820695,-84.243900570306963 34.071129462053882,-84.243956514650279 34.070895588727439,-84.244027248504821 34.070664505863085,-84.244112573646277 34.07043686066212,-84.244212250989861 34.070213290693019,-84.244326001261058 34.06999442210612,-84.244453505778935 34.069780867880262,-84.244594407349553 34.069573226106428,-84.244748311267287 34.069372078313052,-84.244914786420949 34.069177987837776,-84.245093366501834 34.068991498250192,-84.245283551310166 34.068813131829963,-84.24548480815632 34.068643388104555,-84.24569657335293 34.068482742450705,-84.245918253793548 34.06833164476361,-84.246149228613717 34.068190518197326,-84.246388850929605 34.068059757980151,-84.246636449649188 34.067939730308254,-84.246891331351378 34.06783077132043,-84.247152782227332 34.067733186157128,-84.247420070078917 34.067647248106269,-84.247692446368404 34.06757319783808,-84.24796914831397 34.067511242731527,-84.248249401024921 34.067461556293573,-84.248295264997253 34.067455515234712,-84.248322821978107 34.067402432756531,-84.248449719882316 34.067189575141107,-84.248589924187186 34.066982580508565,-84.248743044927195 34.066782024461098,-84.248908656228465 34.066588464693275,-84.249086297493619 34.066402439441589,-84.249275474683074 34.066224465988277,-84.249475661689175 34.066055039223322,-84.249686301799471 34.065894630268701,-84.249906809244763 34.065743685168833,-84.250136570827962 34.065602623650683,-84.250374947629027 34.065471837957027,-84.250621276781317 34.06535169175622,-84.250874873314345 34.065242519131296,-84.25113503205796 34.065144623651378,-84.251401029602519 34.065058277527839,-84.251672126309671 34.064983720857754,-84.251947568368251 34.064921160956473,-84.252035386818392 
34.064905301684796,-84.25204925128071 34.064800299544302,-84.252093345968433 34.064537746115043,-84.252398611067974 34.063042402369028,-84.252453765203541 34.06280954236135,-84.252523575220749 34.062579422357807,-84.252607847320334 34.062352680810513,-84.252706347584663 34.062129946793526,-84.252818802627843 34.061911838257828,-84.25294490035526 34.06169896031723,-84.253084290830429 34.061491903569909,-84.253236587246619 34.061291242460207,-84.253401367000805 34.061097533685391,-84.253578172866739 34.060911314651555,-84.253766514263859 34.060733101983146,-84.253965868618735 34.060563390090095,-84.254175682815074 34.0604026497967,-84.254395374728318 34.060251327035772,-84.254617325955067 34.06010647236338,-84.254750577269093 34.059099870797915,-84.254794130832508 34.058840403128563,-84.255099396160134 34.057341244692481,-84.255154469659445 34.05710822660803,-84.255224216885622 34.056877947209394,-84.255308443981519 34.056651046156581,-84.255406916874094 34.056428153720624,-84.255519361925792 34.056209889033113,-84.255645466695697 34.055996858366782,-84.255784880808292 34.055789653451782,-84.255937216927606 34.055588849832475,-84.256102051833835 34.055395005269098,-84.256278927599539 34.055208658188938,-84.25646735286206 34.055030326191243,-84.256666804188725 34.054860504609906,-84.256876727531036 34.054699665138081,-84.257096539763594 34.054548254518501,-84.258948432267829 34.053339603420973,-84.275569480814269 34.053149448508812))';



// A transform used to augment the details of a building with derived peril information
PerilAugmentedBuildingLayout PerilAugmentTransform(BuildingLayout L) := TRANSFORM
  // convert coordinate reference system of point to UTM 16 N so that 
  // distance is calculated in meters
  STRING geom_UTMZ16N := Geometry.toSRID(L.geom,WGS84_SRID,UTMZ16N_SRID);
  
  // Perform distance calculations
  SELF.distanceToFireStation := Geometry.distanceBetween(geom_UTMZ16N, fire_station_point_UTMZ16N,UTMZ16N_SRID);
  SELF.distanceToPoliceStation := Geometry.distanceBetween(geom_UTMZ16N, police_station_point_UTMZ16N,UTMZ16N_SRID);
  
  // Perform point in polygon assertion
  SELF.isAtRiskOfFlood := Geometry.isWithin(L.GEOM,river_flood_buffer_polygon,WGS84_SRID);
  
  // Apply rules for fire and crime
  SELF.fireRiskCategory := MAP(
              SELF.distanceToFireStation <= 500 => 'D',
              SELF.distanceToFireStation <= 1000 => 'C',
              SELF.distanceToFireStation <= 1500 => 'B',
              'A'
            );
  SELF.crimeRiskCategory := MAP(
              SELF.distanceToPoliceStation <= 600 => 'D',
              SELF.distanceToPoliceStation <= 1200 => 'C',
              SELF.distanceToPoliceStation <= 1900 => 'B',
              'A'
            );
  SELF := L;
END;

// Generate a new recordset with Peril details augmented
buildingsWithPerilsRset := PROJECT(buildings_RSet, PerilAugmentTransform(LEFT));

OUTPUT(buildingsWithPerilsRset,NAMED('Buildings_With_Perils')); 

	

Integrating spatial libraries into HPCC Systems

Integrating the spatial libraries into ECL via Inline C++:

Inline C++ within ECL is a convenient mechanism to extend ECL; it does not require building from the HPCC Systems source code and we do not need any external team to support incremental releases.

Installing the spatial C++ libraries, getting your Ubuntu based HPCC platform ready:

To use the spatial libraries within ECL BEGINC++ functions, you will still need to install the spatial libraries [once] to each node on the HPCC Systems cluster.

We installed this onto the 4.X Ubuntu VM image available for download on the HPCC Systems website (http://hpccsystems.com/download/hpcc-vm-image).

The minimum versions of the spatial libraries used in this article:

  • GEOS >= 3.2.2
  • PROJ.4 >= 4.7.0
  • GDAL >= 1.10.1

File : [ install.sh ]
  • Copy "install.sh" via ssh to a folder on the HPCC Systems Ubuntu VM
  • Uncomment the dependencies line if required
  • Execute "install.sh" to install the spatial libraries
#!/bin/bash

# dependencies : if not already installed
# sudo apt-get -y install g++ gcc make cmake bison flex binutils-dev libldap2-dev libcppunit-dev libicu-dev libxalan110-dev zlib1g-dev libboost-regex-dev libssl-dev libarchive-dev  libapr1-dev libaprutil1-dev subversion

# install PROJ.4
sudo apt-get -y install libproj-dev

# install GEOS
sudo apt-get -y install libgeos-dev

# compile and install GDAL
svn co https://svn.osgeo.org/gdal/branches/1.10/gdal gdal_stable

cd gdal_stable
./configure  --prefix=/usr --with-threads --with-ogr --with-geos --without-libtool --with-libz=internal --with-libtiff=internal --with-geotiff=internal --with-gif --without-pg --without-grass --without-libgrass --without-cfitsio --without-pcraster --without-netcdf --with-png --with-jpeg --without-ogdi --without-fme --without-hdf4 --without-hdf5 --without-jasper --without-ecw --without-kakadu --without-mrsid --without-jp2mrsid --without-bsb --without-grib --without-mysql --without-ingres --with-xerces --without-expat --without-odbc --with-curl --without-sqlite3 --without-dwgdirect --without-panorama --without-idb --without-sde --without-perl --without-php --without-ruby --without-python --without-ogpython --with-hide-internal-symbols
sudo make install
cd ../

	
File : [GeometryLite.ecl]

Contains a Geometry module "GeometryLite", truncated for the purposes of demonstration just to include 4 spatial functions: SpatialReferenceForSRID, distanceBetween, tranformToProjection, and hasSpatialRelation.

IMPORT $;

EXPORT GeometryLite := MODULE

      EXPORT ExtLibSpatial := MODULE
        /* SpatialReferenceForSRID

              Given a SRID return the WKT representing the Spatial Projection Details for that SRID
        */
        export STRING SpatialReferenceForSRID(INTEGER4 srid) :=
                BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

using namespace std;

#body
// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure

char *wktOut;

// determine the spatial reference details
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromEPSG(srid);

poSRS->exportToWkt(&wktOut);

// copy string into a char array
unsigned len = strlen(wktOut);
char * out = (char *) malloc(len);
for(unsigned i=0; i < len; i++) {
    out[i] = wktOut[i];
}

// free resources
free(wktOut);
OGRSpatialReference::DestroySpatialReference(poSRS);


//return result to ECL
__result = out;

// set length of return string
__lenResult = len;
ENDC++;
/* tranformToProjection

             Transform a geometry from one SRID projection to another
*/
EXPORT string tranformToProjection(const string  geom,  STRING srs1, STRING srs2):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

        using namespace std;

#body
// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure

OGRGeometry *thisGeom;
char *wkt;
char* wktIn = (char*) geom;


// determine the spatial reference details
char* wktSRSSourceIn = (char*) srs1;
OGRSpatialReference *sourceSRS = new OGRSpatialReference(NULL);
sourceSRS->importFromWkt(&wktSRSSourceIn);

char* wktSRSTargetIn = (char*) srs2;
OGRSpatialReference *targetSRS = new OGRSpatialReference(NULL);
targetSRS->importFromWkt(&wktSRSTargetIn);


// create geometry from given WKT
OGRErr err = OGRGeometryFactory::createFromWkt(&wktIn, sourceSRS, &thisGeom);

thisGeom->transformTo(targetSRS);

thisGeom->exportToWkt(&wkt);

unsigned len = strlen(wkt);

// copy string into a char array
char * out = (char *) malloc(len);
for(unsigned i=0; i < len; i++) {
    out[i] = wkt[i];
}

//return result to ECL
__result = out;

// set length of return string
__lenResult = len;

free(wkt);
OGRSpatialReference::DestroySpatialReference(sourceSRS);
OGRSpatialReference::DestroySpatialReference(targetSRS);
OGRGeometryFactory::destroyGeometry(thisGeom);
ENDC++;


/* distanceBetween

        Get the distance between the 2 given WKT geometries, the distance unit returned depends on the SRID used
*/
EXPORT REAL8 distanceBetween(const string  geom1, const string  geom2, STRING srs):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

        using namespace std;

#body

// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure


// determine the spatial reference details
char* wktSRSIn = (char*) srs;
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromWkt(&wktSRSIn);

bool hasAtLeastOneValidRelation = false;

char* wktInLeft = (char*) geom1;
char* wktInRight = (char*) geom2;

OGRGeometry *leftOGRGeom;
OGRGeometry *rightOGRGeom;

bool loadedOK = false;
OGRErr err =  NULL;

err = OGRGeometryFactory::createFromWkt(&wktInLeft, poSRS, &leftOGRGeom);
loadedOK = (err == OGRERR_NONE);

err = OGRGeometryFactory::createFromWkt(&wktInRight, poSRS, &rightOGRGeom);
loadedOK = (err == OGRERR_NONE);

double distance = leftOGRGeom->Distance(rightOGRGeom);

OGRGeometryFactory::destroyGeometry(leftOGRGeom);
OGRGeometryFactory::destroyGeometry(rightOGRGeom);
OGRSpatialReference::DestroySpatialReference(poSRS);

return distance;
ENDC++;

/* hasSpatialRelation

     Do the two given WKT geometries have at least one of the expected relations defined in relationTypeORBits [a single INT containing OR'd bits]

     @see <a href="http://en.wikipedia.org/wiki/DE-9IM">Wikipedia</a>

     usage:
     hasSpatialRelation("POINT(? ?)","POLYGON((? ?,? ?,? ?,? ?,? ?))", rel.WITHIN | rel.OVERLAPS, SRS(4326));


     @param geom1 STRING containing a WKT geometry, left side of predicate assertion
     @param geom2 STRING containing a WKT geometry, right side of predicate assertion
     @param rel INTEGER contains one or more bits representing what spatial relations should be evaluated
     @param srs the WKT Spatial reference details as got from Operation.SRS
*/
EXPORT boolean hasSpatialRelation(const string  geom1, const string  geom2, INTEGER rel, STRING srs):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

    /**
    Enumeration of all supported relation types
    */
        namespace RelationType {
    enum SpatialPredicate {
        INTERSECTS = 1 << 0,
        TOUCHES = 1 << 1,
        DISJOINT = 1 << 2,
        CROSSES = 1 << 3,
        WITHIN = 1 << 4,
        CONTAINS = 1 << 5,
        OVERLAPS = 1 << 6,
        EQUALS = 1 << 7
    };

    bool isBitwiseSpatialPredicate(int packedInteger, RelationType::SpatialPredicate predicate) {
        return (packedInteger & predicate) == predicate ;
    }
}

using namespace std;

#body

  // #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure


// determine the spatial reference details
char* wktSRSIn = (char*) srs;
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromWkt(&wktSRSIn);

bool hasAtLeastOneValidRelation = false;

char* wktInLeft = (char*) geom1;
char* wktInRight = (char*) geom2;

OGRGeometry *leftOGRGeom = NULL;
OGRGeometry *rightOGRGeom = NULL;

bool loadedOK = false;
OGRErr err = OGRERR_NONE;

// parse geom 1
err = OGRGeometryFactory::createFromWkt(&wktInLeft, poSRS, &leftOGRGeom);
loadedOK = (err == OGRERR_NONE);

if(loadedOK) {
    // parse geom 2
    err = OGRGeometryFactory::createFromWkt(&wktInRight, poSRS, &rightOGRGeom);
    loadedOK = (err == OGRERR_NONE);

    if(loadedOK) {
        // assert if a relation exists
        int relationTypePackedBitwise = rel;
        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::INTERSECTS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Intersects(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::TOUCHES)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Touches(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::DISJOINT)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Disjoint(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::CROSSES)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Crosses(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::WITHIN)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Within(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::CONTAINS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Contains(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::OVERLAPS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Overlaps(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::EQUALS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Equals(rightOGRGeom);
        }
        // clean right
        OGRGeometryFactory::destroyGeometry(rightOGRGeom);
    }
    // clean left
    OGRGeometryFactory::destroyGeometry(leftOGRGeom);
}
// return result
return hasAtLeastOneValidRelation;
ENDC++;

EXPORT SRS :=  SpatialReferenceForSRID;
END;

EXPORT Filter :=  MODULE
      /*
      Bitwise enumeration for all possible Spatial Relations

      Can be combined e.g.

      All WITHIN or TOUCHES or INTERSECTS = RelationType.WITHIN | RelationType.TOUCHES | RelationType.INTERSECTS
      */
      EXPORT RelationType := ENUM
        (
            INTERSECTS = 1 << 0,
            TOUCHES = 1 << 1,
            DISJOINT = 1 << 2,
            CROSSES = 1 << 3,
            WITHIN = 1 << 4,
            CONTAINS = 1 << 5,
            OVERLAPS = 1 << 6,
            EQUALS = 1 << 7
        );


/*
     hasSpatialRelation

     Do [this] and [thatOther] have at least one of the bitwise RelationTypes given in [relationTypeORBits]?
*/
EXPORT BOOLEAN hasSpatialRelation(const string  this, const string  thatOther, INTEGER relationTypeORBits,
                                  INTEGER4 srid) := FUNCTION
    STRING srs := ExtLibSpatial.SRS(srid);
    return ExtLibSpatial.hasSpatialRelation(this, thatOther, relationTypeORBits, srs);
END;

/*
    isWithin
*/
EXPORT BOOLEAN isWithin(const string  thisGeom, const string  thatOtherGeom, INTEGER4  srid) := FUNCTION
            return hasSpatialRelation(thisGeom,thatOtherGeom,RelationType.WITHIN, srid);
END;
END;

EXPORT Operation :=  MODULE
      EXPORT STRING tranformToProjection(const string geometryWKT, INTEGER4 sourceSRID, INTEGER4 targetSRID) := FUNCTION
                  STRING srs1 :=  ExtLibSpatial.SRS(sourceSRID);
                  STRING srs2 :=  ExtLibSpatial.SRS(targetSRID);

           return ExtLibSpatial.tranformToProjection(geometryWKT,srs1,srs2);
END;

/*
    distanceBetween

    Calculate the distance between 2 points, using the projection given by srid
*/
EXPORT REAL8 distanceBetween(const string  point_A_WKT, const string  point_B_WKT, INTEGER4 srid) := FUNCTION
            STRING srs := ExtLibSpatial.SRS(srid);
return ExtLibSpatial.distanceBetween(point_A_WKT,point_B_WKT,srs);
END;
END;

EXPORT toSRID :=  Operation.tranformToProjection;
EXPORT distanceBetween :=  Operation.distanceBetween;
EXPORT isWithin :=  Filter.isWithin;
END;
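
As a quick sanity check, the exports above can be exercised like this (a minimal sketch: the module name GeoSpatial is hypothetical, the coordinates are illustrative, and the distance comes back in the units of the SRS you pick, i.e. degrees for EPSG:4326 and metres for a projected SRS such as EPSG:2157):

IMPORT GeoSpatial;   // hypothetical name for the spatial module defined above

pointA  := 'POINT(-6.2603 53.3498)';   // Dublin
pointB  := 'POINT(-0.1278 51.5074)';   // London
ireland := 'POLYGON((-11 51,-11 56,-5 56,-5 51,-11 51))';

// distance in SRS units (degrees here, since EPSG:4326 is a geographic SRS)
OUTPUT(GeoSpatial.distanceBetween(pointA, pointB, 4326));

// re-project from WGS84 (EPSG:4326) to Irish Transverse Mercator (EPSG:2157)
OUTPUT(GeoSpatial.toSRID(pointA, 4326, 2157));

// simple containment test
OUTPUT(GeoSpatial.isWithin(pointA, ireland, 4326));

// several spatial predicates can be OR'd together via the Filter module
OUTPUT(GeoSpatial.Filter.hasSpatialRelation(pointA, ireland,
        GeoSpatial.Filter.RelationType.WITHIN | GeoSpatial.Filter.RelationType.TOUCHES, 4326));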


What about the binary raster files?

Raster files such as GeoTIFFs represent a large matrix of peril values, where each pixel covers a fixed number of meters on the ground, determined by the raster's resolution. Using the dfuplus HPCC Systems client tool we were able to spray all raster files contained in the HPCC Systems drop-zone into the data/blob fields of a dataset. We then used GDAL through inline C++ to read the blob data for each record (one record per raster file) and retrieve the raster values corresponding to an X,Y coordinate.
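
For readers who want to reproduce this, the following is a minimal sketch of the BLOB spray pattern; the drop-zone path, server addresses and the logical file name geo::rasters::tiff are placeholders, and the record layout follows the standard HPCC Systems BLOB-spray convention:

// sprayed from the drop zone with dfuplus, e.g. (one command, wrapped for readability):
//   dfuplus action=spray server=http://<esp-ip>:8010 srcip=<dropzone-ip>
//           srcfile=/var/lib/HPCCSystems/mydropzone/*.tif
//           dstcluster=mythor dstname=~geo::rasters::tiff
//           prefix=FILENAME,FILESIZE nosplit=1

// one record per raster file: the original file name plus its raw bytes as a BLOB
RasterRec := RECORD
  STRING    filename;
  DATA      image;                        // raw GeoTIFF bytes
  UNSIGNED8 recPos {VIRTUAL(fileposition)};
END;

rasters := DATASET('~geo::rasters::tiff', RasterRec, FLAT);

// each image field can then be passed to GDAL via inline C++ to look up
// the peril value at a given X,Y coordinate
OUTPUT(COUNT(rasters));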

Representing geometries in datasets (best practice, not prescription):

EXPORT Layout := RECORD
  STRING GEOM := '';
END;

All the geometries used in this blog post have been defined as WKT (Well-Known Text) string literals. WKT is widely accepted and widely supported. In pursuit of a canonical way of defining geometries, I think that a record layout with a GEOM field containing a WKT string literal is a good start.

For XML datasets based on geo-reference data such as GML (Geography Mark-up Language), the geometry elements could be translated into WKT for the purposes of canonical search and retrieval. Once relevant matching WKT GEOM records are found through a process of spatial filtering, the associated detail required could then be retrieved from the original GML record.
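
To tie those two ideas together, here is a hedged sketch of how such a GEOM layout might be filtered spatially; the module name GeoSpatial and the sample rows are hypothetical:

IMPORT GeoSpatial;   // hypothetical name for the spatial module shown earlier

GeomRec := RECORD
  STRING geom;       // WKT literal, as in the Layout above
END;

candidates := DATASET([{'POINT(-6.26 53.35)'},    // Dublin
                       {'POINT(2.35 48.86)'},     // Paris
                       {'POINT(-0.13 51.51)'}],   // London
                      GeomRec);

searchArea := 'POLYGON((-11 51,-11 56,-5 56,-5 51,-11 51))';

// spatial filter: keep only the geometries that lie within the search area
matches := candidates(GeoSpatial.isWithin(geom, searchArea, 4326));

OUTPUT(matches);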

About the author:

Andrew is a software engineer based in the LexisNexis Risk office in Dublin, Ireland. His role primarily consists of developing geo-spatial risk solutions using Java. He is an advocate of clean code and has a keen interest in working with distributed systems.

Acknowledgements:

Thanks to everyone for the help along the way:

  • Algorithms team:
    • Charles Kaminski
    • Greg Thompson
    • Jack Coleman
  • GIS Modelling Team, Dublin, Ireland:
    • Daniel Bang
  • Product Development Team, Dublin, Ireland:
    • Jason Moore
    • Greg McRandal
  • HPCC Systems Core Team:
    • Richard Chapman
    • Gordon Smith
    • Gavin Halliday
  • Training:
    • Bob Foreman
    • Alan Daniels

More information:

HPCC Systems 5.0

5.0 has been released! Right on time for the festivities celebrating the third anniversary of our Open Source HPCC Systems platform!

Together with a plethora of tweaks and improvements, it features a number of new capabilities. These include internationalization, with translations into Chinese, Spanish, Hungarian, Serbian and Portuguese; a redesigned ECL Watch, which offers a significant face-lift over our traditional management interface; a large number of visualizations that can be driven directly from within ECL; and a more natural way to embed foreign functions into ECL for Python, R, Java and JavaScript (and even a neat teaser on embedding SQLite). There are so many more that it would be difficult to enumerate them in a single blog post; an extensive list of new features, changes and enhancements is available as a sticky post in our forums.
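
As a taste of the embedded-language support, a trivial piece of Python can now be dropped straight into ECL (a minimal sketch, assuming the Python embed plugin is installed):

IMPORT Python;

// a trivial embedded Python expression, callable like any other ECL function
INTEGER addOne(INTEGER val) := EMBED(Python)
val + 1
ENDEMBED;

OUTPUT(addOne(41));   // 42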

What makes this event even more exciting is the fact that a number of new integrations are available too: How would you like to use SQL queries through a web services interface, to access published Roxie ECL queries? WsSQL now makes this possible! And how about a complete deployment of an entire HPCC environment with a click of a mouse under Canonical’s Ubuntu Juju? The free HPCC charm is available in the Juju store! And if you use Nagios, the new seamless integration allows direct monitoring of HPCC infrastructure with your existing monitoring system.

This is a great milestone for the HPCC community; so head over to the Downloads section now, and don’t forget to use the forums to tell others about your experience, get answers to your questions and socialize with the rest of the community.

And Happy Birthday, HPCC Systems!

Does Big Data Equal Big Security Problems?

On May 1, the report “Big Data: Preserving Values, Seizing Opportunities” was released by the Executive Office of the President in response to a directive from President Obama to examine the impact of big data technology on society at large.

Big data is a movement in its own right, but after the White House report was released there was an influx of media articles questioning big data, and in particular, the safety of big data. Questions began to circulate not only about how secure the data is, but also about the privacy rights of the citizens whose very personal information is stored in it.

For example, GigaOM posted an article titled “It’s 2014. Do You Know Where your Data Is?” and a LinkedIn blog declared, “Big Data has Big Problems.” Both stories addressed the security and privacy of information stored and utilized for big data purposes.

Recently, I gave an interview to discuss how LexisNexis Risk Solutions protects our data and customer information as well as address the recent concerns raised in the media regarding big data, and what we are doing on our HPCC Systems big data platform to maximize security. Below is the Q&A transcript from the interview.


Moderator: Why is LexisNexis’ information safe and then why should customers trust us?
Flavio Villanustre: We are secure because we have a comprehensive security and privacy program in place. We continuously test our security posture and address any weaknesses that we may find, and we also have state of the art controls around access to the data.

But security goes far beyond just technology. Security isn’t just about making your software secure so that it cannot be breached; you also need to make your processes secure. You need to provide awareness to your employees so that they don’t get socially engineered, for example, and apply controls around security across your entire organization. It’s not just technology, and it’s not just customer support or customer operations.

What are some specific things we do to protect the data?
We do a lot of things. On the administrative side, we have a comprehensive set of security policies, procedures and standards. We provide security training and require that employees have knowledge of our policies. We do internal and external (independent third party) risk assessments to ensure that every part of the process is assessed from a risk standpoint, and that controls are commensurate with the exposure and overall risk.

We also employ technical controls, which are things like firewalls and network segmentation, data loss prevention systems and anti-malware protection systems. Administrative and technical controls complement each other.

In addition, we draw a distinction between “security” and “compliance.” Too often, we see organizations “checking the box” to assure themselves that they are compliant with respect to some framework. Our viewpoint is: if we do a very good job with information security (at a technical level and at a process level), compliance more or less takes care of itself.

Controls can generally be classified in three categories: preventive, detective, and corrective. The most important ones are the preventive controls, which are put in place to prevent an attack or mitigate a threat. You need to keep in mind that it is very difficult to undo the damage when sensitive data is leaked or exposed, which is why we put significant emphasis on preventive controls and prioritize prevention. At the same time, we have to always be prepared for the possibility that data might be leaked or exposed, which is where detective controls come in handy: the sooner we can detect an intrusion or malicious attack, the more we can limit the potential damage, as opposed to detecting the event weeks or months later.

How does the security of HPCC Systems compare to that of other big data systems like Hadoop?
[HPCC Systems] is a lot better. We have been doing this for 20 years, and we built HPCC Systems specifically to support our business. As a result, many of the requirements that we have in our business around security and privacy are also incorporated into the HPCC Systems platform.

By contrast, Hadoop was designed from the ground up to allow people to store massive amounts of data on relatively inexpensive hardware and then be able to perform searches like the "find the needle in a haystack" type of problem. This is what it has been focused on for the longest time, rather than on making it work securely. So the security on Hadoop came as an afterthought, and even the basic authentication mechanisms weren't deployed until a couple of years ago.

I saw that HPCC Systems went open source in 2011. Does that cause any vulnerability issues?
Not at all! On the contrary, this increases security through transparency and collaborative development. Generally speaking, the open source movement – started back in the 80s – is about releasing not just the compiled (or binary) version of the software, but also the programming language version of the software, the source code from which the binary code is generated. Rather than making things less secure, the increased number of reviewers and contributors can identify and correct security issues much faster with their combined efforts, making HPCC Systems even less vulnerable.

When legacy systems are converged onto the HPCC Systems platform, are there any concerns that one needs to be aware of? Some leading journals suggest that technology has progressed so quickly that legacy systems may have issues with merging into a new platform?
It's true that technology has changed, and that it changes very rapidly. It’s no longer a time where we have a new generation of technology every 20 years. Now, a new generation happens every two or three years.

These days, big data encompasses many things: social media, videos, free text – which is not well supported by legacy systems. When you’re trying to deploy HPCC Systems in that environment, there are two ways you can do it. You can say, “Well, I’m going to phase out all my legacy systems completely and move all the data,” but that might not be viable for many companies since they may need to continue to operate while they do that migration, so integration is needed. As with any other integration process, there is some complexity, which could generate some potential security issues in between the interfaces while you are trying to connect one system to the other and move data. That is why, when we migrate legacy systems onto the HPCC Systems platform, we pay close attention to the security controls that may need to be implemented or refactored as a function of the migration.

Do we ever have risks of losing data?
Well, the reality is that everyone does. There is the myth of complete security, and it’s just that, a myth. There is no way you can say, “I’m 100 percent exempt from having any security issues ever.” Of course, we believe, based on applying best-in-class security practices, having thorough and comprehensive monitoring and surveillance and having a mature set of processes that adapts to the ever changing security threat landscape, that we have a very low risk of losing data.

Maybe I’ve been watching too many political action and sci-fi shows lately, but I was watching 24 and my mind kind of races, which makes me ask: do we ever have anybody try to intentionally hack our systems to see if they can get in?
We don’t have any controls against aliens from outer space, but we do try to intentionally hack into our systems. We have security assessments and penetration testing and we regularly perform both, internally and externally. In addition to our own security experts - who are very well-trained and very knowledgeable of these practices - we also have third parties that we hire on a regular basis, to attempt to break into our systems.

Unfortunately, the number of hackers or wannabe hackers is very large, and you can bet they will be creative in trying to find new ways of breaking into your systems. So, if you’re not performing continuous surveillance and scanning for new threats and attack methodologies, it will potentially expose you in the long run.

What are some challenges that you see with protecting big data in general in the future? And what do you think we need to do to combat those threats?
First of all, we need to draw a distinction between security and privacy. I think the biggest challenges are going to be potentially around privacy, which is a very touchy subject because there is no universal concept of privacy. This distinction is necessary because some people often confuse a perceived lack of privacy with a lack of security.

What is considered acceptable privacy in the US might not be acceptable privacy in Germany or China. What’s privacy to us today is not the type of privacy we used to consider 20 years ago, and it won’t be the privacy of 20 years from now. It’s important to have a clear understanding of what society accepts as privacy, to ensure that you don’t go beyond that. You never want to be seen as creepy, and I can’t define exactly what creepy is, but you will know it when you see it.

There can also be better education. For example, when you go to install an application on your smart phone, and the list of permissions pops up, the list is so long, you probably don’t even read it. You say, “I need the application, so accept.” Well, I don’t think that is the right way of doing it. There needs to be some bullet points, saying, “Look, you give me your data, and for giving me your data, I will use your data in this way.” It needs to be clear and understandable by anyone.

At the end of the day, there needs to be an exchange of value between the data owner (the person) and the data user, and that value needs to be measurable and tangible. I am glad to allow others to use my data, if that gives me access to better credit, simplifies my access to online services and makes my children safer in the streets.

Adventures in GraphLand V (Graphland gets a Reality Check)

If you are familiar with other graph systems and have not read any of my previous blogs, please do not read this one. If you do, you will rapidly convince yourself that for some inexplicable reason KEL is far more complex than all the other systems. This is not actually true; as the early blogs show, KEL can play simple just like the other systems. However, KEL is also designed to tackle the real world. Today we are going to abandon graphs completely, go back to square one, and ask ourselves the question: “what do we actually know?”

To ask this question I am going to go back to one of our previous datasets (the person dataset) and reproduce it the way it would usually turn up:

This is a lot like the dataset I showed you last time; it still has unique IDs, still has names and still has ages. The difference is that in real world data we cannot assume that there is one, single, unique value for each property that an entity has. There are really four distinct cases you need to think about:

  1. There is really one true value for a given property – and someone has handed it to us
  2. There is really one true value for a given property – and we have been given multiple possible values for it
  3. There are really multiple possible values for a given property
  4. There are really multiple possible values for a given property and we have been given multiple possible values for each of them

#1 is the case we have been dealing with until this point. It is the one that most simple systems will focus upon but outside of sample data it very rarely occurs unless you have created a system to perform significant knowledge engineering before you hand it to the system you intend to use for knowledge engineering. As we have already seen; KEL can work in this mode but it is really designed for the harder problem.

#2 is the case of something like ‘Age’ – each person has a particular age at a given point in time – but not all the reports of that age will (necessarily) be modern and they will thus not all be the same.

#3 and thus #4 speak to questions such as: “which states have you lived in?” or “which music do you like?” – a single person could easily have multiple values for those, again recorded over a length of time and with different encoding schemes.

So how does all of this relate to our favorite graph language? Way back in the beginning I introduced the ENTITY statement that allowed you to declare which properties a particular entity had. What I did not discuss was the fact that in that same entity declaration you can specify the MODEL (or behaviors) of the individual properties and I did not dig too deeply into some of the ‘weird stuff’ in the results we were getting back. Now is the time to dig in. First a simple piece of KEL -

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

And here is the output (on our new dataset)

A few things to observe:

  1. There is still one row for each entity (UID) even though for some of them we have multiple input rows
  2. Each ‘property value’ (age and name in this case) has its own list of values (child dataset for those of you familiar with ECL).
  3. Each unique value for a given field is only stored once. Thus in our input dataset UID=3 has three instances of the name ‘HELEN’ but it only appears once; however it has a record count of 3
  4. Each ‘property value’ has a record count, and the complete entity also has a record count (at the end of the row). Thus UID=4 has a record count of 1 for MATT, 2 for MATTHEW and a total count of 3

Whilst you probably think this all makes sense, here is a question for you: according to the data above, how many 16-year-olds are there in the data? How many 18-year-olds?

Finding the answer in KEL is remarkably easy:

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person{age,COUNT(GROUP)};

QUERY:Find(_age) <= Person(_age);

This shows a new KEL capability: inter-entity aggregation. Previously when we have seen the {} syntax, the UID has always been inside the {}. When the UID is there we are simply projecting and aggregating inside each entity. With the UID removed, we are still projecting and aggregating, but it is the entities themselves we are working with.

So; in the syntax above I am grouping by each value of age and counting the number of entities within each group. Here is the result from our test data:

To the specific question; KEL counts UID=4 as being both 16 and 18. UID=1 and UID=3 are both counted as being 46 although both are also counted as being 0.

In KEL a multi-valued property has all of the values (once).

If you use the ‘Find’ query above and pass in 16 you will get back UID=4 and UID=5.

I suspect that by this point some of you are shaking your heads: somehow, something is wrong. Somebody cannot have two different ages at the same time!
But suppose we were talking about phone number? Or ‘state lived in’? Suddenly having multiple values for that field would be considered perfectly reasonable.

Of course it comes down to the four groups I mentioned at the beginning; for each field the KEL programmer has the opportunity to express “how many values of this property should there really be?”. By default KEL assumes the data it is handed is correct; we have provided multiple values for age and therefore it has treated it as a multi-valued property!

Is this repairable within KEL? Yes – with logic statements. Here is a rework of the previous code in which the KEL programmer has begun to do the work of computing single values from multiple reports (case #2).

Person := ENTITY(FLAT(UID,Name,INTEGER in_age=age));

USE File_Person(FLAT,Person);

Person: => Age := in_age$Max;

QUERY:Dump <= Person{UID,{in_age},Age,{Name}};

In the first line I have now moved the ingested data into a multi-valued field called in_age. KEL does not enforce this, but I recommend using a consistent naming convention for ‘things you are bringing in that you know have the wrong model’.

Now, you should recognize the logic statement in the middle and you may even be able to read it. I want to read it with ‘new eyes’.

Firstly the scope is

Person:

This means all of the values for a given entity are in scope for this logic statement.

Age :=

Is now creating a new single value

in_age$Max

finds the maximum of all the ages within the entity (assuming ages only ever go up; which is true for everyone other than my mother).

The query statement has a new feature – some of the columns are inside extra {}. This means ‘put out a multi-valued property AS a multi-valued property’. The result is:

This looks a lot like the previous output, except we now have an Age column that is single valued per entity and behaves exactly like the nice case we handled in the first four blogs. In the ‘real world’ you would not use the in_age property once you had produced the clean version of it.

We have seen that KEL can handle the clean case like everyone else and it actually has a mechanism to handle the dirty case with full control for the user. Now for some magic: KEL has the ability to handle the dirty case as if it were the clean case: automatically.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(*));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

The new piece here is the MODEL statement. MODEL(*) is identical to MODEL(UID,Name,Age); it means ‘every field is a single valued field.’ If we ingest the data using this model we get this:

Every field value has been hammered down to one single value and everything can be treated ‘nicely’. The cost of this automated magic is that some data has been ‘scrunched’. If you look at the table above the _name_flags and _age_flags actually tell you when the scrunching has occurred. The ‘2’ values are those places where multiple values were hammered into one. KEL has also generated a ‘SanityCheck’ attribute that will give complete stats about how good (or bad) the automated scrunching was.

Whether or not you want automated scrunching will depend upon your own character and the nature of the data; in particular, it can depend upon the nature of the individual fields. This can be captured in the KEL. As an example, suppose we consider it reasonable to have multiple names but only one age.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(UID,{Name},Age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

Producing

Using this mechanism you can force ‘nice world’ and then selectively improve the conversion as you see fit.

For some the foregoing might have been a little dull, for others it was probably a little scary. I am, however, unrepentant. On many systems GraphLand would be a fantasy world of amazing things largely detached from the reality of large scale Big Data analytics. GraphLand is not designed to be a fantasy world; it is designed to bring large scale graph analytics to reality.

Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)

Adventures in GraphLand IV (The Subgraph Matching Problem)

I once opened a meeting of extremely senior folks, all of whom specialized in large scale graph analytics, by stating: “There is no such thing as an interesting large scale graph problem; but there are lots of huge graphs containing interesting small problems that we can’t find.” My statement was clearly hyperbolic and exceptions do exist, although I would contend it is rather closer to the truth than most people would like to admit. For reasons covered under ‘The Wayward Chicken’, a huge highly connected graph begins to look a lot like a population average. Therefore, if our graph analytics are really going to buy us anything, we need to be able to narrow down the area of the graph across which we are working.

One of the most important graph narrowing mechanisms is the Subgraph Isomorphism problem, or more informally the “subgraph matching” problem. The core idea here is that I have a little graph in my hand of interconnected facts; I want to find a place (or places) in a much larger graph where that little graph appears. This seemingly simple ability is the engine behind our patented MEOW technology that maps fragments of entities in documents to social networks. It is the same idea that drives our advanced person search capabilities. In short, if there was a single graph problem that we cared about being able to do well, it would be this one. This leads me neatly to the second major feature of the subgraph matching problem: the general case is NP-complete. Put another way, it is excruciatingly difficult to tackle this problem well.

This all leads us to the obvious question: why have we progressed this far through our tour of dataland without touching upon the subgraph matching problem? The answer is twofold. Firstly there was my slightly weaselly use of the phrase ‘in the general case.’ Whilst some sub-graph matching problems are horrible, some are relatively benign. I believe that through judicious precomputation and intelligent query design it is possible to render many, many subgraph matching problems tractable. Secondly, KEL is designed so that the “subgraph matching problem” (and most other graph queries) just ‘drops out’ once the logic elements have been defined. The purpose of the rest of this blog entry is to persuade you that this is the case.

For the following we will use the datasets defined in the previous three blogs. To start simply: “Find all the people that own or drive a Honda”:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') ) );

Here the outer level of our filter expression is ‘Person’, so we have already confined the search to person nodes. You can think of the ‘inside’ of a filter condition as: “go to the node and then try…” So for each Person node we will walk down the PerVeh link if there is one, then follow it to the Veh entity and check if it is a Honda. A quick review of the Person->Vehicle table we produced in the last blog shows that the answer should be Helen and Matthew; running the code produces:

Suppose we decide we want to hunt for people that own or drive a Honda AND a car which is blue. A quick glance at the above code probably tells you that:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') )
                      ,EXISTS(PerVeh.Veh(Colour='BLUE') ) );

Returns the one person able to satisfy both conditions.

COUNT can be swapped in for EXISTS to ensure a given entity has multiple matching items. For example: “who has at least two older relatives?”


QUERY: Find <= Person( COUNT(Relationship.whoelse(Person.Age<Age)) >= 2 );

The only ‘hard’ part in the above is the Person.Age<Age: who is Person and who is ‘local’? Remember, the way to read this expression is that you start at the node represented by the outer level; each time you meet () following an entity or association you are evaluating that node, and you travel whenever you see a period. Each time you meet () following an aggregate word (COUNT, EXISTS, SUM etc) you are about to follow multiple paths to accumulate a result. So here we start at a given person node. We scan down each of the relationships and we travel to each whoelse in turn. We then evaluate that whoelse node. So inside those () the ‘Age’ refers to the local node (the whoelse). To find the age of the original person we were scanning we need to use Person.Age. Thus to encode ‘has an older relative’, the relative has an Age > Person.Age.

Using exactly these techniques it is possible to specify an arbitrary (constant) subgraph and return all the matches to it. The slightly tedious part is that you have to write new code each time; it is a bit of a shame you can’t just parameterize everything. Of course you can:


QUERY: Find(_Age,Name2) <= Person( _Age, Relationship.whoelse(Name=Name2) );

This illustrates a number of language features:

  • You can parameterize queries; in 0.4 it simply means that a parameterized module is created to execute the query. In 0.5 it will also cause keys to be built and a Roxie service to be available.
  • The _Param case is special; it is for the very common case where you wish to filter an entity based upon the equivalent of a property value. Person( _Age…. ) means ‘all those people with Age equal to the _Age value’.
  • The nested relationship query shows the more ‘regular’ use of a parameter (Name2).
  • For this query I have also made use of the implicit EXISTS. A dataset used in a Boolean context evaluates to false if the dataset is empty.

The foregoing hopefully illustrates that the subgraph matching problem can be specified within KEL; what we have not done is make any use of the properties that we have spent three blogs manufacturing. So here is a harder question: “which people co-own a vehicle?”


QUERY: Find <= Person( PerVeh.Veh(Person.PerVeh.Type='OWNS', nOwners > 1) );

This only contains things you have seen before. Notice the Person.PerVeh.Type to pull information from up-path, and the nOwners that behaves as if it belongs to the vehicle (even though it was manufactured by travelling to other nodes). The latter point is crucial; one of the driving costs of the NP-complete nature of the subgraph matching problem is the complexity of the query. Having given ourselves a way to render the run-time part of the query simpler, we have significantly improved performance.

That about wraps up the subgraph matching problem; I do however wish to finish with a warning:

We have seen the code to find people with two older relatives. I suspect that if I were to set an exercise to find people that have access to more than two vehicles you would produce:


QUERY: Find <= Person( COUNT(PerVeh.Veh) > 2);

Unfortunately the answer returned would be wrong. David has two relationships with vehicle 2, and thus when the full path PerVeh.Veh is evaluated for ‘David’ three combinations in total are found (one for vehicle 1, two for vehicle 2). I could in this case, quite rightly, simply dismiss the problem as an evil of late typing (which it is). There are legitimate cases, though, where you need to know the end-points you have reached on a graph walk independent of the route taken to get there. Here is the syntax you need; what it means and how it works is the subject of a later blog.


QUERY: Find <= Person( COUNT(PerVeh{Veh}) > 2 );

Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)
