
HPCC Systems 5.0

5.0 has been released! Right on time for the festivities celebrating the third anniversary of our Open Source HPCC Systems platform!

Together with a plethora of tweaks and improvements, it features a number of new capabilities: internationalization and translation into Chinese, Spanish, Hungarian, Serbian and Portuguese; a redesigned ECL Watch, which offers a significant face-lift over our traditional management interface; a large number of visualizations that can be driven directly from within ECL; a more natural way to embed foreign functions into ECL for Python, R, Java and JavaScript (and even a neat teaser on embedding SQLite); and so many more that it would be difficult to enumerate them all in a single blog post. An extensive list of new features, changes and enhancements is available as a sticky post in our forums.

What makes this event even more exciting is the fact that a number of new integrations are available too: How would you like to use SQL queries through a web services interface, to access published Roxie ECL queries? WsSQL now makes this possible! And how about a complete deployment of an entire HPCC environment with a click of a mouse under Canonical’s Ubuntu Juju? The free HPCC charm is available in the Juju store! And if you use Nagios, the new seamless integration allows direct monitoring of HPCC infrastructure with your existing monitoring system.

This is a great milestone for the HPCC community; so head over to the Downloads section now, and don’t forget to use the forums to tell others about your experience, get answers to your questions and socialize with the rest of the community.

And Happy Birthday, HPCC Systems!

Does Big Data Equal Big Security Problems?

On May 1, the report “Big Data: Preserving Values, Seizing Opportunities” was released by the Executive Office of the President in response to a directive from President Obama to examine the impact of big data technology on society at large.

Big data is a movement in its own right, but after the White House report was released there was an influx of media articles questioning big data, and in particular, the safety of big data. Questions began to circulate not only about how secure the data is, but also about the privacy rights of the citizens whose very personal information is stored in this data.

For example, GigaOM posted an article titled “It’s 2014. Do You Know Where your Data Is?” and a LinkedIn blog declared, “Big Data has Big Problems.” Both stories addressed the security and privacy of information stored and utilized for big data purposes.

Recently, I gave an interview to discuss how LexisNexis Risk Solutions protects our data and customer information as well as address the recent concerns raised in the media regarding big data, and what we are doing on our HPCC Systems big data platform to maximize security. Below is the Q&A transcript from the interview.


Moderator: Why is LexisNexis’ information safe and then why should customers trust us?
Flavio Villanustre: We are secure because we have a comprehensive security and privacy program in place. We continuously test our security posture and address any weaknesses that we may find, and we also have state of the art controls around access to the data.

But security goes far beyond just technology. Security isn’t just about making your software secure so that it cannot be breached; you also need to make your processes secure. You need to provide awareness to your employees so that they don’t get socially engineered, for example, and apply controls around security across your entire organization. It’s not just technology, and it’s not just customer support or customer operations.

What are some specific things we do to protect the data?
We do a lot of things. On the administrative side, we have a comprehensive set of security policies, procedures and standards. We provide security through training and we require that employees have knowledge of our policies. We do internal and external (independent third party) risk assessments to ensure that every part of the process is assessed from a risk standpoint, and that controls are commensurate with the exposure and overall risk.

We also employ technical controls, which are things like firewalls and network segmentation, data loss prevention systems and anti-malware protection systems. Administrative and technical controls complement each other.

In addition, we draw a distinction between “security” and “compliance.” Too often, we see organizations “checking the box” to assure themselves that they are compliant with respect to some framework. Our viewpoint is: if we do a very good job with information security (at a technical level and at a process level), compliance more or less takes care of itself.

In general, controls can be classified in three categories: preventive, detective, and corrective. The most important ones are the preventive controls, which are put in place to prevent an attack or mitigate a threat. You need to keep in mind that it is very difficult to undo the damage when sensitive data is leaked or exposed. This is why we put significant emphasis on preventive controls and prioritize prevention. At the same time, we have to always be prepared for the possibility that data might be leaked or exposed, which is where detective controls come in handy: the sooner we can detect an intrusion or malicious attack, the more we can minimize the potential damage, as opposed to detecting the event weeks or months later.

How does the security of HPCC Systems compare to the threat of other big data systems like Hadoop?
[HPCC Systems] is a lot better. We have been doing this for 20 years, and we built HPCC Systems specifically to support our business. As a result, many of the requirements that we have in our business around security and privacy are also incorporated into the HPCC Systems platform.

By contrast, Hadoop was designed from the ground up to allow people to store massive amounts of data on relatively inexpensive hardware and then be able to perform searches like the "find the needle in a haystack" type of problem. This is what it has been focused on for the longest time, rather than on making it work securely. So the security on Hadoop came as an afterthought, and even the basic authentication mechanisms weren't deployed until a couple of years ago.

I saw that HPCC Systems went open source in 2011. Does that cause any vulnerability issues?
Not at all! On the contrary, this increases security through transparency and collaborative development. Generally speaking, the open source movement – started back in the 80s – is about releasing not just the compiled (or binary) version of the software, but also the programming language version of the software, the source code from which the binary code is generated. Rather than making things less secure, the increased number of reviewers and contributors can identify and correct security issues much faster with their combined efforts, making HPCC Systems even less vulnerable.

When legacy systems are converged onto the HPCC Systems platform, are there any concerns that one needs to be aware of? Some leading journals suggest that technology has progressed so quickly that legacy systems may have issues with merging into a new platform?
It's true that technology has changed, and that it changes very rapidly. It’s no longer a time where we have a new generation of technology every 20 years. Now, a new generation happens every two or three years.

These days, big data encompasses many things: social media, videos, free text – which are not well supported by legacy systems. When you’re trying to deploy HPCC Systems in that environment, there are two ways you can do it. You can say, “Well, I’m going to phase out all my legacy systems completely and move all the data,” but that might not be viable for many companies since they may need to continue to operate while they do that migration, so integration is needed. As with any other integration process, there is some complexity, which could generate some potential security issues in between the interfaces, while you are trying to connect one system to the other and move data. Which is why, when we migrate legacy systems on to the HPCC Systems platform, we pay close attention to the security controls that may need to be implemented or refactored as a function of the migration.

Do we ever have risks of losing data?
Well, the reality is that everyone does. There is the myth of complete security, and it’s just that, a myth. There is no way you can say, “I’m 100 percent exempt from having any security issues ever.” Of course, we believe, based on applying best-in-class security practices, having thorough and comprehensive monitoring and surveillance and having a mature set of processes that adapts to the ever changing security threat landscape, that we have a very low risk of losing data.

Maybe I’ve been watching too many political action and sci-fi shows lately, but I was watching 24 and my mind kind of races, which makes me ask: do we ever have anybody try to intentionally hack our systems to see if they can get in?
We don’t have any controls against aliens from outer space, but we do try to intentionally hack into our systems. We have security assessments and penetration testing and we regularly perform both, internally and externally. In addition to our own security experts - who are very well-trained and very knowledgeable of these practices - we also have third parties that we hire on a regular basis, to attempt to break into our systems.

Unfortunately, the number of hackers or wannabe hackers is very large, and you can bet they will be creative in trying to find new ways of breaking into your systems. So, if you’re not performing continuous surveillance and scanning for new threats and attack methodologies, you will potentially be exposed in the long run.

What are some challenges that you see with protecting big data in general in the future? And what do you think we need to do to combat those threats?
First of all, we need to draw a distinction between security and privacy. I think the biggest challenges are going to be potentially around privacy, which is a very touchy subject because there is no universal concept of privacy. This distinction is necessary because some people often confuse a perceived lack of privacy with a lack of security.

What is considered acceptable privacy in the US might not be acceptable privacy in Germany or China. What’s privacy to us today is not the type of privacy we used to consider 20 years ago, and it won’t be the privacy 20 years from now. It’s important to have a clear understanding of what society accepts as privacy, to ensure that you don’t go beyond that. You never want to be seen as creepy, and I can’t define exactly what creepy is, but you will know it when you see it.

There can also be better education. For example, when you go to install an application on your smart phone, and the list of permissions pops up, the list is so long, you probably don’t even read it. You say, “I need the application, so accept.” Well, I don’t think that is the right way of doing it. There needs to be some bullet points, saying, “Look, you give me your data, and for giving me your data, I will use your data in this way.” It needs to be clear and understandable by anyone.

At the end of the day, there needs to be an exchange of value between the data owner (the person) and the data user, and that value needs to be measurable and tangible. I am glad to allow others to use my data, if that gives me access to better credit, simplifies my access to online services and makes my children safer in the streets.

Adventures in GraphLand V (Graphland gets a Reality Check)

If you are familiar with other graph systems and have not read any of my previous blogs, please do not read this one. If you do, you will rapidly convince yourself that for some inexplicable reason KEL is far more complex than all the other systems. This is not actually true; as the early blogs show, KEL can play simple just like the other systems. However, KEL is also designed to tackle the real world. Today we are going to abandon graphs completely, go back to square one, and ask ourselves the question: “what do we actually know?”

To ask this question I am going to go back to one of our previous datasets (the person dataset) and reproduce it the way it would usually turn up:

This is a lot like the dataset I showed you last time; still has unique IDs, still has names and still has ages. The difference is that in real world data we cannot assume that there is one, single, unique value for each property that an entity has. There are really four distinct cases you need to think about:

  1. There is really one true value for a given property – and someone has handed it to us
  2. There is really one true value for a given property – and we have been given multiple possible values for it
  3. There are really multiple possible values for a given property
  4. There are really multiple possible values for a given property and we have been given multiple possible values for each of them

#1 is the case we have been dealing with until this point. It is the one that most simple systems will focus upon but outside of sample data it very rarely occurs unless you have created a system to perform significant knowledge engineering before you hand it to the system you intend to use for knowledge engineering. As we have already seen; KEL can work in this mode but it is really designed for the harder problem.

#2 is the case of something like ‘Age’ – each person has a particular age at a given point in time – but not all the reports of that age will (necessarily) be modern and they will thus not all be the same.

#3 and thus #4 speak to questions such as: “which states have you lived in” or “which music do you like” – a single person could easily have multiple values for such a property, again recorded over a length of time and with different encoding schemes.

So how does all of this relate to our favorite graph language? Way back in the beginning I introduced the ENTITY statement that allowed you to declare which properties a particular entity had. What I did not discuss was the fact that in that same entity declaration you can specify the MODEL (or behaviors) of the individual properties and I did not dig too deeply into some of the ‘weird stuff’ in the results we were getting back. Now is the time to dig in. First a simple piece of KEL -

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

And here is the output (on our new dataset)

A few things to observe:

  1. There is still one row for each entity (UID) even though for some of them we have multiple input rows
  2. Each ‘property value’ (age and name in this case) has its own list of values (child dataset for those of you familiar with ECL).
  3. Each unique value for a given field is only stored once. Thus in our input dataset UID=3 has three instances of the name ‘HELEN’ but it only appears once; however it has a record count of 3
  4. Each ‘property value’ has a record count and the complete entity also has a record count (at the end of the row). Thus UID=4 has a record count of 1 for MATT, 2 for MATTHEW and a total count of 3

Whilst you probably think this all makes sense; here is a question for you – according to the data above how many 16 year olds are there in the data? How many 18 year olds?

Finding the KEL answer is remarkably easy:

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person{age,COUNT(GROUP)};

QUERY:Find(_age) <= Person(_age);

This shows a new KEL capability: inter-entity aggregation. Previously, when we have seen the {} syntax, the UID has always been inside the {}. When the UID is there we are simply projecting and aggregating inside each entity. With the UID removed, we are still projecting and aggregating, but it is the entities themselves we are working with.
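As a minimal sketch using only constructs already shown (the query names here are made up), here are the two forms side by side: the first keeps UID inside the {} and produces one row per person; the second drops UID and groups the entities themselves, here by Name:

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:PerPerson <= Person{UID,Name};

QUERY:ByName <= Person{Name,COUNT(GROUP)};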

So; in the syntax above I am grouping by each value of age and counting the number of entities within each group. Here is the result from our test data:

To the specific question; KEL counts UID=4 as being both 16 and 18. UID=1 and UID=3 are both counted as being 46 although both are also counted as being 0.

In KEL a multi-valued property has all of the values (once).

If you use the ‘Find’ query above and pass in 16 you will get back UID=4 and UID=5.

I suspect that by this point some of you are shaking your heads; somehow something is wrong. Somebody cannot have two different ages at the same time!
But suppose we were talking about phone number? Or ‘state lived in’? Suddenly having multiple values for that field would be considered perfectly reasonable.

Of course it comes down to the four groups I mentioned at the beginning; for each field the KEL programmer has the opportunity to express “how many values of this property should there really be?”. By default KEL assumes the data it is handed is correct; we have provided multiple values for age and therefore it has treated it as a multi-valued property!

Is this repairable within KEL? Yes – with logic statements. Here is a rework of the previous code in which the KEL programmer has begun to do the work of computing single values from multiple reports (case #2).

Person := ENTITY(FLAT(UID,Name,INTEGER in_age=age));

USE File_Person(FLAT,Person);

Person: => Age := in_age$Max;

QUERY:Dump <= Person{UID,{in_age},Age,{Name}};

In the first line I have now moved the ingested data into a multi-valued field called in_age. KEL does not enforce this but I recommend using a consistent naming convention for ‘things you are bringing in that you know have the wrong model’.

Now, you should recognize the logic statement in the middle and you may even be able to read it. I want to read it with ‘new eyes’.

Firstly the scope is

Person:

This means all of the values for a given entity are in scope for this logic statement.

Age :=

Is now creating a new single value

in_age$Max

finds the maximum of all the ages within the entity (assuming ages only ever go up; which is true for everyone other than my mother).

The query statement has a new feature – some of the columns are inside extra {}. This means ‘put out a multi-valued property AS a multi-valued property’. The result is:

This looks a lot like the previous output, except we now have an Age column that is single valued per entity and behaves exactly like the nice case we handled in the first four blogs. In the ‘real world’ you would not use the in_age property once you had produced the clean version of it.

We have seen that KEL can handle the clean case like everyone else and it actually has a mechanism to handle the dirty case with full control for the user. Now for some magic: KEL has the ability to handle the dirty case as if it were the clean case: automatically.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(*));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

The new piece here is the MODEL statement. MODEL(*) is identical to MODEL(UID,NAME,Age), it means ‘every field is a single valued field.’ If we ingest the data using this model we get this:

Every field value has been hammered down to one single value and everything can be treated ‘nicely’. The cost of this automated magic is that some data has been ‘scrunched’. If you look at the table above the _name_flags and _age_flags actually tell you when the scrunching has occurred. The ‘2’ values are those places where multiple values were hammered into one. KEL has also generated a ‘SanityCheck’ attribute that will give complete stats about how good (or bad) the automated scrunching was.

Whether or not you want automated scrunching will depend upon your own character and the nature of the data and in particular it can depend upon the nature of the individual fields. This can be captured in the KEL. As an example; suppose we consider it reasonable to have multiple names but only one age.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(UID,{Name},Age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

Producing

Using this mechanism you can force ‘nice world’ and then selectively improve the conversion as you see fit.

For some the foregoing might have been a little dull, for others it was probably a little scary. I am, however, unrepentant. On many systems GraphLand would be a fantasy world of amazing things largely detached from the reality of large scale Big Data analytics. GraphLand is not designed to be a fantasy world; it is designed to bring large scale graph analytics to reality.

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

Adventures in GraphLand IV (The Subgraph Matching Problem)

I once opened a meeting of extremely senior folks all of whom specialized in large scale graph analytics by stating: “There is no such thing as an interesting large scale graph problem; but there are lots of huge graphs containing interesting small problems that we can’t find.” My statement was clearly hyperbolic and exceptions do exist although I would contend it is rather closer to the truth than most people would like to admit. For reasons covered under ‘The Wayward Chicken’ a huge highly connected graph begins to look a lot like a population average. Therefore if our graph analytics are really going to buy us anything we need to be able to narrow down the area of the graph across which we are working.

One of the most important graph narrowing mechanisms is the Subgraph Isomorphism problem or more informally “subgraph matching” problem. The core idea here is that I have a little graph in my hand of interconnected facts; I want to find a place (or places) in a much larger graph where that little graph appears. This seemingly simple ability is the engine behind our patented MEOW technology that maps fragments of entities in documents to social networks. It is the same idea that drives our advanced person search capabilities. In short if there was a single graph problem that we cared about being able to do well; it would be this one. This leads me neatly to the second major feature of the Subgraph Matching problem; the general case is NP complete. Put another way, it is excruciatingly difficult to tackle this problem well.

This all leads us to the obvious question; why have we progressed this far through our tour of dataland without touching upon the subgraph matching problem? The answer is twofold. Firstly, there was my slightly weaselly use of the phrase ‘in the general case.’ Whilst some sub-graph matching problems are horrible, some are relatively benign. I believe that through judicious precomputation and intelligent query design it is possible to render many, many subgraph matching problems tractable. Secondly, KEL is designed so that the “subgraph matching problem” (and most other graph queries) just ‘drops out’ once the logic elements have been defined. The purpose of the rest of this blog entry is to persuade you that this is the case.

For the following we will use the datasets defined in the previous three blogs. To start simply: “Find all the people that own or drive a Honda”:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') ) );

Here the outer level of our filter expression is ‘Person’; so we have already confined the search to person nodes. You can think of the ‘inside’ of a filter condition as: “go to the node and then try…” So for each Person node we will walk down the PerVeh link if there is one, then follow it to the Veh entity and check if it is a Honda. A quick review of the Person->Vehicle table we produced in the last blog shows that the answer should be Helen and Matthew; running the code produces:

Suppose we decide we want to hunt for people that own or drive a Honda AND a car which is blue. A quick glance at the above code probably tells you that:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') )
                      ,EXISTS(PerVeh.Veh(Colour='BLUE') ) );

Returns the one person able to satisfy both conditions.

COUNT can be swapped for EXISTS to ensure a given entity has multiple matching items. For example: “who has at least two older relatives?”


QUERY: Find <= Person( COUNT(Relationship.whoelse(Person.Age<Age)) >= 2 );

The only ‘hard’ part in the above is the Person.Age<Age; who is Person and who is ‘local’? Remember, the way to read this expression is that you start at the node represented by the outer level; each time you meet () following an entity or association you are evaluating that node, and you travel whenever you see a period. Each time you meet () following an aggregate word (COUNT, EXISTS, SUM etc) you are about to follow multiple paths to accumulate a result. So here we start at a given person node. We scan down each of the relationships and we travel to each whoelse in turn. We then evaluate that whoelse node. So inside those () the ‘Age’ refers to the local node (the whoelse). To find the age of the original person we were scanning we need to use Person.Age. Thus to encode ‘has an older relative’, the relative has an Age > Person.Age.

Using exactly these techniques it is possible to specify an arbitrary (constant) subgraph and return all the matches to it. The slightly tedious part is that you have to write new code each time; bit of a shame you can’t just parameterize everything. Of course you can:


QUERY: Find(_Age,Name2) <= Person( _Age, Relationship.whoelse(Name=Name2) );

Illustrates a number of language features:

  • You can parameterize queries; in 0.4 this simply means that a parameterized module is created to execute the query. In 0.5 it will also cause keys to be built and a Roxie service to be available.
  • The _Param case is special; it is for the very common case where you wish to filter an entity based upon the equivalent of a property value. Person( _Age…. ) means ‘of all those people with Age equal to the _Age value’.
  • The nested relationship query shows the more ‘regular’ use of a parameter (Name2).
  • For this query I have also made use of the implicit EXISTS. A dataset used in a Boolean context evaluates as false if the dataset is empty.

The foregoing hopefully illustrates that the subgraph matching problem can be specified within KEL; what we have not done is made any use of the properties that we have spent three blogs manufacturing. So here is a harder question: “which people co-own a vehicle?”


QUERY: Find <= Person( PerVeh.Veh(Person.PerVeh.Type='OWNS', nOwners > 1) );

Only contains things you have seen before. Notice the Person.PerVeh.Type to pull information from up-path and the nOwners that behaves as if it belongs to vehicle (even though it was manufactured by travelling to other nodes). The latter point is crucial; one of the driving costs of the NP complete nature of the subgraph matching problem is the complexity of the query. Having given ourselves a way to render the run-time part of the query simpler we have significantly improved performance.

That about wraps up the subgraph matching problem; I do however wish to finish with a warning:

We have seen the code to find people with two older relatives. I suspect that if I were to set an exercise to find people that have access to more than two vehicles you would produce:


QUERY: Find <= Person( COUNT(PerVeh.Veh) > 2);

Unfortunately the answer returned would be wrong. David has two relationships with vehicle 2 and thus when the full path PerVeh.Veh is evaluated for ‘David’ three combinations in total are found (one for vehicle 1, two for vehicle 2). I could in this case, quite rightly, simply dismiss the problem as an evil of late typing (which it is). There are legitimate cases though where you need to know the end-points you have reached on a graph walk independent of the route taken to get there. Here is the syntax you need; what it means and how it works is the subject of a later blog.


QUERY: Find <= Person( COUNT(PerVeh{Veh}) > 2 );

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

Adventures in GraphLand III (The Underground Network)

If you have digested my blog on The Wayward Chicken then as you gaze out upon GraphLand you should no longer be thinking of a land of clearings connected via a maze of pathways. Rather we have a modern industrial landscape; every node is a factory with a rail link to the neighboring factories. Raw goods (data) regularly travel towards each node to be processed into a more refined good which may in turn become the raw material for a neighboring factory. Conceptually (though not factually) the process happens for every node synchronously until everything is complete. Only then are orders serviced and that process usually just amounts to assembling the parts already made (quite possibly all within a single factory).

Our scene readily maps onto pure graph theory which would like us to believe the world consists of nodes and edges (balls and sticks) and that everything can be mapped onto this model. In truth it can; but it requires a shift of mind that I hope will repulse you as much as it does me.

In our opening data model we had people and relationships between people, and all our data fitted nicely into the model. How do I add the concept of ‘Vehicles’ to GraphLand? To stick with my stick and ball model I now have to give every node an extra property, the ‘Type’, which would be ‘Person’ or ‘Vehicle’. I then have to add vehicle properties to my node (which will be null for people). I now have to adjust all of my processing logic to ensure I am filtering (at process time) down to the types of entity I am interested in.

In fairness there are benefits to this ‘model’; it is a data model which by definition can handle anything. The cost is that the logic around how to process the entity types has to be pushed down into the queries and, worse, all of the resolution and checking happens at run time. To return to our analogy; every factory could process different things a different way and at different times, and we need to enter each factory to find out what it does. This could actually be a motivator behind the Wayward Chicken programming method.

The KEL model, and thus GraphLand, behaves very differently. In our first blog I introduced Entities and Associations and then used the term interchangeably with nodes and edges. This is not strictly correct. Within KEL you can define multiple entity types and multiple types of associations inter or intra entity. You can think of this as having multiple layers of factories burrowing underground with various pipes and tubes carrying goods from layer to layer. Within a single layer everything is homogenous and synchronous and the transport between layers is also homogenous and synchronous but the result is that complex heterogeneous queries can be achieved.

Enough theory – let’s get to the data.

First with apologies to all car buffs out there; here is the Bayliss interpretation of all interesting facts about our vehicles.

Secondly, a relationships file between people and vehicles. Note this file contains information regarding two different types of relationship between Person and Vehicle, Owns and Drives. The ‘per’ column is the entity ID for the person, the ‘veh’ column is the entity ID for the Vehicle.

These can be declared and used within GraphLand as easily as before:
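For reference, the declarations (exactly as they appear in the full program later in this post) are:

Vehicle := ENTITY(FLAT(UID,Make,Colour=Color));
PerVeh := ASSOCIATION(FLAT(STRING Type,Person Per,Vehicle Veh));

USE File_Veh(FLAT,Vehicle);
USE File_PerVeh(FLAT,PerVeh);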

It should be noted that if I personally were doing this I would have split my relationship file into two and imported one as an association ‘owns’ and one as ‘drives’; I would then have had rather nicer association names, compile time checking and nicer queries. However, for the sake of showing off some KEL features, I am bringing in the associations with the types intact and we will validate them at query time.

We may remember from our previous blog that associations for a given entity are now accessible as if they were child records of each entity. Thus the PerVeh association now exists for people and for vehicles.

This simple query provides a list of people with the vehicles they drive or own:


QUERY:Dump1 <= Person{UID,Name,
                PerVeh.Veh{UID,Make,Colour}};

Produces

Of course we do not really want to use KEL simply to dump out data we already had. What we really want is to compute new information. So here is a question for you: “How many vehicles is each person in the list the sole owner of?”

I would imagine that most of you started:

COUNT(PerVeh.Veh(< Ouch, this is going to hurt> ))

You actually can make it work that way; but it is much better if you decompose the problem. The first question one should ask is: “Which vehicles have only one owner?” This becomes very simple:


Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));

Within our underground network of vehicles it is very easy to count the number of people relationships. The new thing is that the associations can have values (such as type) and they can be part of filters too.

Then within our surface entity (people) it becomes fairly easy to compute the number of vehicles a person is the sole owner of:


Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

The COUNT and the nOwners=1 should be familiar. The slight ugliness is the checking that we are dealing with ownership rather than simply driving; you may remember my saying I didn’t like late-typing!

We have to check Type=’OWNS’; the problem is that the filter is in the scope of the vehicle which doesn’t have a Type attribute. KEL allows you to pull data from anywhere on the path to a data element; but you have to state the place in the path you are pulling the data from. Hence: Person.PerVeh.Type=’OWNS’.

As an aside: KEL is a language under development, we do have an issue open to allow a rather prettier form:


Person: => nWholeOwned := COUNT(PerVeh(Type='OWNS').Veh(nOwners=1));

Which works if a filter is resolvable before the full path is known.

Putting this all together we have:


Person := ENTITY(FLAT(UID,Name,INTEGER Age));

USE File_Person(FLAT,Person);

Vehicle := ENTITY(FLAT(UID,Make,Colour=Color));
PerVeh := ASSOCIATION(FLAT(STRING Type,Person Per,Vehicle Veh));

USE File_Veh(FLAT,Vehicle);
USE File_PerVeh(FLAT,PerVeh);

Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));
Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

QUERY:Dump <= Vehicle{Make,Colour, nOwners,
                               PerVeh.Per{Vehicle.PerVeh.Type,UID,Name}};

QUERY:Dump1 <= Person{UID,Name, nWholeOwned,
                               PerVeh.Veh{UID,Make,Colour}};

Dump (a dump of vehicles with people as the children) yields:

Notice the use of the path walking syntax to annotate the child person records with the information from the link.

Dump1 is then rather simpler:

There is much more fun we can have with this data as we inch into some queries that are necessarily second order. First though I want to tackle the sub-graph matching problem; or rather point out that we have actually tackled it already.

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

JOINS...

Different Types of Joins


Matching records from multiple data sources is one of the fundamental operations you need to process data. As you would expect ECL makes it easy – but the number of options can be bewildering. The following aims to guide you through the different options, and explain when they are appropriate.

The "simplest" way to match records between two datasets is to pair each record in one dataset with each record in the other dataset, and then filter the pairs to extract the matches. This is what JOIN(,ALL) does. Unfortunately as well as being simple it is likely to be very slow unless the datasets are very small. If the first dataset has M records, and the second N, then it will typically execute in O(MN) time. Thankfully if the match condition contains any equalities then there are much better approaches.

One of these approaches is to sort both datasets by the equality fields, and then step through the two datasets in step. You have the overhead of sorting the datasets which is typically O(MlnM) + O(NlnN), and finding the matching pairs is typically O(M+N). This is generally much quicker than the exhaustive approach above, and is the normal approach used in Thor.

Roxie uses a different implementation by default – instead of sorting the two datasets it starts by creating a hash table from the second dataset. Then each row from the first dataset is looked up in the hash table to find the matches. With the current implementation the "right" dataset is still sorted (to simplify duplicate handling and many lookups), but the order of the left is preserved. If the right dataset is relatively small then this will be significantly faster than sorting both datasets. This type of join can be selected in Thor by adding a ",MANY LOOKUP" attribute.
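Here is a hedged sketch contrasting the two (the datasets and field names are invented): the first JOIN relies on the default sort/merge strategy described above, while the second asks for the lookup strategy and only makes sense when the right-hand dataset is small enough to hold in memory:

PersonRec  := {UNSIGNED4 personId, STRING20 name};
VehicleRec := {UNSIGNED4 ownerId, STRING10 make};

people   := DATASET([{1,'ANN'},{2,'BOB'}], PersonRec);
vehicles := DATASET([{1,'HONDA'},{2,'FORD'}], VehicleRec);

OutRec := {STRING20 name, STRING10 make};
OutRec pair(PersonRec l, VehicleRec r) := TRANSFORM
  SELF.name := l.name;
  SELF.make := r.make;
END;

// Default strategy: sort both inputs on the equality fields, then merge
sortedJoin := JOIN(people, vehicles, LEFT.personId = RIGHT.ownerId, pair(LEFT, RIGHT));
// Lookup strategy: build a hash table from the right dataset instead of sorting
lookupJoin := JOIN(people, vehicles, LEFT.personId = RIGHT.ownerId, pair(LEFT, RIGHT), MANY LOOKUP);

OUTPUT(sortedJoin, NAMED('SortMerge'));
OUTPUT(lookupJoin, NAMED('ManyLookup'));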

If Roxie uses a LOOKUP join by default why doesn’t Thor? The limitation of a LOOKUP JOIN is that the entire right dataset needs to be held in memory. That is a reasonable assumption to make for Roxie, where the datasets are generally small, but it isn’t for Thor.

So far the discussion has assumed it is easy to compare a pair of rows, but it is only efficient if the rows are present on the same node in the cluster. Different methods are used in Thor to ensure the rows are present on the same node:

  • ALL joins and LOOKUP joins clone all the rows from the right dataset onto each of the nodes. This is only really efficient if the right dataset is reasonably small.
  • The sorted joins normally globally sort the first dataset, and use the distribution of records created from the first sort to distribute the second dataset to match.
  • An alternative is the HASH join. This distributes both datasets by a hash of the equality fields, and then performs local sorts after the datasets have been distributed.

If the ECL user knows already that the rows from the datasets must be on the same node, then they can opt to use a LOCAL variant of the join to avoid any redistribution. Sometimes, if the distribution of the left dataset is already known, the code generator will automatically distribute the right dataset to match, and then perform a local variant of the join.[+]
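A hedged sketch of the explicit form (names are invented; HASH32 is a standard ECL hash function): distribute both inputs the same way, then tell the JOIN that matching rows are already co-located:

PersonRec  := {UNSIGNED4 personId, STRING20 name};
VehicleRec := {UNSIGNED4 ownerId, STRING10 make};

people   := DATASET([{1,'ANN'},{2,'BOB'}], PersonRec);
vehicles := DATASET([{1,'HONDA'},{2,'FORD'}], VehicleRec);

// Both sides hashed on the join fields, so matching rows land on the same node
dPeople   := DISTRIBUTE(people, HASH32(personId));
dVehicles := DISTRIBUTE(vehicles, HASH32(ownerId));

localJoin := JOIN(dPeople, dVehicles,
                  LEFT.personId = RIGHT.ownerId,
                  TRANSFORM({STRING20 name, STRING10 make},
                            SELF.name := LEFT.name, SELF.make := RIGHT.make),
                  LOCAL);
OUTPUT(localJoin);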

So in summary…

  • If the join condition has no equalities then you need to use ALL.
  • If the right dataset is small you can use MANY LOOKUP.
  • Otherwise use a global sort.
  • You can use HASH to change the way distributions occur. (Sometimes this improves the performance.)
  • If matching rows are already on the same nodes you can add ,LOCAL.

But what if you don’t know how large the right dataset is? Using a standard join will work whatever the size of the datasets, but if the right is small then a lookup join would be significantly more efficient. One way to optimize for small datasets is to use conditional code based on an estimate of the size of the dataset. Unfortunately it makes the graphs ugly and confusing, and has to be pessimistic to avoid jobs failing because of insufficient memory. Why can’t it be done automatically? That’s where the new Thor SMART join comes in.

The SMART join processing code starts off assuming the right dataset is small, but gradually falls back to implementations that can cope with larger numbers of records. It goes through the following sequence:

  • Initially it uses a MANY LOOKUP join, duplicating the rows onto each node.
  • If it runs low on memory it uses a hash of the equality fields to distribute each row to a particular node. It then distributes the left dataset to match that distribution and uses a LOCAL MANY LOOKUP join on each node.
  • If there is still insufficient memory for the local hash table, the implementation falls back to a local join which sorts both sides of the re-distributed datasets (essentially a HASH join).

There is little penalty from using a SMART join, but potentially a significant advantage for small datasets. If you are joining to a dataset of unknown size then it is worth experimenting by adding ",SMART" to your join. It is quite likely to become the default join implementation for Thor in a future version.
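A sketch of how little changes in the ECL (the datasets here are invented): the join is written exactly as before, with SMART added as an option so Thor can choose the strategy at run time:

PersonRec  := {UNSIGNED4 personId, STRING20 name};
VehicleRec := {UNSIGNED4 ownerId, STRING10 make};

people   := DATASET([{1,'ANN'},{2,'BOB'}], PersonRec);
vehicles := DATASET([{1,'HONDA'},{2,'FORD'}], VehicleRec);

smartJoin := JOIN(people, vehicles,
                  LEFT.personId = RIGHT.ownerId,
                  TRANSFORM({STRING20 name, STRING10 make},
                            SELF.name := LEFT.name, SELF.make := RIGHT.make),
                  SMART);   // starts as a MANY LOOKUP, degrades gracefully if memory runs low
OUTPUT(smartJoin);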

(Note, a SMART join can’t be the default implementation of a MANY LOOKUP join because a smart join may change the distribution of the left dataset. The code generator tracks the distribution of each dataset and uses the information to optimize the query – those assumptions and optimizations would be broken if the dataset was redistributed.)


Pseudo Joins


There are a few examples of ECL usage which are effectively joining two datasets together, but the join is cunningly disguised...

  • An implicit dataset lookup:

    otherDataset(myfield=searchvalue)[1].x

    This occurs when a dataset is being searched for a field that matches a value, and a field is then extracted from the matching row. If the dataset is small you are much better off using the new DICTIONARY support. Otherwise it would be much better to JOIN against the dataset (using a LEFT OUTER join if there may be no matching element, and KEEP(1) if there might be more than one). A sketch after this list shows both rewrites.
  • Filtering by existence in a set:

    myDataset(searchValue IN SET(otherDataset, value))

    This code is searching to see if a value is contained in another dataset. Again, if the dataset is small, a dictionary is likely to be much more efficient. If it is large then an INNER JOIN (with KEEP(1)) would often be more appropriate.
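Here is a hedged sketch of both rewrites with invented data: DICTIONARY forms for the small-dataset case, and the equivalent JOINs for the large-dataset case:

CodeRec := {STRING2 code, STRING20 descr};
codes   := DATASET([{'FL','Florida'},{'GA','Georgia'}], CodeRec);
MainRec := {UNSIGNED4 id, STRING2 state};
recs    := DATASET([{1,'FL'},{2,'TX'}], MainRec);

OutRec := RECORD
  MainRec;
  STRING20 descr;
END;

// Small lookup dataset: replace codes(code = recs.state)[1].descr with a DICTIONARY
codeDict  := DICTIONARY(codes, {code => descr});
withDescr := PROJECT(recs,
                     TRANSFORM(OutRec,
                               SELF.descr := codeDict[LEFT.state].descr,
                               SELF := LEFT));

// Large lookup dataset: the JOIN equivalent of the implicit lookup
joined := JOIN(recs, codes, LEFT.state = RIGHT.code,
               TRANSFORM(OutRec, SELF.descr := RIGHT.descr, SELF := LEFT),
               LEFT OUTER, KEEP(1));

// Existence test: IN against the dictionary, or an INNER JOIN with KEEP(1)
inDict   := recs(state IN codeDict);
inJoined := JOIN(recs, codes, LEFT.state = RIGHT.code, TRANSFORM(LEFT), INNER, KEEP(1));

OUTPUT(withDescr); OUTPUT(joined); OUTPUT(inDict); OUTPUT(inJoined);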

Other JOIN Options


Sometimes most rows match relatively few rows in the other dataset, but there are a few instances of the match condition which result in very large numbers of records (e.g., john smith). Many of the options on a join are there to help cope with these pathological matches; a short sketch after the list shows the simpler forms in use.

  1. LIMIT(n)

    If there are more than n matches for a particular row then abort the job.
    Note: LIMITs and ATMOSTs are only applied to the portion of the join condition that uses equalities. (Often called the hard match condition in some of the documentation.)

  2. LIMIT(n, SKIP)

    If there are more than n matches for a particular row exclude any matches for this row from the output.

  3. ATMOST(n)

    If there are more than n matches for a particular row then treat it as if it didn’t match anything.
    It is very similar to LIMIT(n, SKIP), but the difference is that for a LEFT OUTER join ATMOST will generate an output record. This allows you to return information about rows that had too many matches in the results.
    It is worth adding ATMOST(1) to a LEFT OUTER join that can only match 1 record. It will allow the code generator to remove the join if it isn’t needed.

  4. ATMOST({cond1, cond2, .., condn},n)

    If there are more than n matches for a particular row, apply each of the extra optional conditions until the number of matches is below the threshold.
    This allows you to apply a fairly broad join criteria for most pairs of rows, but add extra criteria if the equality fields are very common (e.g., some last names are much more common than others, so add extra conditions on first name, or other criteria). It is often used when the match condition also includes an edit distance match function to ensure the number of pairs processed by the edit distance function is reasonable.

  5. KEEP(n)

    Only generate output records for the first n rows that match the full join condition.

There are a few other options, detailed in the language reference manual, which can help in fairly specific circumstances, but generally you should only add options for good reason.
(See http://hpccsystems.com/download/docs/ecl-language-reference)

[+] This new optimization has occasionally caused problems where ECL programmers have added a DISTRIBUTED() operation into some ECL to remove a warning, but the distribution expression wasn’t correct. They had used DISTRIBUTED(dataset, an-arbitrary-expression) instead of DISTRIBUTED(dataset), which indicates the dataset is distributed in an unspecified way. The moral of the story – never lie to the ECL compiler or it might come back to bite you.

Adventures in GraphLand (The Wayward Chicken)

The other evening I was distracted from a lengthy debug session by a commotion coming from the chicken coop. I raced out the door fearful that some local predator had decided to attempt to steal some dinner. I discovered no such thing; rather ‘the girls’ were all agitating to get out of the coop. In my concentration I had forgotten that it was past the time they are normally allowed out for their daily forage. I immediately opened the coop door and with an explosion of feathers the girls hurtled from the coop in all directions across the yard; each one individually convinced that the particular location that they had chosen to run to was liable to yield the most and tastiest bugs. Last out of the coop stalked our rooster with a grim look on his face; he would be spending the next hour trying to round his charges into the same place so that he could protect them from predators.

This tale of bucolic bliss is not provided to amuse; rather it illustrates the behavior of most programmers when first presented with a graph to analyze. Some deep primal instinct somehow persuades the otherwise analytic mind that the best way to tackle a graph is to scamper all over the place collecting snippets of information as one travels. One of our top technologists responded to my first blog with: “How do I get as far as possible as quickly as possible? Can I go backwards and forwards along the same path? What if I run around in circles? What if I get trapped? How do I know if I wander so far that I’m about to fall off the map?” Being a top technologist his language was rather more erudite (shortest path, directedness, cyclicity, connectedness and centrality) but the intuitive meaning is identical.

I don’t know what causes programmers to do this but I suspect it may be related to the (false) Urban Myth that everyone is related to Kevin Bacon by six degrees of separation. As part of a drastically over affirmed society our take-away from this myth is that there is yet another reason why we are special. I would suggest that if everyone is within six degrees of Kevin Bacon then the correct take-away should be that there is no significance to being within six degrees of Kevin Bacon.

Rather more mathematically; let’s assume that everyone has 25 close associates (relatives, co-workers, neighbors, in-laws) and as a first approximation let’s assume that the graph spans out evenly and without cycles. Then by six degrees of separation you have 25^6 ≈ 240M people; around the adult population of the US. If I were to say: “There are two million criminals in the US, you are in the US, therefore you might be a criminal” people would think I were crazy. If I were to say that I had computed that two million of your sixth degree relatives were criminal; then you would be in for questioning. The reality is that the two statements are essentially the same.

Even if we narrow our sights far further: if you view ‘relatives’ as a social network (25 people) rather than pure blood (3-9 people), then by third degree relationships you have fifteen thousand people; you would expect around 150 of them to have done jail time simply by random chance. At a statistical level, if your graph has a branching factor of 25 then a fact regarding a second degree relative is twenty five times less significant than one relating to a first degree relative.

As such: the best and most significant information is always close to home.

This insight is one of the core drivers behind the KEL programming philosophy. Whilst some graph languages proudly declare they are ‘graph traversal’ languages; KEL is a knowledge engineering language. Specifically:

KEL exists to concentrate knowledge from the graph into the entity (not to scatter the query from the entity to the graph).

To illustrate the difference I want to tackle a question for the data presented in the last blog: “how many of your first degree relatives are the youngest person they know?” To make this easier to visualize here is the graph:

To answer the given question for ‘David’ here is a plausible graph traversal approach:

If you execute the above you will visit the following nodes in sequence:

So a query about David causes us to visit 15 nodes. As an exercise, you should try writing out the nodes visited if I wanted to know the average number of youngest relatives for the two oldest people in the graph. You should visit 29 nodes.

The KEL methodology is rather different:

Pleasingly it is rather easier to read in KEL than it is in pseudo-code!


Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

This is not a language tutorial but there are a couple of aspects of the above worth noting:

  • The ‘Person:’ defines the scope of the operation. It applies to Person nodes.
  • The space between the ‘:’ and the ‘=>’ allows a precondition for the operation to apply; that will be a later subject.

The interesting piece occurs after the ‘:=’ so we will look at the first in detail:

  • For a given entity all of the associations are available as if they were a child dataset with the label of the association.
  • Implicitly, when in the scope of an entity, the references to any associations fix the first matching entity in the association to self.

So,

COUNT(Relationship)

in the person scope would give you the number of relationships for each person.

For this particular computation we want to filter those associations further, to those in which the OTHER person in the relationship is younger. We do this using: (whoelse.Age<Age). If there is no-one who is younger than you are then you are declared to be the youngest.

Subsequent to this line every Person node has a new property available ‘Youngest’ which can be queried and used as any other property. Thus in the second line, we can compute the number of youngest relatives simply by counting and using the Youngest property as a filter.

The nodes visited to compute the query for David are:

Er, which is 22! That is 7 more visits than the normal method! Why would I go to such pains to make my life harder? Well, now trace through how many extra visits I have to make to compute the average of the two oldest people. The total should be 26; a savings of 3. Now suppose I want to know the average number of youngest relatives across the whole graph; suddenly the KEL solution begins to pull substantially ahead.

So what is happening? In the traditional model, each computation for each entity moves away from the entity to the first degree relatives which in turn have to move to their first degree relatives to compute the answer. Thus most of the computations are performed at the second degree from the original entity being considered.

If viewed programmatically, within the KEL model this second degree query is written as two subsequent first degree queries. Of course, the KEL compiler may or may not choose to execute the query that way, and it will almost certainly have different strategies for Thor and Roxie; but logically that is how it is specified.

We should not, however, view KEL programmatically. Rather we should view KEL, or at least the KEL logic statements, as a graph annotation model, or better a knowledge concentration model. With each logic statement, knowledge that could be gleaned by traveling away from a node is moved and stored within the node itself. Queries then execute against the knowledge rich graph that is produced.

There are, of course, examples when a graph query cannot be decomposed into a succession of simpler queries. Those are slightly harder to code and significantly harder to execute; they will be the subject of a later blog. For now just keep in mind: if you can code it as a succession of near-relative encounters then you should do so. In the meantime we need to tackle heterogeneity; what do you do if not all of your entities are the same type? That will be the subject of our next blog.

In closing I would like to present our full ‘program’ as we have it so far; and the answers:


Person := ENTITY(FLAT(UID,Name,INTEGER Age));
Relationship := ASSOCIATION(FLAT(Person who,Person whoelse));

USE File_Person(FLAT,Person);

USE File_Relat(FLAT,Relationship(who=person1,whoelse=person2)
                   ,Relationship(who=person2,whoelse=person1));

Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

QUERY:Dump1 <= Person{UID,Name,Age,Youngest,YoungRels
                     ,Relationship.whoelse{Name,Age,Youngest}};

Produces:

Incidentally: if anyone has the time and inclination to tackle these queries in alternate graph languages I would be very interested to see the results. I would also love to see if someone wants to have a swing at this in ECL; it will help me keep the KEL compiler writer on their toes.

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

Tips and Tricks for ECL -- Part 1 -- Bit Fiddling

I get fairly frequent emails asking, "How can I ... (do something) in ECL?" I've saved the example code from many of these questions, so now I'll begin sharing them through this blog.

If anybody has this type of question they'd like me to answer (in the blog), please just send me an email to richard.taylor@lexisnexis.com and I will do my best to come up with an ECL example demonstrating the solution.

All the code examples in this series will be stored in the "BlogCode" directory of my repository. So you will see IMPORT BlogCode used frequently.

We'll start now with several questions I've gotten about "bit fiddling" in ECL.

A Couple of Useful Functions

The first thing to know about "bit fiddling" in ECL is that the bitwise operators only work on integer data types. Therefore, the following examples are all designed to work with an UNSIGNED4 bitmap, although you can certainly use them with smaller bitmaps if you need to.

We'll start with a couple of functions I wrote that are generally useful when you're working with bits:

EXPORT Bit(UNSIGNED4 Bitmap,UNSIGNED1 WhichBit) := MODULE
  SHARED Mask := (UNSIGNED4)POWER(2,WhichBit-1);

  EXPORT BOOLEAN   IsOn := Bitmap & Mask <> 0;

  EXPORT UNSIGNED4 Flip := Bitmap ^ Mask;

END;

    I created a MODULE structure to contain these, since they are related functions and both take the same two parameters. They also both use the same expression, so I was able to simplify the code by defining the SHARED Mask definition used by both EXPORT functions.

    For the SHARED Mask definition, the POWER function raises the base value (2) by the exponent defined by WhichBit-1. This makes use of the mathematical fact that an exponent of zero is always 1, an exponent of one always returns the base number, and every other exponent will create the appropriate value so that the Mask (as a decimal value) will always be the value 1, or 2, or 4, or 8, or 16, etc. so that the only bit turned on in the resulting UNSIGNED4 Mask is the one bit in the ordinal position specified by the WhichBit parameter.

    The IsOn function returns a BOOLEAN indicating whether the specified bit is a 1 (on) or 0 (off). It starts by detecting an empty bitmap and immediately returning FALSE in that case, otherwise it will use the Bitwise AND operator (&) with the passed in Bitmap and the Mask to determine if the specified bit is on or not.

    The Flip function returns an UNSIGNED4 as the new bitmap after it has flipped the value of the specified bit. If the passed-in bitmap has a 1 (on) in that bit it will change it to 0 (off), and vice versa. It uses the Bitwise eXclusive OR operator (^) with the passed-in Bitmap and the Mask to flip just the value of the specified bit, leaving all other bits as they were.

    Here's an example of how these are used. This can be run in any builder window:

    IMPORT BlogCode;
    UNSIGNED4 MyBitmap := 4; //or in binary -- 00000000000000000000000000000100b
    
    BlogCode.Bit(MyBitmap,1).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,2).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,3).IsOn;  //returns TRUE
    
    NewBitMap := BlogCode.Bit(MyBitmap,2).Flip; //turn on second bit 
      //making this -- 00000000000000000000000000000110b
    
    BlogCode.Bit(NewBitMap,1).IsOn;  //returns FALSE
    BlogCode.Bit(NewBitMap,2).IsOn;  //returns TRUE
    BlogCode.Bit(NewBitMap,3).IsOn;  //returns TRUE

    ADDENDUM

    When David Bayliss reviewed the above code, he said that I had "just inserted a floating point operation into low-level bit twiddling. This is HORRIBLE." So he suggested an alternative method that just uses the Bitshift Left operator (<<) along with the Bitwise AND (&) and Bitwise eXclusive OR (^) operators, like this:

    EXPORT Bit(UNSIGNED4 Bitmap,UNSIGNED1 WhichBit) := MODULE
    
      EXPORT BOOLEAN   IsOn := (Bitmap & (UNSIGNED4)(1 << (WhichBit-1))) <> 0;
    
      EXPORT UNSIGNED4 Flip := Bitmap ^ ((UNSIGNED4)1 << (WhichBit-1));
    
    END;

    This method removes the need for that POWER function.
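    
    If you save this version over the original Bit module in BlogCode, the earlier builder-window test behaves exactly the same; for example (assuming the bit-shift version above has replaced the POWER-based one):
    
    IMPORT BlogCode;
    //assuming the bit-shift version of Bit is now the one stored in BlogCode
    BlogCode.Bit(4,3).IsOn;                        //returns TRUE  -- 100b, bit 3 is on
    BlogCode.Bit(4,1).IsOn;                        //returns FALSE -- bit 1 is off
    BlogCode.Bit(BlogCode.Bit(4,2).Flip,2).IsOn;   //returns TRUE  -- bit 2 flipped on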

    So What Can You Do With Bitmaps?

    Someone emailed me with this request:

    "Thought you might have a trick up your sleeve to attack this type of encoding. Basically an integer value is used to represent a list of values where each bit represents a separate item.

    "In the example below, each bit set represents an index into a string to get yet another code.

    // Valid codes  " VURQLKGFEDCABIJOWY",   
    // Position      1234567890123456789
    // Number                  position              Code
    //   32 = 2^5,              5  + 1                L
    // 4096 = 2^12              12 + 1                A
    //  512 = 2^9               9  + 1                E
    //  544 = 2^5 + 2^9                               LE

    "How do we write ECL code to translate a number to a code?"

    So here's the first code I wrote to handle this specific problem:

    IMPORT $;
    
    EXPORT Bit2Code(UNSIGNED4 B, STRING32 Codes) := FUNCTION
    
    		Code32 := IF($.Bit(B,32).IsOn,Codes[32],'');
    		Code31 := IF($.Bit(B,31).IsOn,Codes[31],'');
    		Code30 := IF($.Bit(B,30).IsOn,Codes[30],'');
    		Code29 := IF($.Bit(B,29).IsOn,Codes[29],'');
    		Code28 := IF($.Bit(B,28).IsOn,Codes[28],'');
    		Code27 := IF($.Bit(B,27).IsOn,Codes[27],'');
    		Code26 := IF($.Bit(B,26).IsOn,Codes[26],'');
    		Code25 := IF($.Bit(B,25).IsOn,Codes[25],'');
    		Code24 := IF($.Bit(B,24).IsOn,Codes[24],'');
    		Code23 := IF($.Bit(B,23).IsOn,Codes[23],'');
    		Code22 := IF($.Bit(B,22).IsOn,Codes[22],'');
    		Code21 := IF($.Bit(B,21).IsOn,Codes[21],'');
    		Code20 := IF($.Bit(B,20).IsOn,Codes[20],'');
    		Code19 := IF($.Bit(B,19).IsOn,Codes[19],'');
    		Code18 := IF($.Bit(B,18).IsOn,Codes[18],'');
    		Code17 := IF($.Bit(B,17).IsOn,Codes[17],'');
    		Code16 := IF($.Bit(B,16).IsOn,Codes[16],'');
    		Code15 := IF($.Bit(B,15).IsOn,Codes[15],'');
    		Code14 := IF($.Bit(B,14).IsOn,Codes[14],'');
    		Code13 := IF($.Bit(B,13).IsOn,Codes[13],'');
    		Code12 := IF($.Bit(B,12).IsOn,Codes[12],'');
    		Code11 := IF($.Bit(B,11).IsOn,Codes[11],'');
    		Code10 := IF($.Bit(B,10).IsOn,Codes[10],'');
    		Code09 := IF($.Bit(B,09).IsOn,Codes[09],'');
    		Code08 := IF($.Bit(B,08).IsOn,Codes[08],'');
    		Code07 := IF($.Bit(B,07).IsOn,Codes[07],'');
    		Code06 := IF($.Bit(B,06).IsOn,Codes[06],'');
    		Code05 := IF($.Bit(B,05).IsOn,Codes[05],'');
    		Code04 := IF($.Bit(B,04).IsOn,Codes[04],'');
    		Code03 := IF($.Bit(B,03).IsOn,Codes[03],'');
    		Code02 := IF($.Bit(B,02).IsOn,Codes[02],'');
    		Code01 := IF($.Bit(B,01).IsOn,Codes[01],'');
    	  RETURN TRIM(Code01 + Code02 + Code03 + Code04 + Code05 
    	            + Code06 + Code07 + Code08 + Code09 + Code10 
    	            + Code11 + Code12 + Code13 + Code14 + Code15 
    	            + Code16 + Code17 + Code18 + Code19 + Code20 
    	            + Code21 + Code22 + Code23 + Code24 + Code25 
    	            + Code26 + Code27 + Code28 + Code29 + Code30 
    	            + Code31 + Code32,ALL);
    END;

    This function takes two parameters: an UNSIGNED4 bitmap, and a STRING32 containing the valid codes for each position. This is a simple "brute force" approach that tests each bit in the bitmap and takes either the code value for that position or a blank string. The TRIM function uses the ALL option to remove all spaces from the concatenated result string.

    You can test the function in a builder window like this:

    IMPORT BlogCode;
    
    ValidCodes := ' VURQLKGFEDCABIJOWY';
    
    output(BlogCode.Bit2Code(32,ValidCodes),named('Bit2Code_32'));		  //L
    output(BlogCode.Bit2Code(4096,ValidCodes),named('Bit2Code_4096'));	//A
    output(BlogCode.Bit2Code(512,ValidCodes),named('Bit2Code_512'));		//E
    output(BlogCode.Bit2Code(544,ValidCodes),named('Bit2Code_544'));		//LE
    output(BlogCode.Bit2Code(10000000000000000b,ValidCodes),named('Bit2Code_10000000000000000b'));	//O
    output(BlogCode.Bit2Code(10000000010000000b,ValidCodes),named('Bit2Code_10000000010000000b'));	//GO

    Once I had solved his specific problem, I decided to expand the solution to be a more generic tool. Instead of using the bitmap to indicate a set of single letter codes, why not have it indicate a set of strings of any length (up to 4K, in this example)? That's what this next version does:

    IMPORT $;
    
    EXPORT Bit2String  := MODULE
      EXPORT Layout := RECORD
        STRING Txt{MAXLENGTH(4096)};
      END;
    
      EXPORT Func(UNSIGNED4 B, DATASET(Layout) Codes) := FUNCTION
    
        br := ROW({''},Layout);
    
    		Code32 := IF($.Bit(B,32).IsOn,Codes[32],br);
    		Code31 := IF($.Bit(B,31).IsOn,Codes[31],br);
    		Code30 := IF($.Bit(B,30).IsOn,Codes[30],br);
    		Code29 := IF($.Bit(B,29).IsOn,Codes[29],br);
    		Code28 := IF($.Bit(B,28).IsOn,Codes[28],br);
    		Code27 := IF($.Bit(B,27).IsOn,Codes[27],br);
    		Code26 := IF($.Bit(B,26).IsOn,Codes[26],br);
    		Code25 := IF($.Bit(B,25).IsOn,Codes[25],br);
    		Code24 := IF($.Bit(B,24).IsOn,Codes[24],br);
    		Code23 := IF($.Bit(B,23).IsOn,Codes[23],br);
    		Code22 := IF($.Bit(B,22).IsOn,Codes[22],br);
    		Code21 := IF($.Bit(B,21).IsOn,Codes[21],br);
    		Code20 := IF($.Bit(B,20).IsOn,Codes[20],br);
    		Code19 := IF($.Bit(B,19).IsOn,Codes[19],br);
    		Code18 := IF($.Bit(B,18).IsOn,Codes[18],br);
    		Code17 := IF($.Bit(B,17).IsOn,Codes[17],br);
    		Code16 := IF($.Bit(B,16).IsOn,Codes[16],br);
    		Code15 := IF($.Bit(B,15).IsOn,Codes[15],br);
    		Code14 := IF($.Bit(B,14).IsOn,Codes[14],br);
    		Code13 := IF($.Bit(B,13).IsOn,Codes[13],br);
    		Code12 := IF($.Bit(B,12).IsOn,Codes[12],br);
    		Code11 := IF($.Bit(B,11).IsOn,Codes[11],br);
    		Code10 := IF($.Bit(B,10).IsOn,Codes[10],br);
    		Code09 := IF($.Bit(B,09).IsOn,Codes[09],br);
    		Code08 := IF($.Bit(B,08).IsOn,Codes[08],br);
    		Code07 := IF($.Bit(B,07).IsOn,Codes[07],br);
    		Code06 := IF($.Bit(B,06).IsOn,Codes[06],br);
    		Code05 := IF($.Bit(B,05).IsOn,Codes[05],br);
    		Code04 := IF($.Bit(B,04).IsOn,Codes[04],br);
    		Code03 := IF($.Bit(B,03).IsOn,Codes[03],br);
    		Code02 := IF($.Bit(B,02).IsOn,Codes[02],br);
    		Code01 := IF($.Bit(B,01).IsOn,Codes[01],br);
    	  RETURN (Code01 + Code02 + Code03 + Code04 + Code05 
    	            + Code06 + Code07 + Code08 + Code09 + Code10 
    	            + Code11 + Code12 + Code13 + Code14 + Code15 
    	            + Code16 + Code17 + Code18 + Code19 + Code20 
    	            + Code21 + Code22 + Code23 + Code24 + Code25 
    	            + Code26 + Code27 + Code28 + Code29 + Code30 
    	            + Code31 + Code32)(Txt <> '');
      END;   
    END;

    This code uses the same "brute force" approach, but instead of building characters for a string, it defines the records that will go into the result record set. That means that instead of using the TRIM function to get rid of blank spaces, we simply append all the result records into a record set that we filter to eliminate the blank records.

    Testing this version is similar to the previous, but we pass the dataset of text values as the second parameter. Multiple bits turned on result in multiple records in the result set.

    IMPORT BlogCode;
    SetStrings := ['','The Voice','American Idol','The X Factor','Law & Order',                  
                   'Lost','Glee','The Daily Show','Revenge','Hart of Dixie',
                   'Walking Dead','True Blood','Sopranos','Game of Thrones',
                   'Downton Abbey','Poirot','Rizzoli & Isles','Suits','Swamp People',
                   'Pawn Stars','Firefly','AC 360','Fox & Friends','Hardball',
                   'Mike & Molly','60 Minutes','The Ellen Show','Elementary','Sherlock',
                   'Here Comes Honey Boo Boo','Doctor Who'];
    ds := DATASET(SetStrings,BlogCode.Bit2String.Layout);
    
    OUTPUT(BlogCode.Bit2String.Func(110011b,ds));		
    OUTPUT(BlogCode.Bit2String.Func(64,ds));		
    OUTPUT(BlogCode.Bit2String.Func(64928,ds));		
    OUTPUT(BlogCode.Bit2String.Func(0100000000000000000000000000000b,ds));

    Bitmapped Dates

    I was teaching an ECL class one day in Alpharetta, and we were having a discussion of the various ways you can store dates in ECL. Joe Keim came up with this solution.

    This function will squeeze a standard YYYYMMDD 8-byte date string into three bytes of storage in an UNSIGNED3 bitmap. This format has the added advantage that it will still be perfectly sortable, just as a YYYYMMDD string would be.

    EXPORT UNSIGNED3 Date2Bits(STRING8 YYYYMMDD) := FUNCTION
      YY := (UNSIGNED3)YYYYMMDD[1..4] << 9;
      MM := (UNSIGNED3)YYYYMMDD[5..6] << 5;
      DD := (UNSIGNED3)YYYYMMDD[7..8];
      RETURN YY | MM | DD;
    END;

    This Date2Bits function takes the first four characters of the string (YYYY) and casts that value into an UNSIGNED3 that is then shifted left 9 bits. The next two characters of the string (MM) are also cast into an UNSIGNED3 that is then shifted left 5 bits. The last two characters of the string (DD) are simply cast into an UNSIGNED3.

    Those three UNSIGNED3 values are then ORed together using the Bitwise OR operator (|) to create the resulting bit map, where:

    YYYY is in the first 15 bits
    MM is in the next 4 bits
    DD is in the last 5 bits
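    
    As a quick sanity check, here is the arithmetic for the date used in the test code further down (assuming Date2Bits is saved in BlogCode, as with the other examples):
    
    IMPORT BlogCode;
    // For '19500827':
    //   YYYY: 1950 << 9 = 1950 * 512 = 998400
    //   MM  :    8 << 5 =    8 *  32 =    256
    //   DD  :   27                   =     27
    // No bits overlap, so ORing the three is the same as adding them: 998683
    OUTPUT(BlogCode.Date2Bits('19500827'));  //998683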

    You can then use this Bits2Date function to reconstruct the YYYYMMDD string from the bitmap:

    EXPORT STRING8 Bits2Date(UNSIGNED3 d) := FUNCTION
      STRING4 YY := INTFORMAT((d & 111111111111111000000000b) >> 9,4,0);
      STRING2 MM := INTFORMAT((d & 000000000000000111100000b) >> 5,2,1);
      STRING2 DD := INTFORMAT((d & 000000000000000000011111b),2,1);
      RETURN YY + MM + DD;
    END;

    This function uses the Bitwise AND operator (&) against a binary mask, which is then shifted right 9 bits to produce the integer value for the YYYY. The INTFORMAT function then right-justifies that integer value into a 4-byte string with leading blanks. The MM and DD values are treated the same way, except their STRING2 results are formatted with leading zeroes. The final YYYYMMDD string is a simple concatenation of the three.

    You can test these functions with code like this in a builder window:

    IMPORT BlogCode;
    d1 := '19500827';
    
    UNSIGNED3 date1 := BlogCode.Date2Bits(d1);
    date2 := BlogCode.Bits2Date(date1);
    
    ds2 := DATASET([{'InDate',d1},
                    {'OutBitVal',Date1},
                    {'Re-Date',Date2}],
                   {string10 txt,string10 val});
    							 
    OUTPUT(ds2,NAMED('Bits2Date_' + d1));

    Viewing the Bitmap

    Sometimes you just want to see what the bitmap actually looks like. You can, of course, just output the integer value, but a bitmap is best viewed in binary representation. So when Peter Vennel needed to create a string representation of a bitmap, here's my code to do that:

    IMPORT $;
    
    EXPORT STRING32 Bit2Str(UNSIGNED4 Bin) := FUNCTION 
    
      STRING1 Bin2Str(UNSIGNED1 Bit) := IF($.Bit(Bin,Bit).IsOn,'1','0');
    
      RETURN Bin2Str(32) + Bin2Str(31) + Bin2Str(30) + Bin2Str(29) + 
             Bin2Str(28) + Bin2Str(27) + Bin2Str(26) + Bin2Str(25) + 
             Bin2Str(24) + Bin2Str(23) + Bin2Str(22) + Bin2Str(21) + 
             Bin2Str(20) + Bin2Str(19) + Bin2Str(18) + Bin2Str(17) + 
             Bin2Str(16) + Bin2Str(15) + Bin2Str(14) + Bin2Str(13) + 
             Bin2Str(12) + Bin2Str(11) + Bin2Str(10) + Bin2Str(9) + 
             Bin2Str(8)  + Bin2Str(7)  + Bin2Str(6)  + Bin2Str(5) + 
             Bin2Str(4)  + Bin2Str(3)  + Bin2Str(2)  + Bin2Str(1); 
    END;

    This simple function just relies on the previously defined Bit.IsOn function to generate either a "1" or "0" character in each bit position. The result is just the concatenation of all 32 characters, showing you exactly what the bitmap looks like.

    Test with this code in a builder window and you'll see the result:

    IMPORT BlogCode;
    
    val := 532860;
    
    OUTPUT(BlogCode.Bit2Str(6));   //00000000000000000000000000000110
    OUTPUT(BlogCode.Bit2Str(val)); //00000000000010000010000101111100

    Convert to Hexadecimal

    I got an email from Jarvis Robinson, who needed a way to display text and the corresponding Hexadecimal values side by side. So I wrote this String2HexString function to do that:

    EXPORT String2HexString(STRING DataIn) := FUNCTION
    
      STRING2 Str2Hex(STRING1 StrIn) := FUNCTION
        STRING1 HexVal(UNSIGNED1 val) := 
    		      CHOOSE(val,'1','2','3','4','5','6','7','8',
    				 '9','A','B','C','D','E','F','0');
        UNSIGNED1 Char1 := (>UNSIGNED1<)StrIn >> 4;
        UNSIGNED1 Char2 := ((>UNSIGNED1<)StrIn & 00001111b);
        RETURN HexVal(Char1) + HexVal(Char2);
      END;
    	
      Rec := {STRING Hex{MAXLENGTH(1024)}}; 
      ds := DATASET(LENGTH(TRIM(DataIn)),
                    TRANSFORM(Rec,
      	            SELF.Hex := Str2Hex(DataIn[COUNTER])));
      HexOut := ROLLUP(ds,TRUE,TRANSFORM(Rec,SELF.Hex := LEFT.Hex + RIGHT.Hex));
      RETURN DATASET([{DataIn,HexOut[1].Hex}],
                     {STRING Txt{MAXLENGTH(1024)},
                      STRING Hex{MAXLENGTH(1024)}});
    END;

    This function starts with the nested Str2Hex function, which produces two hexadecimal characters for a single string character. It has its own HexVal function that returns the hex character for a given nibble value; note that the list is ordered '1' through 'F' with '0' last, because CHOOSE is 1-based and its final value acts as the ELSE, so a nibble value of zero returns '0'. The Char1 definition uses the (>UNSIGNED1<) type-transfer shorthand syntax to treat the StrIn character as an UNSIGNED1 value, so that the Bitshift Right operator (>>) will work (bitwise operations can only be done on integer types). This moves the left 4 bits into the right-hand nibble, producing a value from zero through 15. The Char2 definition uses the same type-transfer shorthand so that the Bitwise AND operator (&) will work; this removes the left 4 bits, again producing a value from zero through 15. The return value concatenates the two hex characters into the STRING2 result.

    The key to this function being able to work with any size string (up to 512 bytes, as written, but you can modify that if needed) is the DATASET definition. This form of DATASET creates a new dataset with as many records as there are characters in the DataIn parameter. More importantly, the inline TRANSFORM populates each record with the hexadecimal equivalent of each character in DataIn.

    That leaves only the ROLLUP to combine all those records into a single record with all the hex values concatenated into a single string. The inline form of DATASET then allows a nicely formatted one-record result presenting both the input text and its hex equivalent.

    Testing the code this way will show its usefulness:

    IMPORT BlogCode;
    STRING dat := 'efgh\tBCD'; //x'6566676809424344'
    
    OUTPUT(BlogCode.String2HexString(dat));

    Note that the input string contains a tab (\t) character, which can be difficult to see in text mode. But looking at the result in hex mode, you can clearly see the 09 hex representation of the tab.

    That's enough for this article. We'll continue with more useful code examples in subsequent articles.

    Adventures in GraphLand (Part I)

    Well over a decade ago, I had the privilege of being one of the first programmers to use ECL. A new language at a new level of abstraction allowed us to think about data absent of process. We were no longer working in the realm of time and method, but in the realm of data. We wanted a name for this place that our minds could now wander; so we called it ‘Dataland’. Since then, many hundreds of programmers have found their way through Dataland; usually heading onto our bigger, stronger but possibly less romantic environments. A few linger to test out new ideas, new platform features or better ways to do things we have done in the past; but ultimately the dream of being able to play with data is no longer a dream: It is the reality for hundreds of people every day.

    The question naturally arises, at least to those of us with very short attention spans, what if we could abstract ourselves a level further? What if we could leave behind the world of tables and joins and instead operate in a world where the knowledge just ‘was’? Where questions could be asked and answered; knowledge derived and, somehow, all the data just fell into line? That is the promise of KEL, the Knowledge Engineering Language being produced by Eric Blood, Sr. Consulting Software Engineer, and his team.

    Once again, I am in the privileged position of being among the first programmers to use KEL. As I was hammering out some of my first programs, often with the cursing and mutterings of someone who is well beyond the laceration edge, I realized that the language was changing the way I thought about the data. This was not just a way to navigate Dataland faster; I was entering a new place: GraphLand.

    Most of my early travails in Dataland have been lost in the sands of time, a stream of adrenaline or the fog of amnesia. As the situation is now (slightly) more leisurely, I thought it might be worth recording some of my early adventures in GraphLand with the hope of paving the way for those that follow.

    Naturally big data gets exciting because it is big and real-world; Jo Prichard, Sr Architect and Data Scientist, is exploring some big and very interesting datasets housed within LexisNexis, as we speak. I have given myself a rather different task; I am creating and using a micro-dataset so that I can hand-compute, test and visualize exactly what is happening. My experience is that things that are understood in microcosm can then be intuited at scale.

    Sadly, before GraphLand can be truly explored, a bridge needs to be built from Dataland to GraphLand. Much like the cooking shows, where they suddenly announce: “here is one I prepared earlier”, many writings on graph systems assume that somehow the graph magically sprung forth into existence. In contrast, I aim to spend the rest of this blog describing one such bridge (albeit a simple one).

    We are going to start with two datasets; one a simple table of people, and the other one a table indicating relationships between them. In our Dataland model we have given all of our entities unique identifiers (indeed, a huge amount of the work done in Dataland is precisely to manufacture these unique identifiers). For this micro-dataset I have allocated nice small numbers by hand.

    People:

    Relationships:

    Within KEL I now need to do two things; tell KEL what my GraphLand entities will look like; and tell KEL where in Dataland to find them. For such a simple data model the first is almost trivial:

    Person := ENTITY(FLAT(UID,Name,INTEGER Age));
    Relationship := ASSOCIATION(FLAT(Person who, Person whoelse));
    

    We declare that there are people, with three properties: a unique ID, a name, and an age. We then declare that relationships are associations between two people. That’s the GraphLand model done.

    For the people; telling KEL where to find the Dataland version of the file is equally easy:

      USE File_Person(FLAT,Person);
    

    File_Person is the Dataland attribute containing my microfile; and we are telling KEL to extract people from it. The slightly harder case is the relationships. In my Dataland file I have one record for a relationship between two people. In GraphLand I want to represent the fact that either person can be the primary (who) person in that relationship. This is handled in the USE statement:

    USE File_Relat(FLAT,Relationship(who=person1,whoelse=person2)
                       ,Relationship(who=person2,whoelse=person1));
    

    You can pull multiple entities and associations from a single record of a file. Some of our more complex records (such as a vehicle sale) might have half a dozen entities and relationships in one record, although the precise nature of the linkages is normally encapsulated entirely within people’s minds or within labels upon fields. This implicit structure in Dataland is rendered explicit as the data is bridged into GraphLand.

    Having brought the data in, it is usually worth ensuring that nothing got lost; the easiest way is to get GraphLand to do a data dump.

     QUERY:Dump <= Person,Relationship;
    

    Produces:

    If you look at the data, not much has changed (this is good). Each field has gained a __flags column. As things get more complex, KEL will start to use those flags to capture metadata about each computed field value. There is also a __recordcount to tell us how many Dataland records attest to each fact. The final thing to notice is that we have twice as many relationships, due to the two Relationship mappings in our USE statement.

    Logically I should now declare success and finish this entry; however I don’t want to leave without giving some small inkling of why we would bother to have done all this work. So here is a modified version of the KEL query, in which I use the relationships to give myself a record in which my relations are listed.

    QUERY:Dump1 <= Person{UID,Name,Age,{Relationship.whoelse.Name}};
    

    Produces:

    For those of you back in the land of tables and joins, producing the table above from the raw data requires two joins and a project (minimum), as the sketch below illustrates. Here in GraphLand, things are rather different.
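    
    For the curious, here is a rough ECL sketch of one way to do it back in Dataland, written against the two raw files (the record layouts, definition names and field types are my own assumptions, and it only covers one direction of each relationship):
    
    // Two JOINs: attach each relationship to its 'who' person, then resolve
    // the 'whoelse' person's name. Field names UID, Name, Age, person1 and
    // person2 are assumed from the KEL declarations and USE statements above.
    Mid := RECORD
      UNSIGNED UID;        // the 'who' person
      STRING   Name;
      INTEGER  Age;
      UNSIGNED OtherUID;   // the 'whoelse' person, resolved by the second join
    END;
    
    Mid j1(RECORDOF(File_Person) p, RECORDOF(File_Relat) r) := TRANSFORM
      SELF.OtherUID := r.person2;
      SELF := p;
    END;
    WhoPlusRel := JOIN(File_Person, File_Relat, LEFT.UID = RIGHT.person1, j1(LEFT, RIGHT));
    
    Out := RECORD
      UNSIGNED UID;
      STRING   Name;
      INTEGER  Age;
      STRING   RelatedName;
    END;
    
    Out j2(Mid m, RECORDOF(File_Person) p) := TRANSFORM
      SELF.RelatedName := p.Name;
      SELF := m;
    END;
    OUTPUT(JOIN(WhoPlusRel, File_Person, LEFT.OtherUID = RIGHT.UID, j2(LEFT, RIGHT)));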

    Adventures in Graphland Series
    Part I - Adventures in GraphLand (Part I)
    Part II - Adventures in GraphLand (The Wayward Chicken)
    Part III - Adventures in GraphLand III (The Underground Network)
    Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
    Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

    Why SKIPing can be bad for your query

    The most common way of filtering out records is to apply a filter to a dataset. E.g.,

    Centenarians := myDataset(age >= 100);
    

    However, sometimes the easiest place to decide whether or not a record is required is within the transform that creates it. ECL has the SKIP keyword to indicate that a record shouldn’t be generated from a transform(1). SKIP has two different syntaxes:

    1. SKIP as an attribute on a transform

    outRecord process(inRecord l) := TRANSFORM,SKIP(l.age >= 100)
    …
    

    Here the transform doesn’t generate a record if the skip condition is true.

    2. SKIP as a side-effect.

    outRecord process(inRecord l) := TRANSFORM
       somethingComplex := f(l);
       SELF.id := IF(somethingComplex, l.id, SKIP);
       SELF.next := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, SKIP);
    …
    

    Here the transform doesn’t generate a record if the SKIP expression is evaluated. This is often used when the expression being tested is complex, and the output record is being rejected as an exception.

    However this form also comes with a cost. The problem is that because the assignment to the field next has the side-effect of potentially skipping, the code generator cannot optimize the field away if it isn’t subsequently used.

    Both forms of SKIP can also prevent the transform from being merged with subsequent activities(2).

    So what are the alternatives?

    For the situations where nothing else will do (e.g., SKIP on a JOIN transform), it is much better to use the first form than the second, because it doesn’t prevent output fields from being optimized away. But what if the filter condition is something complex and you don’t want to duplicate the ECL? You can continue to share common definitions by defining a function that returns a transform. E.g.,

    process(inRecord l) := FUNCTION
       somethingComplex := f(l);
       outRecord t := TRANSFORM, SKIP(NOT somethingComplex)
          SELF.id := l.id;
          …
       END;
       RETURN t;
    END;
    

    You can also implement something similar with a complex test condition - if you can use an exception value for the case that will be skipped:

    process(inRecord l) := FUNCTION
       nextValue := CASE(l.kind, 1=>l.x, 2=>l.y, 3=>l.z, 0); // 0 isn’t otherwise a valid value.
       outRecord t := TRANSFORM, SKIP(nextValue = 0)
          SELF.next := nextValue;
          …
       END;
       RETURN t;
    END;
    

    For a PROJECT you are probably better off removing the SKIP from the transform, and filtering the resulting dataset instead. That will allow the filter condition to potentially be migrated through the query (skips are not currently migrated outside a transform).
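    
    Here is a minimal, self-contained sketch of that pattern (the record layout, the data and the Std.Str.ToUpperCase call are just stand-ins for whatever real work your transform does):
    
    IMPORT Std;
    
    rec := {STRING20 name, UNSIGNED1 age};
    people := DATASET([{'Ada',103},{'Bob',42},{'Cy',100}], rec);
    
    rec upperName(rec l) := TRANSFORM
       SELF.name := Std.Str.ToUpperCase(l.name);  // normal assignments only; no SKIP side-effects
       SELF := l;
    END;
    
    projected := PROJECT(people, upperName(LEFT));
    // Apply the condition as a filter instead of a SKIP, so it can be migrated through the query:
    centenarians := projected(age >= 100);
    OUTPUT(centenarians);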

    In summary… If you need to use SKIP, try to use the first form. If you can replace SKIP with a filter, you are likely to end up with even better code.

    (1) In some activities like ITERATE, SKIP can have the subtly different meaning of don’t process this record.
    (2) A couple of reasons why: If the activities are different then SKIP may mean something different, and when SKIP is used as a side-effect, combining transforms may cause the SKIP to be lost.
