Adventures in GraphLand IV (The Subgraph Matching Problem)
I once opened a meeting of extremely senior folks, all of whom specialized in large scale graph analytics, by stating: “There is no such thing as an interesting large scale graph problem; but there are lots of huge graphs containing interesting small problems that we can’t find.” My statement was clearly hyperbolic, and exceptions do exist, although I would contend it is rather closer to the truth than most people would like to admit. For reasons covered under ‘The Wayward Chicken’, a huge, highly connected graph begins to look a lot like a population average. Therefore, if our graph analytics are really going to buy us anything, we need to be able to narrow down the area of the graph across which we are working.
One of the most important graph narrowing mechanisms is the Subgraph Isomorphism problem or, more informally, the “subgraph matching” problem. The core idea here is that I have a little graph of interconnected facts in my hand, and I want to find a place (or places) in a much larger graph where that little graph appears. This seemingly simple ability is the engine behind our patented MEOW technology that maps fragments of entities in documents to social networks. It is the same idea that drives our advanced person search capabilities. In short, if there were a single graph problem that we cared about being able to do well, it would be this one. This leads me neatly to the second major feature of the Subgraph Matching problem: the general case is NP-complete. Put another way, it is excruciatingly difficult to tackle this problem well.
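To make the cost of the general case concrete, here is a minimal brute-force sketch in Python (not KEL, and not how MEOW or any production system works). It tries every assignment of pattern nodes to graph nodes and keeps those in which every pattern edge is present. The function name `find_matches` and the toy graph are invented for illustration; edges are treated as undirected.

```python
from itertools import permutations

def find_matches(pattern_edges, graph_edges):
    """Return every mapping of pattern nodes onto distinct graph nodes
    under which each pattern edge is also an edge of the graph."""
    pattern_nodes = sorted({n for e in pattern_edges for n in e})
    graph_nodes = sorted({n for e in graph_edges for n in e})
    # Treat edges as undirected for simplicity.
    graph_set = {frozenset(e) for e in graph_edges}
    matches = []
    # Trying every injective assignment is the step that blows up
    # combinatorially as the graph and the pattern grow.
    for candidate in permutations(graph_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all(frozenset((mapping[a], mapping[b])) in graph_set
               for a, b in pattern_edges):
            matches.append(mapping)
    return matches

# Hypothetical toy data: a triangle pattern and a five-node graph
# that contains exactly one triangle (nodes 1, 2 and 3).
pattern = [("x", "y"), ("y", "z"), ("x", "z")]
graph = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
matches = find_matches(pattern, graph)
print(len(matches))
```

Even on this tiny graph the loop examines 60 candidate assignments to find the six orderings of a single triangle; the number of candidates grows explosively with the size of the graph, which is why narrowing the search matters so much.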
This all leads us to the obvious question: why have we progressed this far through our tour of dataland without touching upon the subgraph matching problem? The answer is twofold. The first part lies in my slightly weaselly use of the phrase ‘in the general case.’ Whilst some subgraph matching problems are horrible, some are relatively benign. I believe that through judicious precomputation and intelligent query design it is possible to render many, many subgraph matching problems tractable. Secondly, KEL is designed so that the “subgraph matching problem” (and most other graph queries) just ‘drops out’ once the logic elements have been defined. The purpose of the rest of this blog entry is to persuade you that this is the case.
For the following we will use the datasets defined in the previous three blogs. To start simply: “Find all the people that own or drive a Honda”:
QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') ) );
Here the outer level of our filter expression is ‘Person’, so we have already confined the search to person nodes. You can think of the ‘inside’ of a filter condition as: “go to the node and then try…” So for each Person node we will walk down the PerVeh link if there is one, then follow it to the Veh entity and check if it is a Honda. A quick review of the Person->Vehicle table we produced in the last blog shows that the answer should be Helen and Matthew; running the code confirms this.
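For readers who find it easier to think procedurally, the walk described above can be sketched in Python. This is only an analogy for how the filter reads, not how KEL executes it. Every row below is invented for illustration (the real datasets come from the earlier posts), and `has_vehicle` is a hypothetical helper standing in for EXISTS.

```python
# Hypothetical toy data, chosen so the answer mirrors the one in the text.
vehicles = {
    "V1": {"Make": "HONDA", "Colour": "BLUE"},
    "V2": {"Make": "FORD", "Colour": "RED"},
    "V3": {"Make": "HONDA", "Colour": "SILVER"},
}
# PerVeh links: (person, vehicle, type of link)
per_veh = [
    ("Helen", "V1", "OWNS"),
    ("Matthew", "V3", "DRIVES"),
    ("David", "V2", "OWNS"),
]
people = ["David", "Helen", "Matthew"]

def has_vehicle(person, predicate):
    """EXISTS analogue: start at the person, walk each PerVeh link,
    visit the vehicle at the far end and test it."""
    return any(predicate(vehicles[v]) for p, v, _ in per_veh if p == person)

# Outer level is Person, so we only ever iterate over person nodes.
honda_people = [p for p in people
                if has_vehicle(p, lambda veh: veh["Make"] == "HONDA")]
print(honda_people)
```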
Suppose we decide we want to hunt for people that own or drive a Honda AND a car which is blue. A quick glance at the above code probably tells you that:
QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') ) ,EXISTS(PerVeh.Veh(Colour='BLUE') ) );
Returns the one person able to satisfy both conditions.
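One point worth drawing out is that, as I read the query, the two EXISTS clauses are evaluated independently, so the Honda and the blue car need not be the same vehicle. The Python sketch below is an analogy under that assumption; the data and the helper `has_vehicle` are invented for illustration.

```python
# Hypothetical toy data: Helen has a red Honda and a blue Ford.
vehicles = {
    "V1": {"Make": "HONDA", "Colour": "RED"},
    "V2": {"Make": "FORD", "Colour": "BLUE"},
    "V3": {"Make": "HONDA", "Colour": "SILVER"},
    "V4": {"Make": "TOYOTA", "Colour": "BLUE"},
}
per_veh = [
    ("Helen", "V1"),
    ("Helen", "V2"),
    ("Matthew", "V3"),
    ("David", "V4"),
]
people = ["David", "Helen", "Matthew"]

def has_vehicle(person, predicate):
    """EXISTS analogue: true if any linked vehicle passes the predicate."""
    return any(predicate(vehicles[v]) for p, v in per_veh if p == person)

# Two separate existence checks joined by AND; each one scans the
# person's vehicles from scratch, so different vehicles may satisfy them.
both = [
    p for p in people
    if has_vehicle(p, lambda veh: veh["Make"] == "HONDA")
    and has_vehicle(p, lambda veh: veh["Colour"] == "BLUE")
]
print(both)
```

If the intent were a single vehicle that is both a Honda and blue, both conditions would presumably go inside one Veh( ) filter instead.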
COUNT can be swapped for EXISTS to ensure a given entity has multiple matching items. For example: “who has at least two older relatives?”
QUERY: Find <= Person( COUNT(Relationship.whoelse(Person.Age<Age)) >= 2 );
The only ‘hard’ part in the above is the Person.Age<Age: who is Person and who is ‘local’? Remember, the way to read this expression is that you start at the node represented by the outer level; each time you meet () following an entity or association you are evaluating that node, and you travel whenever you see a period. Each time you meet () following an aggregate word (COUNT, EXISTS, SUM etc.) you are about to follow multiple paths to accumulate a result. So here we start at a given person node. We scan down each of the relationships and travel to each whoelse in turn. We then evaluate that whoelse node. So inside those () the ‘Age’ refers to the local node (the whoelse). To find the age of the original person we were scanning we need to use Person.Age. Thus to encode ‘has an older relative’, the relative has an Age > Person.Age.
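The scoping rule can be sketched in Python, where the distinction between the starting node and the local node becomes two differently named variables. This is an analogy only; the names, ages and the (who, whoelse) row layout are assumptions made for the example.

```python
# Hypothetical toy data.
ages = {"Alice": 30, "Bob": 55, "Carol": 60, "Dan": 25}
# Relationship rows, assumed here to be stored as (who, whoelse).
relationships = [
    ("Alice", "Bob"),
    ("Alice", "Carol"),
    ("Dan", "Alice"),
    ("Bob", "Carol"),
]

def older_relative_count(person):
    """COUNT analogue: follow every relationship out of `person`
    and count the relatives who are older."""
    person_age = ages[person]           # Person.Age: the node we started from
    count = 0
    for who, whoelse in relationships:
        if who != person:
            continue
        local_age = ages[whoelse]       # plain Age: the node we travelled to
        if person_age < local_age:      # Person.Age < Age
            count += 1
    return count

two_older = [p for p in ages if older_relative_count(p) >= 2]
print(two_older)
```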
Using exactly these techniques it is possible to specify an arbitrary (constant) subgraph and return all the matches to it. The slightly tedious part is that you have to write new code each time; bit of a shame you can’t just parameterize everything. Of course you can:
QUERY: Find(_Age,Name2) <= Person( _Age, Relationship.whoelse(Name=Name2) );
This query illustrates a number of language features:
- You can parameterize queries; in 0.4 it simply means that a parameterized module is created to execute the query. In 0.5 it will also cause keys to be built and a roxie service to be available.
- The _Param case is special; it is for the very common case where you wish to filter an entity based upon the equivalent of a property value. Person( _Age…. ) means ‘of all those people with Age=_Age value’
- The nested relationship query shows the more ‘regular’ use of a parameter (Name2).
- For this query I have also made use of the implicit EXISTS. A dataset used in a Boolean context evaluates to false if the dataset is empty.
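The features in the list above map loosely onto an ordinary function with two arguments. The Python sketch below is an analogy, not a description of the module KEL generates; the function name `find`, the data and the (who, whoelse) row layout are assumptions for illustration.

```python
# Hypothetical toy data.
ages = {"Alice": 30, "Dan": 30, "Bob": 55}
relationships = [("Alice", "Bob"), ("Dan", "Alice")]  # assumed (who, whoelse)

def find(age, name2):
    """Analogue of Person( _Age, Relationship.whoelse(Name=Name2) )."""
    return [
        person
        for person, person_age in ages.items()
        # _Age: keep only people whose Age equals the parameter.
        if person_age == age
        # Implicit EXISTS: any() is False when no relationship row matches,
        # just as an empty dataset is false in a Boolean context.
        and any(who == person and whoelse == name2
                for who, whoelse in relationships)
    ]

print(find(30, "Bob"))
print(find(30, "Alice"))
```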
The foregoing hopefully illustrates that the subgraph matching problem can be specified within KEL; what we have not done is make any use of the properties that we have spent three blogs manufacturing. So here is a harder question: “which people co-own a vehicle?”
QUERY: Find <= Person( PerVeh.Veh(Person.PerVeh.Type='OWNS', nOwners > 1) );
This only contains things you have seen before. Notice the Person.PerVeh.Type to pull information from up-path and the nOwners that behaves as if it belongs to the vehicle (even though it was manufactured by travelling to other nodes). The latter point is crucial; one of the driving costs of the NP-complete nature of the subgraph matching problem is the complexity of the query. Having given ourselves a way to render the run-time part of the query simpler, we have significantly improved performance.
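The benefit of precomputation can be sketched in Python: the owner count is worked out once per vehicle, and the query itself then only has to look at one link and one number. This is an analogy, and it assumes nOwners is simply the number of OWNS links per vehicle; the data is invented for illustration.

```python
from collections import Counter

# Hypothetical toy data: (person, vehicle, type of link).
per_veh = [
    ("Helen", "V1", "OWNS"),
    ("Matthew", "V1", "OWNS"),
    ("David", "V1", "DRIVES"),
    ("David", "V2", "OWNS"),
]

# Precomputation: done once, ahead of any query. Afterwards the count
# behaves like an ordinary property of the vehicle.
n_owners = Counter(veh for _, veh, link_type in per_veh if link_type == "OWNS")

# Query time: for each person, check the type of their own link
# (Person.PerVeh.Type='OWNS') and the precomputed count (nOwners > 1).
co_owners = sorted({
    person
    for person, veh, link_type in per_veh
    if link_type == "OWNS" and n_owners[veh] > 1
})
print(co_owners)
```

Without the precomputed count, answering the same question would mean walking from each person to the vehicle and back out to every other person at query time.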
That about wraps up the subgraph matching problem; I do however wish to finish with a warning:
We have seen the code to find people with two older relatives. I suspect that if I were to set an exercise to find people that have access to at least two vehicles you would produce:
QUERY: Find <= Person( COUNT(PerVeh.Veh) > 2);
QUERY: Find <= Person( COUNT(PerVeh{Veh}) > 2 );
Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)