The other evening my family was watching some of our younger tortoises acclimatize to a new enclosure. One of them fell off of a log and face-planted into some recently dropped fecal matter. Far from being perturbed she immediately opened her mouth and started eating. My teenage son was entirely grossed and exclaimed: “Ew; she’s eating poop!” My wife, looking somewhat perplexed responded: “Yes, and that’s not even the good stuff!”
For those not familiar with reptile coprophagy; young tortoises will often eat droppings from older tortoises as it provides useful stomach fauna and a range of partially digested nutrients that the babies might not otherwise have access to. The older (and more senior) the tortoise; the higher the quality of the scat. In the world of baby tortoises: senior tortoise poop is definitely “the good stuff”.
The reason for this memorable if unpalatable introduction is to assert the notion that sometimes we need to ingest into Graphland data that contains useful information even if the current form is not immediately appealing. “The good stuff” often comes in a form suitable for re-digestion; not in the form suitable for knowledge engineering.
In case you are wondering if this is a reprise of ‘GraphLand V’ (dealing with dirty data), it isn’t. Even if this data is ‘the good stuff’ it may still take some significant work and planning to hammer it into a nice Entity Model.
Probably the commonest, and certainly most important case, is where the incoming data is in the form of multi-entity transactions. As a simplified but essentially real world example: this is how vehicle and person data usually turn up:
Each record represents a transaction. Two people are selling a single vehicle to two people. Each transaction therefore provides partial information regarding five different entities. There are also a number of implicit associations I might choose to derive, bought, sold and potentially ‘co-owns’ to represent that two people appeared on the same side of a transaction. The question is how do we take this rather messy looking data and bring it into KEL?
The first rule is that you do NOT construct your entity model based upon the data. You base your model upon how you wish you had the data. In fact we already have this entity model; so I’ll replicate it here with a couple of tweaks.
Hopefully the above will make sense to you; but just in case:
- We have a person entity, two properties – Name and Age. MODEL(*) forces every property to be single valued.
- We have a vehicle entity, two properties – Make and Colour
- We have a ‘CoTransacts’ association between two people
- We have an association from People to Vehicles for which the TYPE is defined in the association.
The question then becomes how do we extract the data from our big, wide transaction record into our entity model? We will start off by extracting the people. We have four to extract from one record. We will do this using the USE statement. You have seen the ‘USE’ statement already – but this one is going to be a little scarier:
- First note that there are 4 Person sections to the USE clause. That is one for each person we are extracting.
- The first of the Person clauses uses the field label override syntax to extract a particular did, name and age from the record.
- The remaining three do exactly the same thing; but they use a piece of syntactic sugar to make life easier. If you say owner2_* then KEL will look for every default label in the entity prepended by owner2_.
Dumping the person entity we now get:
Note that all eight entities from the transaction are in the data (David and Kelly have a _recordcount of 2).
Bringing in the Vehicle is also easy; but it illustrates a useful point:
The field override syntax is used for the UID (otherwise it would expect a UID or ‘uid’ without the prefix). The other two fields (make and colour) are in the record with the default names so they do not need to be overridden. If you like typing; you can fill all of the fields in for completeness; but you don’t need to.
With the entities in, it is time to pull in the associations. CoTransacts first:
The override syntax to assign the unique IDs should be fairly comfortable by now. One thing that might surprise you is that I am using TWO associations for one relationship. I don’t have to do this – I can put one relationship in and walk both ways – but sometimes you want to do the above. We will tackle some of the subtler relationship types in a later blog. The above gives:
By now you should immediately spot that the two different instances of a relationship between 1&2 have been represented using __recordcount = 2.
This is one of those rare cases I am prepared to concede that late-typing an association is useful. We are almost certainly going to want to compare/contrast buy and sell transactions so giving them the same type is useful. So, when registering the relationships from a transaction, I use the ‘constant assignment’ form of the USE syntax to note that there are two buying and two selling relationships being created here. The result:
We have captured everything in the original transaction that is represented in our model. From each transaction record we produce four entity instances and eight association instances. We saw how common consistent naming can produce very succinct KEL (and the work around when the naming is hostile).
In closing I want to present a more complex model that keeps track of transaction dates. I am going to track both the dates over which people Cotransact and also when the buy-sell transactions happen. The association syntax IS quite a bit more exotic than the preceding which I’ll expound upon the details in a later blog.
- Only the ASSOCIATIONs changed
- The ASSOCIATIONs now have a MODEL.
- For CoTransacts this says that a given who/whoelse pair will have one (and only one) association of this type, and we keep track of all the transaction dates
- For PerVeh we have one association for every Per/Veh pair. We then keep a table (called Transaction) detailing the Type and Date of each transaction
With this declaration and the previous data we get CoTransactions:
The two associations with two transactions now carry the date of the transaction. For PerVeh we get:
Many traditional data system take one of three easy views of data structure. Either they work on the data in the format it is in or they assume someone else has massaged the data into shape or they assume data has no real shape.
Even if some of the details are a little fuzzy, and building a strong Entity Model is a non-trivial task, I hope that I have convinced you that in GraphLand you should not take the easy way out. Knowledge has structure and we need to define that structure (using ENTITY, ASSOCIATION and MODEL). If we have to USE data in a structure that is currently unpalatable; we have a digestive system that is able to do so.
Adventures in Graphland Series
Part I - Adventures in GraphLand
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand (Graphland gets a Reality Check)
Part VI - Adventures in GraphLand (Coprophagy)