
Adventures in GraphLand VI (Coprophagy)

The other evening my family was watching some of our younger tortoises acclimatize to a new enclosure. One of them fell off a log and face-planted into some recently dropped fecal matter. Far from being perturbed, she immediately opened her mouth and started eating. My teenage son was entirely grossed out and exclaimed: “Ew; she’s eating poop!” My wife, looking somewhat perplexed, responded: “Yes, and that’s not even the good stuff!”

For those not familiar with reptile coprophagy; young tortoises will often eat droppings from older tortoises as it provides useful stomach fauna and a range of partially digested nutrients that the babies might not otherwise have access to. The older (and more senior) the tortoise; the higher the quality of the scat. In the world of baby tortoises: senior tortoise poop is definitely “the good stuff”.

The reason for this memorable if unpalatable introduction is to assert the notion that sometimes we need to ingest into GraphLand data that contains useful information even if its current form is not immediately appealing. “The good stuff” often comes in a form suitable for re-digestion, not in a form suitable for knowledge engineering.

In case you are wondering if this is a reprise of ‘GraphLand V’ (dealing with dirty data), it isn’t. Even if this data is ‘the good stuff’ it may still take some significant work and planning to hammer it into a nice Entity Model.

Probably the most common, and certainly the most important, case is where the incoming data is in the form of multi-entity transactions. As a simplified but essentially real-world example, this is how vehicle and person data usually turn up:

Each record represents a transaction. Two people are selling a single vehicle to two people. Each transaction therefore provides partial information regarding five different entities. There are also a number of implicit associations I might choose to derive: bought, sold and potentially ‘co-owns’ to represent that two people appeared on the same side of a transaction. The question is how do we take this rather messy looking data and bring it into KEL?
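The sample records themselves are shown as a figure; for the sketches that follow, assume each wide transaction record carries fields along these lines (the names are illustrative assumptions on my part – only the owner2_ prefix and the did/name/age trio are mentioned in the text):

buyer1_did, buyer1_name, buyer1_age, buyer2_did, buyer2_name, buyer2_age,
owner1_did, owner1_name, owner1_age, owner2_did, owner2_name, owner2_age,
vehicle_uid, make, colour, date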

The first rule is that you do NOT construct your entity model based upon the data. You base your model upon how you wish you had the data. In fact we already have this entity model; so I’ll replicate it here with a couple of tweaks.

Hopefully the above will make sense to you; but just in case:

  1. We have a person entity, two properties – Name and Age. MODEL(*) forces every property to be single valued.
  2. We have a vehicle entity, two properties – Make and Colour
  3. We have a ‘CoTransacts’ association between two people
  4. We have an association from People to Vehicles for which the TYPE is defined in the association.
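In KEL, declarations along these lines would capture that model. This is only a sketch: the who/whoelse and per/veh role names are taken from the notes later in this post, and the remaining details are my assumptions rather than the author's exact code.

// a sketch of the entity model described above – details assumed
Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(*));
Vehicle := ENTITY(FLAT(UID,Make,Colour));
CoTransacts := ASSOCIATION(FLAT(Person who,Person whoelse));
PerVeh := ASSOCIATION(FLAT(Person per,Vehicle veh,Type));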

The question then becomes how do we extract the data from our big, wide transaction record into our entity model? We will start off by extracting the people. We have four to extract from one record. We will do this using the USE statement. You have seen the ‘USE’ statement already – but this one is going to be a little scarier:

  • First note that there are 4 Person sections to the USE clause. That is one for each person we are extracting.
  • The first of the Person clauses uses the field label override syntax to extract a particular did, name and age from the record.
  • The remaining three do exactly the same thing, but they use a piece of syntactic sugar to make life easier. If you say owner2_* then KEL will look for every default label in the entity prefixed with owner2_.
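Putting those bullets together, the USE being described has roughly this shape. This is a sketch only: the dataset name and the buyer1_/buyer2_/owner1_ prefixes are illustrative assumptions, and only the owner2_ prefix comes from the text.

// illustrative only – dataset and field names assumed
USE File_Transactions(FLAT,
     Person(UID=buyer1_did,Name=buyer1_name,Age=buyer1_age),
     Person(buyer2_*),
     Person(owner1_*),
     Person(owner2_*));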

Dumping the person entity we now get:

Note that all eight entities from the transaction are in the data (David and Kelly have a _recordcount of 2).

Bringing in the Vehicle is also easy; but it illustrates a useful point:

The field override syntax is used for the UID (otherwise it would expect a field called UID or ‘uid’ without the prefix). The other two fields (make and colour) are in the record with the default names, so they do not need to be overridden. If you like typing, you can fill in all of the fields for completeness, but you don’t need to.
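In sketch form (again with assumed dataset and field names), only the UID needs the override:

// make and colour are picked up by their default labels
USE File_Transactions(FLAT,Vehicle(UID=vehicle_uid));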

With the entities in, it is time to pull in the associations. CoTransacts first:
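What follows is not the author's exact statement, but a sketch of the two-way registration described below, using the assumed field names from earlier:

// each pair is registered in both directions – two associations per relationship
USE File_Transactions(FLAT,
     CoTransacts(who=buyer1_did,whoelse=buyer2_did),
     CoTransacts(who=buyer2_did,whoelse=buyer1_did),
     CoTransacts(who=owner1_did,whoelse=owner2_did),
     CoTransacts(who=owner2_did,whoelse=owner1_did));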

The override syntax to assign the unique IDs should be fairly comfortable by now. One thing that might surprise you is that I am using TWO associations for one relationship. I don’t have to do this – I can put one relationship in and walk both ways – but sometimes you want to do the above. We will tackle some of the subtler relationship types in a later blog. The above gives:

By now you should immediately spot that the two different instances of a relationship between 1&2 have been represented using __recordcount = 2.

Finally PerVeh:
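Again as a sketch rather than the author's exact code – the 'BOUGHT'/'SOLD' labels and all field names are assumptions – the constant-assignment form described below might look like:

// Type is assigned a constant per clause rather than read from a field
USE File_Transactions(FLAT,
     PerVeh(per=buyer1_did,veh=vehicle_uid,Type='BOUGHT'),
     PerVeh(per=buyer2_did,veh=vehicle_uid,Type='BOUGHT'),
     PerVeh(per=owner1_did,veh=vehicle_uid,Type='SOLD'),
     PerVeh(per=owner2_did,veh=vehicle_uid,Type='SOLD'));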

This is one of those rare cases where I am prepared to concede that late-typing an association is useful. We are almost certainly going to want to compare/contrast buy and sell transactions, so giving them the same type is useful. So, when registering the relationships from a transaction, I use the ‘constant assignment’ form of the USE syntax to note that there are two buying and two selling relationships being created here. The result:

We have captured everything in the original transaction that is represented in our model. From each transaction record we produce four entity instances and eight association instances. We saw how common, consistent naming can produce very succinct KEL (and the work-around when the naming is hostile).

In closing I want to present a more complex model that keeps track of transaction dates. I am going to track both the dates over which people CoTransact and also when the buy-sell transactions happen. The association syntax IS quite a bit more exotic than the preceding; I’ll expound upon the details in a later blog.

Notes:

  • Only the ASSOCIATIONs changed
  • The ASSOCIATIONs now have a MODEL.
  • For CoTransacts this says that a given who/whoelse pair will have one (and only one) association of this type, and we keep track of all the transaction dates
  • For PerVeh we have one association for every Per/Veh pair. We then keep a table (called Transaction) detailing the Type and Date of each transaction

With this declaration and the previous data we get, for CoTransacts:

The two associations with two transactions now carry the date of the transaction. For PerVeh we get:

Many traditional data systems take one of three easy views of data structure: either they work on the data in the format it is in, or they assume someone else has massaged the data into shape, or they assume data has no real shape.

Even if some of the details are a little fuzzy, and building a strong Entity Model is a non-trivial task, I hope that I have convinced you that in GraphLand you should not take the easy way out. Knowledge has structure and we need to define that structure (using ENTITY, ASSOCIATION and MODEL). If we have to USE data in a structure that is currently unpalatable, we have a digestive system that is able to do so.

In pursuit of perils: Geo-spatial risk analysis through HPCC Systems

I have always wanted to say I was working on big data projects, and in my own time I have dabbled with frameworks such as Hadoop on Amazon EC2, but I could not make the case for merging my day job with the big data aspiration. Enter LexisNexis® Risk Solutions, which acquired our company in 2013 and which, guess what, excels in big data through HPCC Systems® (http://hpccsystems.com/), its open source big data processing platform that powers its data services business. At last, there was potential to merge the big data aspiration with the day job. Nice.

Our solution is called Map View, which is used by over 8,000 home and commercial property underwriters worldwide. Map View is a geo-spatial risk intelligence solution for the insurance industry, which for the most part consists of correlating existing and potential policy property locations with commercial and open peril data. This peril data informs the insurer what existing and potential risk exposure exists at the provided locations. Peril data consists of models pertaining to such risks as coastal flood, river flood, surface water, earthquake, subsidence, crime, wind and more. The insurer uses this risk assessment to decide what the suitable premium should be based on the severity of the discovered risks. Insurers also use this information to review clusters of risk exposure at various locations.

So that’s what we have been doing, and it has been going well from a technical point of view, but the bottleneck has always been scale. Although we optimised our algorithms to play nicely with the spatial relational database platform we were using, it still did not deliver the order-of-magnitude performance improvements we needed.

Now for the crux of what this blog entry is about: leveraging HPCC Systems as our geo-spatial analysis engine and repository of our vector geometry and raster data.

The geo-spatial algorithms and binary geo-reference file handlers that we had been using did not exist in HPCC Systems, but Enterprise Control Language (ECL), the programming language for HPCC Systems, is extensible. It was clear that, to bring our geo-spatial processes into the realm of big data, we needed to extend ECL, and that is when we reached out to our colleagues in the greater LexisNexis family: the HPCC Systems algorithms group within LexisNexis technology, who meet on a regular basis. The group was the perfect sounding board for our first experiments in using bespoke C++ with HPCC Systems. Thanks to these guys we got a C++ plug-in skeleton up and running.

A few of us met in the Dublin office to determine the core spatial capabilities we would need in HPCC Systems. Having established this list, rather than implementing algorithms from scratch we turned to some geo-spatial libraries we were already familiar with, which include most if not all of what we were looking for. There was no need to cherry-pick: we could just integrate the libraries wholesale and expose what we needed to ECL.

The main capabilities we are looking for fall into 4 categories:

  1. Spatial filtering of vector geometries
  2. Spatial operations using vector geometries
  3. Spatial reference projection and transformation
  4. Reading of compressed geo-raster files

The libraries we focused on were GDAL/OGR, GEOS, and Proj.4. The GDAL/OGR library, built to include the use of the GEOS library, provides spatial filtering satisfying the DE-9IM spatial predicates (e.g. within, touches, intersects, etc.) and also provides various spatial operations such as distance and convex hull. OGR, which is attached at the hip to GDAL, is a simple feature library; it accepts vector geometries as Well Known Text (WKT) wrapped as a string literal. The Proj.4 library encapsulates functionality for the projection of spatial references (e.g. WGS84, UTM, Web Mercator, or NAD83); it is also used for the transformation of coordinates from one spatial reference system to another. GDAL provides the functionality that allows for the reading of GeoTIFF binary raster files.

Sample use case: augmenting building data with derived peril fields

The following map shows various buildings (labelled 1-7), a fire station, a police station, and a large flood zone based on a 500m buffer either side of a river.

For this sample use case, the business objective is to augment an existing dataset of buildings with peril information by correlating their physical location with available geographic peril data. The ECL executed within HPCC Systems will use the GDAL library to perform the distance and point-in-polygon spatial operations.

Image: a sample map showing the elements used for augmenting building data with derived peril fields

In particular we want to augment the building data with 3 perils:

  1. Fire risk category based on distance away from fire station
  2. Crime risk category based on distance away from police station
  3. Flood risk flag based on being within a flood zone

We start off using this buildings dataset.

An image illustrating the use of an ECL Transform to augment building data with inferred geo-spatial values

The ECL used to produce the above result:

import $.GeometryLite as Geometry; 

BuildingLayout := RECORD
  STRING geom;
  INTEGER4 id;
END;

PerilAugmentedBuildingLayout := RECORD
  BuildingLayout;
  INTEGER4 distanceToFireStation;
  STRING1 fireRiskCategory;
  INTEGER4 distanceToPoliceStation;
  STRING1 crimeRiskCategory;
  BOOLEAN isAtRiskOfFlood;
END;

// A recordset of points in WKT and using WGS84 as the coordinate system
buildings_RSet := DATASET(
  [
    {'POINT(-84.26180076961152565 34.07911885643056848)',	7},
    {'POINT(-84.28096707737662996 34.07386384848369687)',	6},
    {'POINT(-84.28034594703238724 34.06441871524530995)',	5},
    {'POINT(-84.27511070555951278 34.07632602535307598)',	4},
    {'POINT(-84.26401909226950693 34.07132809899221115)',	3},
    {'POINT(-84.2606472418293464 34.06904953470633046)',	2},
    {'POINT(-84.25332677705793571 34.07173235399688593)',	1}
  ],
  BuildingLayout
);

// SRID = Spatial Reference System Identifier, and in this case correlates to the matching EPSG id (http://www.epsg.org/)
// Universal Transverse Mercator (UTM) Zone 16 North... X,Y in meters, good for showing local distances
UTMZ16N_SRID := 32616; 

// World Geodetic System (WGS) ... Longitude,Latitude good for using as the base coordinate system
WGS84_SRID := 4326; 

// The location of the fire station given as WGS84 and also converted to a local UTM point
fire_station_point := 'POINT(-84.27361647195701266 34.07592838651884648)';
fire_station_point_UTMZ16N := Geometry.toSRID(fire_station_point,WGS84_SRID,UTMZ16N_SRID);

// The location of the police station given as WGS84 and also converted to a local UTM point
police_station_point := 'POINT(-84.28388903577211977 34.06841445050786632)';
police_station_point_UTMZ16N := Geometry.toSRID(police_station_point,WGS84_SRID,UTMZ16N_SRID);

// The large flood zone...pre-generated from a line representing a segment of river path and buffered 500m both sides of that line
river_flood_buffer_polygon := 'POLYGON ((-84.275569480814269 34.053149448508812,-84.275502368216152 34.053622020238336,-84.275461577548469 34.053856279819868,-84.275406062884485 34.054088433147875,-84.275335977314512 34.054317839528537,-84.275251514146206 34.054543865844074,-84.275152906372298 34.054765888300402,-84.275040426028681 34.05498329414894,-84.274914383444596 34.055195483377922,-84.274775126387269 34.055401870368861,-84.274623039102778 34.055601885513035,-84.274458541256465 34.055794976783979,-84.274282086775159 34.055980611261425,-84.274094162595034 34.056158276602602,-84.273895287317842 34.056327482456609,-84.273686009780079 34.056487761818282,-84.27346690753815 34.056638672317504,-84.271467966412999 34.057943350956521,-84.271449849097706 34.057955123901209,-84.269457062666064 34.059244375758553,-84.267464263063459 34.060541242500669,-84.267230385493164 34.060685184060439,-84.266987555911015 34.060818462943388,-84.26673647534858 34.060940694367893,-84.266477868667863 34.061051525446047,-84.266212482468021 34.061150636202917,-84.265941082929231 34.06123774050058,-84.265664453599811 34.061312586864624,-84.265383393133249 34.06137495921039,-84.265341784003652 34.061382226124834,-84.265314901719449 34.061585486783336,-84.265310183323066 34.061619943334122,-84.265111905163181 34.063019950080957,-84.26507118648162 34.063254301019093,-84.265015731142569 34.063486548072767,-84.264945692179822 34.063716049839776,-84.264861262907829 34.063942172494826,-84.264762676389054 34.064164291540052,-84.264650204791437 34.064381793530238,-84.264524158637713 34.064594077767168,-84.264384885948843 34.064800557959067,-84.264232771283574 34.065000663840181,-84.264068234677239 34.065193842746154,-84.263891730482143 34.06537956114078,-84.263703746113279 34.065557306090064,-84.26350480070252 34.06572658667919,-84.263295443665044 34.065886935368937,-84.263076253182007 34.066037909287367,-84.262751452284149 34.066249907042753,-84.262614616529277 34.06728718811172,-84.262609735088645 34.067322879116531,-84.262411456952634 34.068722885868624,-84.262370565581591 34.06895807170757,-84.262314831409327 34.069191136361503,-84.262244409341704 34.069421431556577,-84.262159495144019 34.069648316717327,-84.262060324897817 34.06987116074864,-84.26194717434521 34.070089343791523,-84.261820358123074 34.070302258947599,-84.261680228888764 34.070509313967584,-84.261527176339996 34.070709932899128,-84.261361626131588 34.070903557689206,-84.261184038692122 34.071089649736948,-84.260994907943669 34.071267691392244,-84.260794759928146 34.071437187396057,-84.260584151344403 34.071597666258612,-84.260363667999641 34.071748681571322,-84.258364726197328 34.073045730610005,-84.258134929191982 34.073186884602521,-84.257896506679955 34.073317758170241,-84.257650122125995 34.07343798711355,-84.257396461160667 34.073547236854345,-84.257136229671985 34.073645203367583,-84.257122298651311 34.073649727584275,-84.256948086173011 34.073809590460101,-84.256741016186922 34.073980186060624,-84.256523217046393 34.07414126483485,-84.254624040614956 34.075471446270598,-84.254724400109581 34.075597079684485,-84.254869780406139 34.075801719022742,-84.255001988909029 34.076012453715521,-84.255120657537674 34.076228697222078,-84.255225455889686 34.076449847663952,-84.255316092161522 34.076675289500066,-84.255392313961806 34.076904395239524,-84.255453909015031 34.077136527188117,-84.25550070575369 34.077371039223024,-84.255532573797112 34.077607278591138,-84.255549424315745 34.077844587725856,-84.255551210279833 34.078082306077405,-84.255537926591671 
34.078319771951413,-84.255509610101271 34.078556324350842,-84.255466339505119 34.078791304815994,-84.255161247280483 34.080198945384261,-84.255103785139212 34.080429419521792,-84.255031936455268 34.080657082246773,-84.254945897195938 34.08088131225167,-84.254868679322044 34.081051343080176,-84.237240155349298 34.081049738427836,-84.237280208789684 34.080978287916651,-84.237412814623895 34.080770733580913,-84.237558247861628 34.080569215695952,-84.237716112647789 34.080374282655967,-84.237885979306 34.080186464932545,-84.238067385508458 34.080006273631433,-84.238259837534528 34.079834199102123,-84.238462811614468 34.079670709603825,-84.238675755355032 34.07951625003173,-84.238898089242582 34.079371240706635,-84.239129208220191 34.079236076231616,-84.23936848333372 34.079111124418603,-84.239615263443213 34.07899672528783,-84.239868876994265 34.078893190142935,-84.240128633844932 34.07880080072416,-84.240393827143151 34.078719808441974,-84.240663735249441 34.078650433693163,-84.240937623699878 34.078592865261356,-84.241214747203784 34.078547259803486,-84.241494351670951 34.078513741423592,-84.241775676262534 34.078492401335303,-84.242057955460552 34.078483297613602,-84.242276383939895 34.078485739280964,-84.242492265400855 34.078344838751413,-84.242669896687488 34.07823372485575,-84.243425452296762 34.077780983762665,-84.243369611642905 34.077645512126423,-84.243292744497253 34.077422098004902,-84.243229822889347 34.077195683226336,-84.243181014084243 34.0769668700799,-84.243146447802232 34.07673626722972,-84.243126215874952 34.076504488095637,-84.243120372002366 34.076272149221445,-84.24312893161121 34.076039868634815,-84.243151871815115 34.075808264203388,-84.243189131476854 34.075577951991299,-84.243461269006659 34.074171169484238,-84.243661428789082 34.072767093882938,-84.243859572282346 34.071365470820695,-84.243900570306963 34.071129462053882,-84.243956514650279 34.070895588727439,-84.244027248504821 34.070664505863085,-84.244112573646277 34.07043686066212,-84.244212250989861 34.070213290693019,-84.244326001261058 34.06999442210612,-84.244453505778935 34.069780867880262,-84.244594407349553 34.069573226106428,-84.244748311267287 34.069372078313052,-84.244914786420949 34.069177987837776,-84.245093366501834 34.068991498250192,-84.245283551310166 34.068813131829963,-84.24548480815632 34.068643388104555,-84.24569657335293 34.068482742450705,-84.245918253793548 34.06833164476361,-84.246149228613717 34.068190518197326,-84.246388850929605 34.068059757980151,-84.246636449649188 34.067939730308254,-84.246891331351378 34.06783077132043,-84.247152782227332 34.067733186157128,-84.247420070078917 34.067647248106269,-84.247692446368404 34.06757319783808,-84.24796914831397 34.067511242731527,-84.248249401024921 34.067461556293573,-84.248295264997253 34.067455515234712,-84.248322821978107 34.067402432756531,-84.248449719882316 34.067189575141107,-84.248589924187186 34.066982580508565,-84.248743044927195 34.066782024461098,-84.248908656228465 34.066588464693275,-84.249086297493619 34.066402439441589,-84.249275474683074 34.066224465988277,-84.249475661689175 34.066055039223322,-84.249686301799471 34.065894630268701,-84.249906809244763 34.065743685168833,-84.250136570827962 34.065602623650683,-84.250374947629027 34.065471837957027,-84.250621276781317 34.06535169175622,-84.250874873314345 34.065242519131296,-84.25113503205796 34.065144623651378,-84.251401029602519 34.065058277527839,-84.251672126309671 34.064983720857754,-84.251947568368251 34.064921160956473,-84.252035386818392 
34.064905301684796,-84.25204925128071 34.064800299544302,-84.252093345968433 34.064537746115043,-84.252398611067974 34.063042402369028,-84.252453765203541 34.06280954236135,-84.252523575220749 34.062579422357807,-84.252607847320334 34.062352680810513,-84.252706347584663 34.062129946793526,-84.252818802627843 34.061911838257828,-84.25294490035526 34.06169896031723,-84.253084290830429 34.061491903569909,-84.253236587246619 34.061291242460207,-84.253401367000805 34.061097533685391,-84.253578172866739 34.060911314651555,-84.253766514263859 34.060733101983146,-84.253965868618735 34.060563390090095,-84.254175682815074 34.0604026497967,-84.254395374728318 34.060251327035772,-84.254617325955067 34.06010647236338,-84.254750577269093 34.059099870797915,-84.254794130832508 34.058840403128563,-84.255099396160134 34.057341244692481,-84.255154469659445 34.05710822660803,-84.255224216885622 34.056877947209394,-84.255308443981519 34.056651046156581,-84.255406916874094 34.056428153720624,-84.255519361925792 34.056209889033113,-84.255645466695697 34.055996858366782,-84.255784880808292 34.055789653451782,-84.255937216927606 34.055588849832475,-84.256102051833835 34.055395005269098,-84.256278927599539 34.055208658188938,-84.25646735286206 34.055030326191243,-84.256666804188725 34.054860504609906,-84.256876727531036 34.054699665138081,-84.257096539763594 34.054548254518501,-84.258948432267829 34.053339603420973,-84.275569480814269 34.053149448508812))';



// A transform used to augment the details of a building with derived peril information
PerilAugmentedBuildingLayout PerilAugmentTransform(BuildingLayout L) := TRANSFORM
  // convert coordinate reference system of point to UTM 16 N so that 
  // distance is calculated in meters
  STRING geom_UTMZ16N := Geometry.toSRID(L.geom,WGS84_SRID,UTMZ16N_SRID);
  
  // Perform distance calculations
  SELF.distanceToFireStation := Geometry.distanceBetween(geom_UTMZ16N, fire_station_point_UTMZ16N,UTMZ16N_SRID);
  SELF.distanceToPoliceStation := Geometry.distanceBetween(geom_UTMZ16N, police_station_point_UTMZ16N,UTMZ16N_SRID);
  
  // Perform point in polygon assertion
  SELF.isAtRiskOfFlood := Geometry.isWithin(L.GEOM,river_flood_buffer_polygon,WGS84_SRID);
  
  // Apply rules for fire and crime
  SELF.fireRiskCategory := MAP(
              SELF.distanceToFireStation <= 500 => 'D',
              SELF.distanceToFireStation <= 1000 => 'C',
              SELF.distanceToFireStation <= 1500 => 'B',
              'A'
            );
  SELF.crimeRiskCategory := MAP(
              SELF.distanceToPoliceStation <= 600 => 'D',
              SELF.distanceToPoliceStation <= 1200 => 'C',
              SELF.distanceToPoliceStation <= 1900 => 'B',
              'A'
            );
  SELF := L;
END;

// Generate a new recordset with Peril details augmented
buildingsWithPerilsRset := PROJECT(buildings_RSet, PerilAugmentTransform(LEFT));

OUTPUT(buildingsWithPerilsRset,NAMED('Buildings_With_Perils')); 

	

Integrating spatial libraries into HPCC Systems

Integrating the spatial libraries into ECL via Inline C++:

Inline C++ within ECL is a convenient mechanism to extend ECL; it does not require building from the HPCC Systems source code and we do not need any external team to support incremental releases.

Installing the spatial C++ libraries, getting your Ubuntu based HPCC platform ready:

To use the spatial libraries within ECL BEGINC++ functions, you will still need to install the spatial libraries (once) on each node of the HPCC Systems cluster.

We installed this onto the 4.X Ubuntu VM image available for download on the HPCC Systems website (http://hpccsystems.com/download/hpcc-vm-image).

The minimum versions of the spatial libraries used in this article:

  • GEOS >= 3.2.2
  • PROJ.4 >= 4.7.0
  • GDAL >= 1.10.1

File : [ install.sh ]
  • Copy "install.sh" via ssh to a folder on the HPCC Systems Ubuntu VM
  • Uncomment the dependencies line if required
  • Execute "install.sh" to install the spatial libraries
#!/bin/bash

# dependencies : if not already installed
# sudo apt-get -y install g++ gcc make cmake bison flex binutils-dev libldap2-dev libcppunit-dev libicu-dev libxalan110-dev zlib1g-dev libboost-regex-dev libssl-dev libarchive-dev  libapr1-dev libaprutil1-dev subversion

# install PROJ.4
sudo apt-get -y install libproj-dev

# install GEOS
sudo apt-get -y install libgeos-dev

# compile and install GDAL
svn co https://svn.osgeo.org/gdal/branches/1.10/gdal gdal_stable

cd gdal_stable
./configure  --prefix=/usr --with-threads --with-ogr --with-geos --without-libtool --with-libz=internal --with-libtiff=internal --with-geotiff=internal --with-gif --without-pg --without-grass --without-libgrass --without-cfitsio --without-pcraster --without-netcdf --with-png --with-jpeg --without-ogdi --without-fme --without-hdf4 --without-hdf5 --without-jasper --without-ecw --without-kakadu --without-mrsid --without-jp2mrsid --without-bsb --without-grib --without-mysql --without-ingres --with-xerces --without-expat --without-odbc --with-curl --without-sqlite3 --without-dwgdirect --without-panorama --without-idb --without-sde --without-perl --without-php --without-ruby --without-python --without-ogpython --with-hide-internal-symbols
sudo make install
cd ../

	
File : [GeometryLite.ecl]

Contains a Geometry module "GeometryLite", truncated for the purposes of demonstration to include just four spatial functions: SpatialReferenceForSRID, distanceBetween, tranformToProjection, and hasSpatialRelation.

IMPORT $;

EXPORT GeometryLite := MODULE

      EXPORT ExtLibSpatial := MODULE
        /* SpatialReferenceForSRID

              Given a SRID return the WKT representing the Spatial Projection Details for that SRID
        */
        export STRING SpatialReferenceForSRID(INTEGER4 srid) :=
                BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

using namespace std;

#body
// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure

char *wktOut;

// determine the spatial reference details
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromEPSG(srid);

poSRS->exportToWkt(&wktOut);

// copy string into a char array
unsigned len = strlen(wktOut);
char * out = (char *) malloc(len);
for(unsigned i=0; i < len; i++) {
    out[i] = wktOut[i];
}

// free resources
free(wktOut);
OGRSpatialReference::DestroySpatialReference(poSRS);


//return result to ECL
__result = out;

// set length of return string
__lenResult = len;
ENDC++;
/* tranformToProjection

             Transform a geometry from one SRID projection to another
*/
EXPORT string tranformToProjection(const string  geom,  STRING srs1, STRING srs2):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

        using namespace std;

#body
// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure

OGRGeometry *thisGeom;
char *wkt;
char* wktIn = (char*) geom;


// determine the spatial reference details
char* wktSRSSourceIn = (char*) srs1;
OGRSpatialReference *sourceSRS = new OGRSpatialReference(NULL);
sourceSRS->importFromWkt(&wktSRSSourceIn);

char* wktSRSTargetIn = (char*) srs2;
OGRSpatialReference *targetSRS = new OGRSpatialReference(NULL);
targetSRS->importFromWkt(&wktSRSTargetIn);


// create geometry from given WKT
OGRErr err = OGRGeometryFactory::createFromWkt(&wktIn, sourceSRS, &thisGeom);

thisGeom->transformTo(targetSRS);

thisGeom->exportToWkt(&wkt);

unsigned len = strlen(wkt);

// copy string into a char array
char * out = (char *) malloc(len);
for(unsigned i=0; i < len; i++) {
    out[i] = wkt[i];
}

//return result to ECL
__result = out;

// set length of return string
__lenResult = len;

free(wkt);
OGRSpatialReference::DestroySpatialReference(sourceSRS);
OGRSpatialReference::DestroySpatialReference(targetSRS);
OGRGeometryFactory::destroyGeometry(thisGeom);
ENDC++;


/* distanceBetween

        Get the distance between the 2 given WKT geometries; the distance unit returned depends on the SRID used
*/
EXPORT REAL8 distanceBetween(const string  geom1, const string  geom2, STRING srs):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

        using namespace std;

#body

// #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure


// determine the spatial reference details
char* wktSRSIn = (char*) srs;
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromWkt(&wktSRSIn);

bool hasAtLeastOneValidRelation = false;

char* wktInLeft = (char*) geom1;
char* wktInRight = (char*) geom2;

OGRGeometry *leftOGRGeom;
OGRGeometry *rightOGRGeom;

bool loadedOK = false;
OGRErr err =  NULL;

err = OGRGeometryFactory::createFromWkt(&wktInLeft, poSRS, &leftOGRGeom);
loadedOK = (err == OGRERR_NONE);

err = OGRGeometryFactory::createFromWkt(&wktInRight, poSRS, &rightOGRGeom);
loadedOK = (err == OGRERR_NONE);

double distance = leftOGRGeom->Distance(rightOGRGeom);

OGRGeometryFactory::destroyGeometry(leftOGRGeom);
OGRGeometryFactory::destroyGeometry(rightOGRGeom);
OGRSpatialReference::DestroySpatialReference(poSRS);

return distance;
ENDC++;

/* hasSpatialRelation

     Do the two given WKT geometries have at least one of the expected relations defined in relationTypeORBits [a single INT containing OR'd bits]

     @see <a href="http://en.wikipedia.org/wiki/DE-9IM">Wikipedia</a>

     usage:
     hasSpatialRelation("POINT(? ?)","POLYGON((? ?,? ?,? ?,? ?,? ?))", rel.WITHIN | rel.OVERLAPS, SRS(4326));


     @param geom1 STRING containing a WKT geometry, left side of predicate assertion
     @param geom2 STRING containing a WKT geometry, right side of predicate assertion
     @param rel INTEGER contains one or more bits representing what spatial relations should be evaluated
     @param srs the WKT Spatial reference details as got from Operation.SRS
*/
EXPORT boolean hasSpatialRelation(const string  geom1, const string  geom2, INTEGER rel, STRING srs):=
        BEGINC++
#option library 'geos'
#option library 'proj'
#option library 'gdal'

// #option once : Indicates the function has no side effects and is evaluated at query execution time, even if the parameters are constant, allowing the optimizer to make more efficient calls to the function in some cases.
#option once

#include <iostream>
#include <sstream>
#include <string>
#include "ogrsf_frmts.h" // GDAL
#include "cpl_conv.h"
#include "gdal_priv.h"

    /**
    Enumeration of all supported relation types
    */
        namespace RelationType {
    enum SpatialPredicate {
        INTERSECTS = 1 << 0,
        TOUCHES = 1 << 1,
        DISJOINT = 1 << 2,
        CROSSES = 1 << 3,
        WITHIN = 1 << 4,
        CONTAINS = 1 << 5,
        OVERLAPS = 1 << 6,
        EQUALS = 1 << 7
    };

    bool isBitwiseSpatialPredicate(int packedInteger, RelationType::SpatialPredicate predicate) {
        return (packedInteger & predicate) == predicate ;
    }
}

using namespace std;

#body

  // #body : tell HPCC that everything up to this point is in global scope and that
// the following section is to be encapsulated into a function/procedure


// determine the spatial reference details
char* wktSRSIn = (char*) srs;
OGRSpatialReference * poSRS = new OGRSpatialReference(NULL);
poSRS->importFromWkt(&wktSRSIn);

bool hasAtLeastOneValidRelation = false;

char* wktInLeft = (char*) geom1;
char* wktInRight = (char*) geom2;

OGRGeometry *leftOGRGeom;
OGRGeometry *rightOGRGeom;

bool loadedOK = false;
OGRErr err =  NULL;

// parse geom 1
err = OGRGeometryFactory::createFromWkt(&wktInLeft, poSRS, &leftOGRGeom);
loadedOK = (err == OGRERR_NONE);

if(loadedOK) {
    // parse geom 2
    err = OGRGeometryFactory::createFromWkt(&wktInRight, poSRS, &rightOGRGeom);
    loadedOK = (err == OGRERR_NONE);

    if(loadedOK) {
        // assert if a relation exists
        int relationTypePackedBitwise = rel;
        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::INTERSECTS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Intersects(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::TOUCHES)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Touches(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::DISJOINT)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Disjoint(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::CROSSES)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Crosses(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::WITHIN)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Within(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::CONTAINS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Contains(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::OVERLAPS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Overlaps(rightOGRGeom);
        } 

        if( !hasAtLeastOneValidRelation && RelationType::isBitwiseSpatialPredicate(relationTypePackedBitwise , RelationType::EQUALS)) {
            hasAtLeastOneValidRelation = leftOGRGeom->Equals(rightOGRGeom);
        }
        // clean right
        OGRGeometryFactory::destroyGeometry(rightOGRGeom);
    }
    // clean left
    OGRGeometryFactory::destroyGeometry(leftOGRGeom);
}
// return result
return hasAtLeastOneValidRelation;
ENDC++;

EXPORT SRS :=  SpatialReferenceForSRID;
END;

EXPORT Filter :=  MODULE
      /*
      Bitwise enumeration for all possible Spatial Relations

      Can be combined e.g.

      All WITHIN or TOUCHING OR INTERSECTING = RelationType.WITHIN | RelationType.TOUCHES | RelationType.INTERSECTS
      */
      EXPORT RelationType := ENUM
        (
            INTERSECTS = 1 << 0,
            TOUCHES = 1 << 1,
            DISJOINT = 1 << 2,
            CROSSES = 1 << 3,
            WITHIN = 1 << 4,
            CONTAINS = 1 << 5,
            OVERLAPS = 1 << 6,
            EQUALS = 1 << 7
        );


/*
     hasSpatialRelation

     Does [this] and [thatOther] have one of the bitwise RelationTypes given in [relationTypeORBits] ?
*/
EXPORT BOOLEAN hasSpatialRelation(const string  this, const string  thatOther, INTEGER relationTypeORBits,
                                  INTEGER4 srid) := FUNCTION
                                              STRING srs := ExtLibSpatial.SRS(srid);
return ExtLibSpatial.hasSpatialRelation(this,thatOther,relationTypeORBits, srs);
END;

/*
    isWithin
*/
EXPORT BOOLEAN isWithin(const string  thisGeom, const string  thatOtherGeom, INTEGER4  srid) := FUNCTION
            return hasSpatialRelation(thisGeom,thatOtherGeom,RelationType.WITHIN, srid);
END;
END;

EXPORT Operation :=  MODULE
      EXPORT STRING tranformToProjection(const string geometryWKT, INTEGER4 sourceSRID, INTEGER4 targetSRID) := FUNCTION
                  STRING srs1 :=  ExtLibSpatial.SRS(sourceSRID);
                  STRING srs2 :=  ExtLibSpatial.SRS(targetSRID);

           return ExtLibSpatial.tranformToProjection(geometryWKT,srs1,srs2);
END;

/*
    distanceBetween

    Calculate the distance between 2 points, using the projection given by srid
*/
EXPORT REAL8 distanceBetween(const string  point_A_WKT, const string  point_B_WKT, INTEGER4 srid) := FUNCTION
            STRING srs := ExtLibSpatial.SRS(srid);
return ExtLibSpatial.distanceBetween(point_A_WKT,point_B_WKT,srs);
END;
END;

EXPORT toSRID :=  Operation.tranformToProjection;
EXPORT distanceBetween :=  Operation.distanceBetween;
EXPORT isWithin :=  Filter.isWithin;
END;

	

What about the binary raster files?

Raster files like the GeoTIFF format represent a large matrix of peril values where each pixel correlates to x meters; x relates to a fixed resolution. Using the dfuplus HPCC Systems client tool we were able to spray all raster files contained in the HPCC Systems drop-zone into the data/blob fields of a dataset. We then used GDAL through inline C++ to read the blob data for each record (1 record per raster file) and retrieve the raster values corresponding to an X,Y coordinate.

Representing geometries in datasets (best practice, not prescription):

EXPORT Layout := RECORD
  STRING GEOM := '';
END;
	

All the geometries used in this blog post have been defined as WKT (Well Known Text) string literals. WKT is widely accepted and widely supported. In pursuit of a canonical way of defining geometries, I think that a record layout with a GEOM field containing a WKT string literal is a good start.

For XML datasets based on geo-reference data like GML (Geography Mark-up Language), the geometry elements could be translated into WKT for the purposes of canonical search and retrieval. Once relevant matching WKT GEOM records are found through a process of spatial filtering, the associated detail required could then be retrieved from the original GML record.

About the author:

Andrew is a software engineer based in the LexisNexis Risk Solutions office in Dublin, Ireland. His role primarily consists of developing geo-spatial risk solutions using Java. He is an advocate of clean code and has a keen interest in working with distributed systems.

Acknowledgements:

Thanks to everyone for the help along the way:

  • Algorithms team:
    • Charles Kaminski
    • Greg Thompson
    • Jack Coleman
  • GIS Modelling Team, Dublin, Ireland:
    • Daniel Bang
  • Product Development Team, Dublin, Ireland:
    • Jason Moore
    • Greg McRandal
  • HPCC Systems Core Team:
    • Richard Chapman
    • Gordon Smith
    • Gavin Halliday
  • Training:
    • Bob Foreman
    • Alan Daniels


HPCC Systems 5.0

5.0 has been released! Right on time for the festivities celebrating the third anniversary of our Open Source HPCC Systems platform!

Together with a plethora of tweaks and improvements, it features a number of new capabilities: internationalization and translation into Chinese, Spanish, Hungarian, Serbian and Portuguese; a redesigned ECL Watch, which offers a significant face-lift over our traditional management interface; a large number of visualizations that can be driven directly from within ECL; a more natural way to embed foreign functions into ECL for Python, R, Java and JavaScript (and even a neat teaser on embedding SQLite); and so many more that it would be difficult to enumerate them in a single blog post. An extensive list of new features, changes and enhancements is available as a sticky post in our forums.

What makes this event even more exciting is the fact that a number of new integrations are available too: How would you like to use SQL queries through a web services interface, to access published Roxie ECL queries? WsSQL now makes this possible! And how about a complete deployment of an entire HPCC environment with a click of a mouse under Canonical’s Ubuntu Juju? The free HPCC charm is available in the Juju store! And if you use Nagios, the new seamless integration allows direct monitoring of HPCC infrastructure with your existing monitoring system.

This is a great milestone for the HPCC community; so head over to the Downloads section now, and don’t forget to use the forums to tell others about your experience, get answers to your questions and socialize with the rest of the community.

And Happy Birthday, HPCC Systems!

Does Big Data Equal Big Security Problems?

On May 1, the report “Big Data: Preserving Values, Seizing Opportunities” was released by the Executive Office of the President in response to a directive from President Obama to examine the impact of big data technology on society at large.

Big data is a movement in its own right, but after the White House report was released there was an influx of media articles questioning big data, and in particular, the safety of big data. Questions began to circulate not only on how secure the data is, but also on the privacy rights of the citizens whose very personal information is stored in this data.

For example, GigaOM posted an article titled “It’s 2014. Do You Know Where your Data Is?” and a LinkedIn blog that declared, “Big Data has Big Problems.” Both stories addressed the security and privacy of information stored and utilized for big data purposes.

Recently, I gave an interview to discuss how LexisNexis Risk Solutions protects our data and customer information as well as address the recent concerns raised in the media regarding big data, and what we are doing on our HPCC Systems big data platform to maximize security. Below is the Q&A transcript from the interview.


Moderator: Why is LexisNexis’ information safe and then why should customers trust us?
Flavio Villanustre: We are secure because we have a comprehensive security and privacy program in place. We continuously test our security posture and address any weaknesses that we may find, and we also have state of the art controls around access to the data.

But security goes far beyond just technology. Security isn’t just about making your software secure so that it cannot be breached, you need to also make your processes secure. You need to provide awareness to your employees so that they don’t get socially engineered, for example, and apply controls around security across your entire organization. It’s not just technology, and it’s not just customer support or customer operations.

What are some specific things we do to protect the data?
We do a lot of things. On the administrative side, we have a comprehensive set of security policies, procedures and standards. We provide security through training and we require that employees have knowledge of our policies. We do internal and external (independent third party) risk assessments to ensure that every part of the process is assessed from a risk standpoint, and that controls are commensurate with the exposure and overall risk.

We also employ technical controls, which are things like firewalls and network segmentation, data loss prevention systems and anti-malware protection systems. Administrative and technical controls complement each other.

In addition, we draw a distinction between “security” and “compliance.” Too often, we see organizations “checking the box” to assure themselves that they are compliant with respect to some framework. Our viewpoint is: if we do a very good job with information security (at a technical level and at a process level), compliance more or less takes care of itself.

In general, controls can be classified in three categories: preventive, detective, and corrective. The most important ones are the preventive controls, which are put in place to prevent an attack or mitigate a threat. You need to keep in mind that it is very difficult to undo the damage when sensitive data is leaked or exposed. This is why we put significant emphasis on preventive controls and prioritize prevention. At the same time, we have to always be prepared for the possibility that data might be leaked or exposed, which is where detective controls come in handy: the sooner we can detect an intrusion or malicious attack, the more we can minimize the potential damage, as opposed to detecting the event weeks or months later.

How does the security of HPCC Systems compare to that of other big data systems like Hadoop?
[HPCC Systems] is a lot better. We have been doing this for 20 years, and we built HPCC Systems specifically to support our business. As a result, many of the requirements that we have in our business around security and privacy are also incorporated into the HPCC Systems platform.

By contrast, Hadoop was designed from the ground up to allow people to store massive amounts of data on relatively inexpensive hardware and then be able to perform searches like the "find the needle in a haystack" type of problem. This is what it has been focused on for the longest time, rather than on making it work securely. So the security on Hadoop came as an afterthought, and even the basic authentication mechanisms weren't deployed until a couple of years ago.

I saw that HPCC Systems went open source in 2011. Does that cause any vulnerability issues?
Not at all! On the contrary, this increases security through transparency and collaborative development. Generally speaking, the open source movement – started back in the 80s – is about releasing not just the compiled (or binary) version of the software, but also the programming language version of the software, the source code from which the binary code is generated. Rather than making things less secure, the increased number of reviewers and contributors can identify and correct security issues much faster with their combined efforts, making HPCC Systems even less vulnerable.

When legacy systems are converged onto the HPCC Systems platform, are there any concerns that one needs to be aware of? Some leading journals suggest that technology has progressed so quickly that legacy systems may have issues with merging into a new platform?
It's true that technology has changed, and that it changes very rapidly. It’s no longer a time where we have a new generation of technology every 20 years. Now, a new generation happens every two or three years.

These days, big data encompasses many things: social media, videos, free text – which are not well supported by legacy systems. When you’re trying to deploy HPCC Systems in that environment, there are two ways you can do it. You can say, “Well, I’m going to phase out all my legacy systems completely and move all the data,” but that might not be viable for many companies since they may need to continue to operate while they do that migration, so integration is needed. As with any other integration process, there is some complexity, which could generate some potential security issues in between the interfaces while you are trying to connect one system to the other and move data. That is why, when we migrate legacy systems onto the HPCC Systems platform, we pay close attention to the security controls that may need to be implemented or refactored as a function of the migration.

Do we ever have risks of losing data?
Well, the reality is that everyone does. There is the myth of complete security, and it’s just that, a myth. There is no way you can say, “I’m 100 percent exempt from having any security issues ever.” Of course, we believe, based on applying best-in-class security practices, having thorough and comprehensive monitoring and surveillance and having a mature set of processes that adapts to the ever changing security threat landscape, that we have a very low risk of losing data.

Maybe I’ve been watching too many political action and sci-fi shows lately, but I was watching 24 and my mind kind of races, which makes me ask: do we ever have anybody try to intentionally hack our systems to see if they can get in?
We don’t have any controls against aliens from outer space, but we do try to intentionally hack into our systems. We have security assessments and penetration testing and we regularly perform both, internally and externally. In addition to our own security experts - who are very well-trained and very knowledgeable of these practices - we also have third parties that we hire on a regular basis, to attempt to break into our systems.

Unfortunately, the number of hackers or wannabe hackers is very large, and you can bet they will be creative in trying to find new ways of breaking into your systems. So, if you’re not performing continuous surveillance and scanning for new threats and attack methodologies, it will potentially expose you in the long run.

What are some challenges that you see with protecting big data in general in the future? And what do you think we need to do to combat those threats?
First of all, we need to draw a distinction between security and privacy. I think the biggest challenges are going to be potentially around privacy, which is a very touchy subject because there is no universal concept of privacy. This distinction is necessary because some people often confuse a perceived lack of privacy with a lack of security.

What is considered acceptable privacy in the US might not be acceptable privacy in Germany or China. What’s privacy to us today is not the type of privacy we used to consider 20 years ago, and it won’t be the privacy 20 years from now. It’s important to have a clear understanding of what society accepts as privacy, to ensure that you don’t go beyond that. You never want to be seen as creepy, and I can’t define exactly what creepy is, but you will know it when you see it.

There can also be better education. For example, when you go to install an application on your smart phone, and the list of permissions pops up, the list is so long, you probably don’t even read it. You say, “I need the application, so accept.” Well, I don’t think that is the right way of doing it. There needs to be some bullet points, saying, “Look, you give me your data, and for giving me your data, I will use your data in this way.” It needs to be clear and understandable by anyone.

At the end of the day, there needs to be an exchange of value between the data owner (the person) and the data user, and that value needs to be measurable and tangible. I am glad to allow others to use my data, if that gives me access to better credit, simplifies my access to online services and makes my children safer in the streets.

Adventures in GraphLand V (Graphland gets a Reality Check)

If you are familiar with other graph systems and have not read any of my previous blogs, please do not read this one. If you do, you will rapidly convince yourself that for some inexplicable reason KEL is far more complex than all the other systems. This is not actually true; as the early blogs show, KEL can play simple just like the other systems. However, KEL is also designed to tackle the real world. Today we are going to abandon graphs completely, go back to square one, and ask ourselves the question: “what do we actually know?”

To ask this question I am going to go back to one of our previous datasets (the person dataset) and reproduce it the way it would usually turn up:

This is a lot like the dataset I showed you last time; still has unique IDs, still has names and still has ages. The difference is that in real world data we cannot assume that there is one, single, unique value for each property that an entity has. There are really four distinct cases you need to think about:

  1. There is really one true value for a given property – and someone has handed it to us
  2. There is really one true value for a given property – and we have been given multiple possible values for it
  3. There are really multiple possible values for a given property
  4. There are really multiple possible values for a given property and we have been given multiple possible values for each of them

#1 is the case we have been dealing with until this point. It is the one that most simple systems will focus upon but outside of sample data it very rarely occurs unless you have created a system to perform significant knowledge engineering before you hand it to the system you intend to use for knowledge engineering. As we have already seen; KEL can work in this mode but it is really designed for the harder problem.

#2 is the case of something like ‘Age’ – each person has a particular age at a given point in time – but not all the reports of that age will (necessarily) be modern and they will thus not all be the same.

#3 and thus #4 speak to questions such as: “which states have you lived in” or “which music do you like” – a single person could easily have multiple values for these, again recorded over a length of time and with different encoding schemes.

So how does all of this relate to our favorite graph language? Way back in the beginning I introduced the ENTITY statement that allowed you to declare which properties a particular entity had. What I did not discuss was the fact that in that same entity declaration you can specify the MODEL (or behaviors) of the individual properties, and I did not dig too deeply into some of the ‘weird stuff’ in the results we were getting back. Now is the time to dig in. First, a simple piece of KEL:

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

And here is the output (on our new dataset)

A few things to observe:

  1. There is still one row for each entity (UID) even though for some of them we have multiple input rows
  2. Each ‘property value’ (age and name in this case) has its own list of values (child dataset for those of you familiar with ECL).
  3. Each unique value for a given field is only stored once. Thus in our input dataset UID=3 has three instances of the name ‘HELEN’ but it only appears once; however it has a record count of 3
  4. Each ‘property value’ has a record count and also the complete entity has a record count (at the end of the row). Thus UID=4 has a record count of 1 for MATT, 2 for MATTHEW and a total count of 3

Whilst you probably think this all makes sense, here is a question for you – according to the data above, how many 16-year-olds are there in the data? How many 18-year-olds?

Finding the KEL answer is remarkably easy:

Person := ENTITY(FLAT(UID,Name,INTEGER age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person{age,COUNT(GROUP)};

QUERY:Find(_age) <= Person(_age);

This shows a new KEL capability: inter-entity aggregation. Previously when we have seen the {} syntax the UID has always been inside the {}. When the UID is there we are simply projecting and aggregating inside each entity. With the UID removed, we are still projecting and aggregating, but it is the entities themselves we are working with.

So; in the syntax above I am grouping by each value of age and counting the number of entities within each group. Here is the result from our test data:

To the specific question; KEL counts UID=4 as being both 16 and 18. UID=1 and UID=3 are both counted as being 46 although both are also counted as being 0.

In KEL a multi-valued property has all of the values (once).

If you use the ‘Find’ query above and pass in 16 you will get back UID=4 and UID=5.

I suspect that by this point; some of you are shaking your head, somehow something is wrong. Somebody cannot have two different ages at the same time!
But suppose we were talking about phone number? Or ‘state lived in’? Suddenly having multiple values for that field would be considered perfectly reasonable.

Of course it comes down to the four groups I mentioned at the beginning; for each field the KEL programmer has the opportunity to express “how many values of this property should there really be?”. By default KEL assumes the data it is handed is correct; we have provided multiple values for age and therefore it has treated it as a multi-valued property!

Is this repairable within KEL? Yes – with logic statements. Here is a rework of the previous code in which the KEL programmer has begun to do the work of computing single values from multiple reports (case #2).

Person := ENTITY(FLAT(UID,Name,INTEGER in_age=age));

USE File_Person(FLAT,Person);

Person: => Age := in_age$Max;

QUERY:Dump <= Person{UID,{in_age},Age,{Name}};

In the first line I have now moved the ingested data into a multi-valued field called in_age. KEL does not enforce this but I recommend using a consistent naming convention for ‘things you are bringing in that you know have the wrong model’.

Now, you should recognize the logic statement in the middle and you may even be able to read it; but I want to re-read it with ‘new eyes’.

Firstly the scope is

Person:

This means all of the values for a given entity are in scope for this logic statement.

Age :=

Is now creating a new single value

in_age$Max

finds the maximum of all the ages within the entity (assuming ages only ever go up; which is true for everyone other than my mother).

The query statement has a new feature – some of the columns are inside extra {}. This means ‘put out a multi-valued property AS a multi-valued property’. The result is:

This looks a lot like the previous except we now have an Age column that is single valued per entity and behaves exactly as the nice case we handled in the first four blogs. In the ‘real world’ you would not use the in_age property once you had produced the clean version of it.

We have seen that KEL can handle the clean case like everyone else and it actually has a mechanism to handle the dirty case with full control for the user. Now for some magic: KEL has the ability to handle the dirty case as if it were the clean case: automatically.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(*));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

The new piece here is the MODEL statement. MODEL(*) is identical to MODEL(UID,NAME,Age), it means ‘every field is a single valued field.’ If we ingest the data using this model we get this:

Every field value has been hammered down to one single value and everything can be treated ‘nicely’. The cost of this automated magic is that some data has been ‘scrunched’. If you look at the table above the _name_flags and _age_flags actually tell you when the scrunching has occurred. The ‘2’ values are those places where multiple values were hammered into one. KEL has also generated a ‘SanityCheck’ attribute that will give complete stats about how good (or bad) the automated scrunching was.

Whether or not you want automated scrunching will depend upon your own character and the nature of the data and in particular it can depend upon the nature of the individual fields. This can be captured in the KEL. As an example; suppose we consider it reasonable to have multiple names but only one age.

Person := ENTITY(FLAT(UID,Name,INTEGER Age),MODEL(UID,{Name},Age));

USE File_Person(FLAT,Person);

QUERY:Dump <= Person;

Producing

Using this mechanism you can force ‘nice world’ and then selectively improve the conversion as you see fit.

For some the foregoing might have been a little dull, for others it was probably a little scary. I am, however, unrepentant. On many systems GraphLand would be a fantasy world of amazing things largely detached from the reality of large scale Big Data analytics. GraphLand is not designed to be a fantasy world; it is designed to bring large scale graph analytics to reality.

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

Adventures in GraphLand IV (The Subgraph Matching Problem)

I once opened a meeting of extremely senior folks all of whom specialized in large scale graph analytics by stating: “There is no such thing as an interesting large scale graph problem; but there are lots of huge graphs containing interesting small problems that we can’t find.” My statement was clearly hyperbolic and exceptions do exist although I would contend it is rather closer to the truth than most people would like to admit. For reasons covered under ‘The Wayward Chicken’ a huge highly connected graph begins to look a lot like a population average. Therefore if our graph analytics are really going to buy us anything we need to be able to narrow down the area of the graph across which we are working.

One of the most important graph narrowing mechanisms is the Subgraph Isomorphism problem or more informally “subgraph matching” problem. The core idea here is that I have a little graph in my hand of interconnected facts; I want to find a place (or places) in a much larger graph where that little graph appears. This seemingly simple ability is the engine behind our patented MEOW technology that maps fragments of entities in documents to social networks. It is the same idea that drives our advanced person search capabilities. In short if there was a single graph problem that we cared about being able to do well; it would be this one. This leads me neatly to the second major feature of the Subgraph Matching problem; the general case is NP complete. Put another way, it is excruciatingly difficult to tackle this problem well.

This all leads us to the obvious question; why have we progressed this far through our tour of dataland without touching upon the subgraph matching problem? The answer is twofold. Firstly, there was my slightly weaselly use of the phrase ‘in the general case.’ Whilst some sub-graph matching problems are horrible, some are relatively benign. I believe that through judicious precomputation and intelligent query design it is possible to render many, many subgraph matching problems tractable. Secondly, KEL is designed so that the “subgraph matching problem” (and most other graph queries) just ‘drops out’ once the logic elements have been defined. The purpose of the rest of this blog entry is to persuade you that this is the case.

For the following we will use the datasets defined in the previous three blogs. To start simply: “Find all the people that own or drive a Honda”:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') ) );

Here the outer level of our filter expression is ‘Person’; so we have already confined the search to person nodes. You can think of the ‘inside’ of a filter condition as: “go to the node and then try…” So for each Person node we will walk down the PerVeh link if there is one, follow it to the Veh entity and check whether it is a Honda. A quick review of the Person->Vehicle table we produced in the last blog shows that the answer should be Helen and Matthew; running the code produces:

Suppose we decide we want to hunt for people that own or drive a Honda AND a car which is blue. A quick glance at the above code probably tells you that:


QUERY: Find <= Person( EXISTS( PerVeh.Veh(Make='HONDA') )
                      ,EXISTS(PerVeh.Veh(Colour='BLUE') ) );

Returns the one person able to satisfy both conditions.

EXISTS can be swapped out for COUNT to ensure a given entity has multiple matching items. For example: “who has at least two older relatives?”


QUERY: Find <= Person( COUNT(Relationship.whoelse(Person.Age<Age)) >= 2 );

The only ‘hard’ part in the above is the Person.Age<Age; who is Person and who is ‘local’? Remember the way to read this expression is that you start at the node represented by the outer level, each time you meet () following an entity or association you are evaluating that node, and you travel whenever you see a period. Each time you meet () following an aggregate word (COUNT, EXISTS, SUM etc) you are about to follow multiple paths to accumulate a result. So here we start at a given person node. We scan down each of the relationships and we travel to each whoelse in turn. We then evaluate that whoelse node. So inside those () the ‘Age’ refers to the local node (the whoelse). To find the age of the original person we were scanning we need to use Person.Age. Thus to encode ‘has an older relative’, the relative has an Age > Person.Age.

Using exactly these techniques it is possible to specify an arbitrary (constant) subgraph and return all the matches to it. The slightly tedious part is that you have to write new code each time; bit of a shame you can’t just parameterize everything. Of course you can:


QUERY: Find(_Age,Name2) <= Person( _Age, Relationship.whoelse(Name=Name2) );

Illustrates a number of language features:

  • You can parameterize queries; in 0.4 it simply means that a parameterized module is created to execute the query. In 0.5 it will also cause keys to be built and a roxie service to be available.
  • The _Param case is special; it is for the very common case where you wish to filter an entity based upon the equivalent of a property value. Person( _Age, … ) means ‘all those people whose Age equals the passed-in _Age value’.
  • The nested relationship query shows the more ‘regular’ use of a parameter (Name2).
  • For this query I have also made use of the implicit EXISTS. A dataset used in a Boolean context evaluates to false if the dataset is empty.

The foregoing hopefully illustrates that the subgraph matching problem can be specified within KEL; what we have not done is made any use of the properties that we have spent three blogs manufacturing. So here is a harder question: “which people co-own a vehicle?”


QUERY: Find <= Person( PerVeh.Veh(Person.PerVeh.Type='OWNS', nOwners > 1) );

Only contains things you have seen before. Notice the Person.PerVeh.Type to pull information from up-path and the nOwners that behaves as if it belongs to vehicle (even though it was manufactured by travelling to other nodes). The latter point is crucial; one of the driving costs of the NP complete nature of the subgraph matching problem is the complexity of the query. Having given ourselves a way to render the run-time part of the query simpler we have significantly improved performance.

That about wraps up the subgraph matching problem; I do however wish to finish with a warning:

We have seen the code to find people with two older relatives. I suspect that if I were to set an exercise to find people that have access to more than two vehicles you would produce:


QUERY: Find <= Person( COUNT(PerVeh.Veh) > 2);

Unfortunately the answer returned would be wrong. David has two relationships with vehicle 2 and thus when the full path PerVeh.Veh is evaluated for ‘David’ three combinations in total are found (one for vehicle 1, two for vehicle 2). I could in this case, quite rightly, simply dismiss the problem as an evil of late typing (which it is). There are legitimate cases though where you need to know the end-points you have reached on a graph walk independent of the route taken to get there. Here is the syntax you need; what it means and how it works is the subject of a later blog.


QUERY: Find <= Person( COUNT(PerVeh{Veh}) > 2 );

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

Adventures in GraphLand III (The Underground Network)

If you have digested my blog on The Wayward Chicken then as you gaze out upon GraphLand you should no longer be thinking of a land of clearings connected via a maze of pathways. Rather we have a modern industrial landscape; every node is a factory with a rail link to the neighboring factories. Raw goods (data) regularly travel towards each node to be processed into a more refined good which may in turn become the raw material for a neighboring factory. Conceptually (though not factually) the process happens for every node synchronously until everything is complete. Only then are orders serviced and that process usually just amounts to assembling the parts already made (quite possibly all within a single factory).

Our scene readily maps onto pure graph theory which would like us to believe the world consists of nodes and edges (balls and sticks) and that everything can be mapped onto this model. In truth it can; but it requires a shift of mind that I hope will repulse you as much as it does me.

In our opening data model we had people and relationships between people and all our data fitted nicely into the model. How do I add the concept of ‘Vehicles’ to GraphLand? To stick with my stick and ball model I now have to give every node an extra property the ‘Type’ which would be ‘Person’ or ‘Vehicle’. I then have to add vehicle properties to my node (which will be null for people). I now have to adjust all of my processing logic to ensure I am filtering (at process time) down the types of entity I am interested in.

In fairness there are benefits to this ‘model’; it is a data model which by definition can handle anything. The cost is that the logic around how to process the entity types has to be pushed down into the queries and worse all of the resolution and checking happens at run time. To return to our analogy; every factory could process different things a different way and at different times and we need to enter each factory to find out what it does. This could actually be a motivator behind the Wayward Chicken programming method.

The KEL model, and thus GraphLand, behaves very differently. In our first blog I introduced Entities and Associations and then used the term interchangeably with nodes and edges. This is not strictly correct. Within KEL you can define multiple entity types and multiple types of associations inter or intra entity. You can think of this as having multiple layers of factories burrowing underground with various pipes and tubes carrying goods from layer to layer. Within a single layer everything is homogenous and synchronous and the transport between layers is also homogenous and synchronous but the result is that complex heterogeneous queries can be achieved.

Enough theory – let’s get to the data.

First with apologies to all car buffs out there; here is the Bayliss interpretation of all interesting facts about our vehicles.

Secondly, a relationships file between people and vehicles. Note this file contains information regarding two different types of relationship between Person and Vehicle, Owns and Drives. The ‘per’ column is the entity ID for the person, the ‘veh’ column is the entity ID for the Vehicle.

These can be declared and used within GraphLand as easily as before:
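
(The declarations and USE statements for the new entity and association are reproduced below from the full program listing near the end of this post.)

Vehicle := ENTITY(FLAT(UID,Make,Colour=Color));
PerVeh := ASSOCIATION(FLAT(STRING Type,Person Per,Vehicle Veh));

USE File_Veh(FLAT,Vehicle);
USE File_PerVeh(FLAT,PerVeh);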

It should be noted that if I personally were doing this I would have split my relationship file into two and imported one as an association ‘owns’ and one as ‘drives’; I would then have had rather nicer association names, compile time checking and nicer queries. However, for the sake of showing off some KEL features, I am bringing in the associations with the types intact and we will validate them at query time.

We may remember from our previous blog that associations for a given entity are now accessible as if they were child records of each entity. Thus the PerVeh association now exists for people and for vehicles.

This simple query provides a list of people with the vehicles they drive or own:


QUERY:Dump1 <= Person{UID,Name,
                PerVeh.Veh{UID,Make,Colour}};

Produces

Of course we do not really want to use KEL simply to dump out data we already had. What we really want is to compute new information. So here is a question for you: “How many vehicles is each person in the list the sole owner of?”

I would imagine that most of you started:

COUNT(PerVeh.Veh(< Ouch, this is going to hurt> ))

You actually can make it work that way; but it is much better if you decompose the problem. The first question one should ask is: “Which vehicles have only one owner?” This becomes very simple:


Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));

Within our underground network of vehicles it is very easy to count the number of people relationships. The new thing is that the associations can have values (such as type) and they can be part of filters too.

Then within our surface entity (people) it becomes fairly easy to compute the number of vehicles a person is the sole owner of:


Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

The COUNT and the nOwners=1 should be familiar. The slight ugliness is the checking that we are dealing with ownership rather than simply driving; you may remember my saying I didn’t like late-typing!

We have to check Type=’OWNS’; the problem is that the filter is in the scope of the vehicle which doesn’t have a Type attribute. KEL allows you to pull data from anywhere on the path to a data element; but you have to state the place in the path you are pulling the data from. Hence: Person.PerVeh.Type=’OWNS’.

As an aside: KEL is a language under development, we do have an issue open to allow a rather prettier form:


Person: => nWholeOwned := COUNT(PerVeh(Type='OWNS').Veh(nOwners=1));

Which works if a filter is resolvable before the full path is known.

Putting this all together we have:


Person := ENTITY(FLAT(UID,Name,INTEGER Age));

USE File_Person(FLAT,Person);

Vehicle := ENTITY(FLAT(UID,Make,Colour=Color));
PerVeh := ASSOCIATION(FLAT(STRING Type,Person Per,Vehicle Veh));

USE File_Veh(FLAT,Vehicle);
USE File_PerVeh(FLAT,PerVeh);

Vehicle: => nOwners := COUNT(PerVeh(Type='OWNS'));
Person: => nWholeOwned := COUNT(PerVeh.Veh(Person.PerVeh.Type='OWNS',nOwners=1));

QUERY:Dump <= Vehicle{Make,Colour, nOwners,
                               PerVeh.Per{Vehicle.PerVeh.Type,UID,Name}};

QUERY:Dump1 <= Person{UID,Name, nWholeOwned,
                               PerVeh.Veh{UID,Make,Colour}};

Dump (a dump of vehicles with people as the children) yields:

Notice the use of the path walking syntax to annotate the child person records with the information from the link.

Dump1 is then rather simpler:

There is much more fun we can have with this data as we inch into some queries that are necessarily second order. First though I want to tackle the sub-graph matching problem; or rather point out that we have actually tackled it already.

Adventures in Graphland Series
Part I - Adventures in GraphLand (Part I)
Part II - Adventures in GraphLand (The Wayward Chicken)
Part III - Adventures in GraphLand III (The Underground Network)
Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

JOINS...

Different Types of Joins


Matching records from multiple data sources is one of the fundamental operations you need to process data. As you would expect ECL makes it easy – but the number of options can be bewildering. The following aims to guide you through the different options, and explain when they are appropriate.

The "simplest" way to match records between two datasets is to pair each record in one dataset with each record in the other dataset, and then filter the pairs to extract the matches. This is what JOIN(,ALL) does. Unfortunately as well as being simple it is likely to be very slow unless the datasets are very small. If the first dataset has M records, and the second N, then it will typically execute in O(MN) time. Thankfully if the match condition contains any equalities then there are much better approaches.

One of these approaches is to sort both datasets by the equality fields, and then step through the two sorted datasets in lockstep. You have the overhead of sorting the datasets, which is typically O(M ln M) + O(N ln N), and finding the matching pairs is typically O(M+N). This is generally much quicker than the exhaustive approach above, and is the normal approach used in Thor.

Roxie uses a different implementation by default – instead of sorting the two datasets it starts by creating a hash table from the second dataset. Then each row from the first dataset is looked up in the hash table to find the matches. With the current implementation the "right" dataset is still sorted (to simplify duplicate handling and many lookups), but the order of the left is preserved. If the right dataset is relatively small then this will be significantly faster than sorting both datasets. This type of join can be selected in Thor by adding a ",MANY LOOKUP" attribute.

If Roxie uses a LOOKUP join by default why doesn’t Thor? The limitation of a LOOKUP JOIN is that the entire right dataset needs to be held in memory. That is a reasonable assumption to make for Roxie, where the datasets are generally small, but it isn’t for Thor.
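
To make this concrete, here is a minimal hedged sketch of a lookup join in ECL; the records, datasets and field names are invented for illustration and are not taken from any real system:

// A small code table (right dataset) and a larger fact dataset (left dataset).
CodeRec := RECORD
  STRING2  code;
  STRING20 descr;
END;
FactRec := RECORD
  UNSIGNED4 id;
  STRING2   code;
END;

codes := DATASET([{'AA','Alpha'},{'BB','Beta'}], CodeRec);
facts := DATASET([{1,'AA'},{2,'BB'},{3,'AA'}], FactRec);

OutRec := RECORD
  UNSIGNED4 id;
  STRING20  descr;
END;

OutRec JoinThem(FactRec l, CodeRec r) := TRANSFORM
  SELF.id    := l.id;
  SELF.descr := r.descr;
END;

// The right dataset is small enough to hold in memory on every node,
// so ask Thor to use the lookup implementation rather than sorting both sides.
withDescr := JOIN(facts, codes, LEFT.code = RIGHT.code, JoinThem(LEFT, RIGHT), MANY LOOKUP);

OUTPUT(withDescr);

With only two rows in codes a plain LOOKUP would also do; MANY LOOKUP is shown simply because it is the form named in the text above.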

So far the discussion has assumed it is easy to compare a pair of rows, but it is only efficient if the rows are present on the same node in the cluster. Different methods are used in Thor to ensure the rows are present on the same node:

  • ALL joins and LOOKUP joins clone all the rows from the right dataset onto each of the nodes. This is only really efficient if the right dataset is reasonably small.
  • The sorted joins normally globally sort the first dataset, and use the distribution of records created from the first sort to distribute the second dataset to match.
  • An alternative is the HASH join. This distributes both datasets by a hash of the equality fields, and then performs local sorts after the datasets have been distributed.

If the ECL user knows already that the rows from the datasets must be on the same node, then they can opt to use a LOCAL variant of the join to avoid any redistribution. Sometimes, if the distribution of the left dataset is already known, the code generator will automatically distribute the right dataset to match, and then perform a local variant of the join.[+]
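
Continuing the invented sketch above, this is roughly what the hash-distribute-then-local-join pattern looks like when written out by hand (again, all names are hypothetical):

// Distribute both sides by a hash of the equality field so that matching
// rows land on the same node, then join locally on each node.
distFacts := DISTRIBUTE(facts, HASH32(code));
distCodes := DISTRIBUTE(codes, HASH32(code));

localDescr := JOIN(distFacts, distCodes, LEFT.code = RIGHT.code, JoinThem(LEFT, RIGHT), LOCAL);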

So in summary…

  • If the join condition has no equalities then you need to use ALL.
  • If the right dataset is small you can use MANY LOOKUP.
  • Otherwise use a global sort.
  • You can use HASH to change the way distributions occur. (Sometimes this improves the performance.)
  • If matching rows are already on the same nodes you can add ,LOCAL.

But what if you don’t know how large the right dataset is? Using a standard join will work whatever the size of the datasets, but if the right is small then a lookup join would be significantly more efficient. One way to optimize for small datasets is to use conditional code based on an estimate of the size of the dataset. Unfortunately it makes the graphs ugly and confusing, and has to be pessimistic to avoid jobs failing because of insufficient memory. Why can’t it be done automatically? That’s where the new Thor SMART join comes in.

The SMART join processing code starts off assuming the right dataset is small, but gradually falls back to implementations that can cope with larger numbers of records. It goes through the following sequence:

  • Initially it uses a MANY LOOKUP, duplicating the rows onto each node.
  • If it runs low on memory it uses a hash of the equality fields to distribute each row to a particular node. It then distributes the left dataset to match that distribution and uses a LOCAL MANY LOOKUP join on each node.
  • If there is still insufficient memory for the local hash table, the implementation falls back to a local join which sorts both sides of the re-distributed datasets (essentially a HASH join).

There is little penalty from using a SMART join, but potentially a significant advantage for small datasets. If you are joining to a dataset of unknown size then it is worth experimenting by adding ",SMART" to your join. It is quite likely to become the default join implementation for Thor in a future version.
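
Using the same hypothetical definitions as the earlier sketch, the experiment suggested above is simply:

withDescrSmart := JOIN(facts, codes, LEFT.code = RIGHT.code, JoinThem(LEFT, RIGHT), SMART);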

(Note, a SMART join can’t be the default implementation of a MANY LOOKUP join because a smart join may change the distribution of the left dataset. The code generator tracks the distribution of each dataset and uses that information to optimize the query – those assumptions and optimizations would be broken if the dataset was redistributed.)


Pseudo Joins


There are a few examples of ECL usage which are effectively joining two datasets together, but the join is cunningly disguised...

  • An implicit dataset lookup:

    otherDataset(myfield=searchvalue)[1].x

    This occurs when a dataset is searched for a row whose field matches a value, and a field is then extracted from the matching row. If the dataset is small you are much better off using the new DICTIONARY support (see the sketch after this list). Otherwise it would be much better to JOIN against the dataset (using a LEFT OUTER join if there may be no matching element, and KEEP(1) if there might be more than one).
  • Filtering by existence in a set:

    myDataset(searchValue IN SET(otherDataset, value))

    This code is searching to see if a value is contained in another dataset. Again if the dataset is small a dictionary is likely to be much more efficient. If it is large then an INNER JOIN (with KEEP (1)) would often be more appropriate.
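
Here is a minimal hedged sketch of the DICTIONARY forms of both patterns above; the dataset and field names (otherDataset, myfield, x) simply mirror the pseudo-code in the list and are not real definitions:

// Hypothetical small lookup dataset.
LookupRec := RECORD
  STRING2  myfield;
  STRING20 x;
END;
otherDataset := DATASET([{'AA','Alpha'},{'BB','Beta'}], LookupRec);

// Instead of otherDataset(myfield = searchvalue)[1].x :
CodeDict := DICTIONARY(otherDataset, {myfield => x});
descr := CodeDict['AA'].x;          // single keyed lookup

// Instead of searchValue IN SET(otherDataset, somefield) :
KeyDict := DICTIONARY(otherDataset, {myfield});
isKnown := 'BB' IN KeyDict;         // existence test

OUTPUT(descr);
OUTPUT(isKnown);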

Other JOIN options.


Sometimes most rows match relatively few rows in the other dataset, but there are a few instances of the match condition which result in very large numbers of records (e.g., john smith). Many of the options on a join are there to help cope with these pathological matches.

  1. LIMIT(n)

    If there are more than n matches for a particular row then abort the job.
    Note: LIMITs and ATMOSTS are only applied to the portion of the join condition that uses equalities. (Often called the hard match condition in some of the documentation.)

  2. LIMIT(n, SKIP)

    If there are more than n matches for a particular row exclude any matches for this row from the output.

  3. ATMOST(n)

    If there are more than n matches for a particular row then treat it as if it didn’t match anything.
    It is very similar to LIMIT(n, SKIP), but the difference is that for a LEFT OUTER join ATMOST will generate an output record. This allows you to return information about rows that had too many matches in the results.
    It is worth adding ATMOST(1) to a LEFT OUTER join that can only match 1 record. It will allow the code generator to remove the join if it isn’t needed (see the sketch after this list).

  4. ATMOST({cond1, cond2, .., condn},n)

    If there are more than n matches for a particular row, apply each of the extra optional conditions until the number of matches is below the threshold.
    This allows you to apply a fairly broad join criteria for most pairs of rows, but add extra criteria if the equality fields are very common (e.g., some last names are much more common than others, so add extra conditions on first name, or other criteria). It is often used when the match condition also includes an edit distance match function to ensure the number of pairs processed by the edit distance function is reasonable.

  5. KEEP(n)

    Only generate output records for the first n rows that match the full join condition.
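
As a final hedged sketch (still using the invented facts/codes/JoinThem definitions from earlier), this is how a couple of these options typically appear on a join:

// LEFT OUTER keeps unmatched left rows. ATMOST(1) records that at most one
// match is expected (more than one would be treated as no match), which lets
// the code generator remove the join if it turns out not to be needed.
// LIMIT(1000, SKIP) could be used instead to drop rows with pathologically
// many matches rather than aborting the job.
safeJoin := JOIN(facts, codes, LEFT.code = RIGHT.code, JoinThem(LEFT, RIGHT),
                 LEFT OUTER, ATMOST(1));

OUTPUT(safeJoin);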

There are a few other options which are detailed in the language reference manual, which can help in fairly specific circumstances, but generally you should only add options for good reason.
(See http://hpccsystems.com/download/docs/ecl-language-reference)

[+] This new optimization has occasionally caused problems where ECL programmers have added a DISTRIBUTED() operation into some ECL to remove a warning, but the distribution expression wasn’t correct. They had used DISTRIBUTED(dataset, some-arbitrary-expression) instead of DISTRIBUTED(dataset), which indicates the dataset is distributed in an unspecified way. The moral of the story – never lie to the ECL compiler or it might come back to bite you.

Adventures in GraphLand (The Wayward Chicken)

The other evening I was distracted from a lengthy debug session by a commotion coming from the chicken coop. I raced out the door fearful that some local predator had decided to attempt to steal some dinner. I discovered no such thing; rather ‘the girls’ were all agitating to get out of the coop. In my concentration I had forgotten that it was past the time they are normally allowed out for their daily forage. I immediately opened the coop door and with an explosion of feathers the girls hurtled from the coop in all directions across the yard; each one individually convinced that the particular location that they had chosen to run to was liable to yield the most and tastiest bugs. Last out of the coop stalked our rooster with a grim look on his face; he would be spending the next hour trying to round his charges into the same place so that he could protect them from predators.

This tale of bucolic bliss is not provided to amuse; rather it illustrates the behavior of most programmers when first presented with a graph to analyze. Some deep primal instinct somehow persuades the otherwise analytic mind that the best way to tackle a graph is to scamper all over the place collecting snippets of information as one travels. One of our top technologists responded to my first blog with: “How do I get as far as possible as quickly as possible? Can I go backwards and forwards along the same path? What if I run around in circles? What if I get trapped? How do I know if I wander so far that I’m about to fall off the map?” Being a top technologist his language was rather more erudite (shortest path, directedness, cyclicity, connectedness and centrality) but the intuitive meaning is identical.

I don’t know what causes programmers to do this but I suspect it may be related to the (false) Urban Myth that everyone is related to Kevin Bacon by six degrees of separation. As part of a drastically over affirmed society our take-away from this myth is that there is yet another reason why we are special. I would suggest that if everyone is within six degrees of Kevin Bacon then the correct take-away should be that there is no significance to being within six degrees of Kevin Bacon.

Rather more mathematically; let’s assume that everyone has 25 close associates (relatives, co-workers, neighbors, in-laws) and as a first approximation let’s assume that the graph spans out evenly and without cycles. Then by six degrees of separation you have 25^6 = 240M people; around the adult population of the US. If I were to say: “There are two million criminals in the US, you are in the US, therefore you might be criminal” people would think I were crazy. If I were to say that I had computed that two million of your sixth degree relatives were criminal; then you would be in for questioning. The reality is that the two statements are essentially the same.

Even if we narrow our sights far further, if you view ‘relatives’ as a social network (25 people) rather than pure blood (3-9 people) then by third degree relationships you have fifteen thousand people; you would expect around 150 of them to have done jail time simply by random chance. At a statistical level, if your graph has a branching factor of 25 then a fact regarding a second degree relative is twenty five times less significant than one relating to a first degree relative.

As such: the best and most significant information is always close to home.

This insight is one of the core drivers behind the KEL programming philosophy. Whilst some graph languages proudly declare they are ‘graph traversal’ languages; KEL is a knowledge engineering language. Specifically:

KEL exists to concentrate knowledge from the graph into the entity (not to scatter the query from the entity to the graph).

To illustrate the difference I want to tackle a question for the data presented in the last blog: “how many of your first degree relatives are the youngest person they know?” To make this easier to visualize here is the graph:

To answer the given question for ‘David’ here is a plausible graph traversal approach:

If you execute the above you will visit the following nodes in sequence:

So a query about David causes us to visit 15 nodes. As an exercise, you should try writing out the nodes visited if I wanted to know the average number of youngest relatives for the two oldest people in the graph. You should visit 29 nodes.

The KEL methodology is rather different:

Pleasingly it is rather easier to read in KEL than it is in pseudo-code!


Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

This is not a language tutorial but there are a couple of aspects of the above worth noting:

  • The ‘Person:’ defines the scope of the operation. It applies to Person nodes.
  • The space between the ‘:’ and the ‘=>’ allows a precondition for the operation to apply; that will be a later subject.

    The interesting piece occurs after the ‘:=’ so we will look at the first in detail:

    • For a given entity all of the associations are available as if they were a child dataset with the label of the association.
    • Implicitly, when in the scope of an entity, the references to any associations fix the first matching entity in the association to self.

    So,

    COUNT(Relationship)

    in the person scope would give you the number of relationships for each person.

    For this particular computation we want to filter those associations further, to those in which the OTHER person in the relationship is younger. We do this using: (whoelse.Age<Age). If there is no-one who is younger than you are then you are declared to be the youngest.

    Subsequent to this line every Person node has a new property available ‘Youngest’ which can be queried and used as any other property. Thus in the second line, we can compute the number of youngest relatives simply by counting and using the Youngest property as a filter.

    The nodes visited to compute the query for David are:

    Er, which is 22! That is 7 more visits than the normal method! Why would I go to such pains to make my life harder? Well, now trace through how many extra visits I have to make to compute the average of the two oldest people. The total should be 26; a savings of 3. Now suppose I want to know the average number of youngest relatives across the whole graph; suddenly the KEL solution begins to pull substantially ahead.

    So what is happening? In the traditional model, each computation for each entity moves away from the entity to the first degree relatives which in turn have to move to their first degree relatives to compute the answer. Thus most of the computations are performed at the second degree from the original entity being considered.

    If viewed programmatically, within the KEL model this second degree query is written as two subsequent first degree queries. Of course, the KEL compiler may or may not choose to execute the query that way, and it will almost certainly have different strategies for thor and roxie; but logically that is how it is specified.

    We should not however, view KEL programmatically. Rather we should view KEL, or at least the KEL logic statements, as a graph annotation model, or better a knowledge concentration model. With each logic statement, knowledge that could be gleaned by traveling away from a node is moved and stored within the node itself. Queries then execute against the knowledge rich graph that is produced.

    There are, of course, examples when a graph query cannot be decomposed into a succession of simpler queries. Those are slightly harder to code and significantly harder to execute; they will be the subject of a later blog. For now just keep in mind; if you can code it as a succession of near-relative encounters then you should do so. In the mean time we need to tackle heterogeneity; what do you do if not all of your entities are the same type? That will be the subject of our next blog.

    In closing I would like to present our full ‘program’ as we have it so far; and the answers:


    Person := ENTITY(FLAT(UID,Name,INTEGER Age));
    Relationship := ASSOCIATION(FLAT(Person who,Person whoelse));

    USE File_Person(FLAT,Person);

    USE File_Relat(FLAT,Relationship(who=person1,whoelse=person2)
                                 ,Relationship(who=person2,whoelse=person1));

    Person: => Youngest := COUNT(Relationship(whoelse.Age<Age))=0;
    Person: => YoungRels := COUNT(Relationship.whoelse(Youngest));

    QUERY:Dump1 <= Person{UID,Name,Age,Youngest,YoungRels
                                 ,Relationship.whoelse{Name,Age,Youngest}};

    Produces:

    Incidentally: if anyone has the time and inclination to tackle these queries in alternate graph languages I would be very interested to see the results. I would also love to see if someone wants to have a swing at this in ECL; it will help me keep the KEL compiler writer on its toes.

    Adventures in Graphland Series
    Part I - Adventures in GraphLand (Part I)
    Part II - Adventures in GraphLand (The Wayward Chicken)
    Part III - Adventures in GraphLand III (The Underground Network)
    Part IV - Adventures in GraphLand IV (The Subgraph Matching Problem)
    Part V - Adventures in GraphLand V (Graphland gets a Reality Check)

    Tips and Tricks for ECL -- Part 1 -- Bit Fiddling

    I get fairly frequent emails asking, "How can I ... (do something) in ECL?" I've saved the example code from many of these questions, so now I'll begin sharing them through this blog.

    If anybody has this type of question they'd like me to answer (in the blog), please just send me an email to richard.taylor@lexisnexis.com and I will do my best to come up with an ECL example demonstrating the solution.

    All the code examples in this series will be stored in the "BlogCode" directory of my repository. So you will see IMPORT BlogCode used frequently.

    We'll start now with several questions I've gotten about "bit fiddling" in ECL.

    A Couple of Useful Functions

    The first thing to know about "bit fiddling" in ECL is that the bitwise operators only work on integer data types. Therefore, all the following examples are designed to work with an UNSIGNED4 bitmap, although you can certainly use them with smaller bitmaps if you need to.

    We'll start with a couple of functions I wrote that are generally useful when you're working with bits:

    EXPORT Bit(UNSIGNED4 Bitmap,UNSIGNED1 WhichBit) := MODULE
      SHARED Mask := (UNSIGNED4)POWER(2,WhichBit-1);
    
      EXPORT BOOLEAN   IsOn := BitMap & Mask <> 0;
    
      EXPORT UNSIGNED4 Flip := Bitmap ^ Mask;
    
    END;

    I created a MODULE structure to contain these, since they are related functions and both take the same two parameters. They also both use the same expression, so I was able to simplify the code by defining the SHARED Mask definition used by both EXPORT functions.

    For the SHARED Mask definition, the POWER function raises the base value (2) by the exponent defined by WhichBit-1. This makes use of the mathematical fact that an exponent of zero is always 1, an exponent of one always returns the base number, and every other exponent will create the appropriate value so that the Mask (as a decimal value) will always be the value 1, or 2, or 4, or 8, or 16, etc. so that the only bit turned on in the resulting UNSIGNED4 Mask is the one bit in the ordinal position specified by the WhichBit parameter.

    The IsOn function returns a BOOLEAN indicating whether the specified bit is a 1 (on) or 0 (off). It uses the Bitwise AND operator (&) with the passed in Bitmap and the Mask to determine whether the specified bit is on or not.

    The Flip function returns an UNSIGNED4 as the new bitmap after it has flipped the value of the specified bit. If the passed in bitmap as a 1 (on) in that bit it will change it to 0 (off), and vice versa. It uses the Bitwise eXclusive OR operator (^) with the passed in Bitmap and the Mask to flip just the value of the specified bit, leaving all other bits as they were.

    Here's an example of how these are used. This can be run in any builder window:

    IMPORT BlogCode;
    UNSIGNED4 MyBitmap := 4; //or in binary -- 00000000000000000000000000000100b
    
    BlogCode.Bit(MyBitmap,1).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,2).IsOn;  //returns FALSE
    BlogCode.Bit(MyBitmap,3).IsOn;  //returns TRUE
    
    NewBitMap := BlogCode.Bit(MyBitmap,2).Flip; //turn on second bit 
      //making this -- 00000000000000000000000000000110b
    
    BlogCode.Bit(NewBitMap,1).IsOn;  //returns FALSE
    BlogCode.Bit(NewBitMap,2).IsOn;  //returns TRUE
    BlogCode.Bit(NewBitMap,3).IsOn;  //returns TRUE

    ADDENDUM

    When David Bayliss reviewed the above code, he said that I had "just inserted a floating point operation into low-level bit twiddling. This is HORRIBLE." So he suggested an alternative method that just uses the bitshift left operator (<<) along with the Bitwise AND (&) and Bitwise OR (|) operators, like this:

    EXPORT Bit(UNSIGNED4 Bitmap,UNSIGNED1 WhichBit) := MODULE
    
      EXPORT BOOLEAN   IsOn :=   Bitmap & (UNSIGNED4)(1 << WhichBit-1) <> 0;
    
      EXPORT UNSIGNED4 Flip := Bitmap | ((UNSIGNED4)1 << WhichBit-1);
    
    END;

    This method removes the need for that POWER function.

    So What Can You Do With Bitmaps?

    Someone emailed me with this request:

    "Thought you might have a trick up your sleeve to attack this type of encoding. Basically an integer value is used to represent a list of values where each bit represents a separate item.

    "In the example below, each bit set represents an index into a string to get yet another code.

    // Valid codes  " VURQLKGFEDCABIJOWY",   
    // Position      1234567890123456789
    // Number                  position              Code
    //   32 = 2^5,              5  + 1                L
    // 4096 = 2^12              12 + 1                A
    //  512 = 2^9               9  + 1                E
    //  544 = 2^5 + 2^9                               LE

    "How do we write ECL code to translate a number to a code?"

    So here's the first code I wrote to handle this specific problem:

    IMPORT $;
    
    EXPORT Bit2Code(UNSIGNED4 B, STRING32 Codes) := FUNCTION
    
    		Code32 := IF($.Bit(B,32).IsOn,Codes[32],'');
    		Code31 := IF($.Bit(B,31).IsOn,Codes[31],'');
    		Code30 := IF($.Bit(B,30).IsOn,Codes[30],'');
    		Code29 := IF($.Bit(B,29).IsOn,Codes[29],'');
    		Code28 := IF($.Bit(B,28).IsOn,Codes[28],'');
    		Code27 := IF($.Bit(B,27).IsOn,Codes[27],'');
    		Code26 := IF($.Bit(B,26).IsOn,Codes[26],'');
    		Code25 := IF($.Bit(B,25).IsOn,Codes[25],'');
    		Code24 := IF($.Bit(B,24).IsOn,Codes[24],'');
    		Code23 := IF($.Bit(B,23).IsOn,Codes[23],'');
    		Code22 := IF($.Bit(B,22).IsOn,Codes[22],'');
    		Code21 := IF($.Bit(B,21).IsOn,Codes[21],'');
    		Code20 := IF($.Bit(B,20).IsOn,Codes[20],'');
    		Code19 := IF($.Bit(B,19).IsOn,Codes[19],'');
    		Code18 := IF($.Bit(B,18).IsOn,Codes[18],'');
    		Code17 := IF($.Bit(B,17).IsOn,Codes[17],'');
    		Code16 := IF($.Bit(B,16).IsOn,Codes[16],'');
    		Code15 := IF($.Bit(B,15).IsOn,Codes[15],'');
    		Code14 := IF($.Bit(B,14).IsOn,Codes[14],'');
    		Code13 := IF($.Bit(B,13).IsOn,Codes[13],'');
    		Code12 := IF($.Bit(B,12).IsOn,Codes[12],'');
    		Code11 := IF($.Bit(B,11).IsOn,Codes[11],'');
    		Code10 := IF($.Bit(B,10).IsOn,Codes[10],'');
    		Code09 := IF($.Bit(B,09).IsOn,Codes[09],'');
    		Code08 := IF($.Bit(B,08).IsOn,Codes[08],'');
    		Code07 := IF($.Bit(B,07).IsOn,Codes[07],'');
    		Code06 := IF($.Bit(B,06).IsOn,Codes[06],'');
    		Code05 := IF($.Bit(B,05).IsOn,Codes[05],'');
    		Code04 := IF($.Bit(B,04).IsOn,Codes[04],'');
    		Code03 := IF($.Bit(B,03).IsOn,Codes[03],'');
    		Code02 := IF($.Bit(B,02).IsOn,Codes[02],'');
    		Code01 := IF($.Bit(B,01).IsOn,Codes[01],'');
    	  RETURN TRIM(Code01 + Code02 + Code03 + Code04 + Code05 
    	            + Code06 + Code07 + Code08 + Code09 + Code10 
    	            + Code11 + Code12 + Code13 + Code14 + Code15 
    	            + Code16 + Code17 + Code18 + Code19 + Code20 
    	            + Code21 + Code22 + Code23 + Code24 + Code25 
    	            + Code26 + Code27 + Code28 + Code29 + Code30 
    	            + Code31 + Code32,ALL);
    END;

    This function takes two parameters: an UNSIGNED4 bitmap, and a STRING32 containing the valid codes for each position. This is a simple "brute force" approach that will test each bit in the bitmap and either take the code value for that position or a blank string. The TRIM function is using the ALL option to remove all spaces from the concatenated result string.

    You can test the function in a builder window like this:

    IMPORT BlogCode;
    
    ValidCodes := ' VURQLKGFEDCABIJOWY';
    
    output(BlogCode.Bit2Code(32,ValidCodes),named('Bit2Code_32'));		  //L
    output(BlogCode.Bit2Code(4096,ValidCodes),named('Bit2Code_4096'));	//A
    output(BlogCode.Bit2Code(512,ValidCodes),named('Bit2Code_512'));		//E
    output(BlogCode.Bit2Code(544,ValidCodes),named('Bit2Code_544'));		//LE
    output(BlogCode.Bit2Code(10000000000000000b,ValidCodes),named('Bit2Code_10000000000000000b'));	//O
    output(BlogCode.Bit2Code(10000000010000000b,ValidCodes),named('Bit2Code_10000000010000000b'));	//GO

    Once I had solved his specific problem, I decided to expand the solution to be a more generic tool. Instead of using the bitmap to indicate a set of single letter codes, why not have it indicate a set of strings of any length (up to 4K, in this example)? That's what this next version does:

    IMPORT $;
    
    EXPORT Bit2String  := MODULE
      EXPORT Layout := RECORD
        STRING Txt{MAXLENGTH(4096)};
      END;
    
      EXPORT Func(UNSIGNED4 B, DATASET(Layout) Codes) := FUNCTION
    
        br := ROW({''},Layout);
    
    		Code32 := IF($.Bit(B,32).IsOn,Codes[32],br);
    		Code31 := IF($.Bit(B,31).IsOn,Codes[31],br);
    		Code30 := IF($.Bit(B,30).IsOn,Codes[30],br);
    		Code29 := IF($.Bit(B,29).IsOn,Codes[29],br);
    		Code28 := IF($.Bit(B,28).IsOn,Codes[28],br);
    		Code27 := IF($.Bit(B,27).IsOn,Codes[27],br);
    		Code26 := IF($.Bit(B,26).IsOn,Codes[26],br);
    		Code25 := IF($.Bit(B,25).IsOn,Codes[25],br);
    		Code24 := IF($.Bit(B,24).IsOn,Codes[24],br);
    		Code23 := IF($.Bit(B,23).IsOn,Codes[23],br);
    		Code22 := IF($.Bit(B,22).IsOn,Codes[22],br);
    		Code21 := IF($.Bit(B,21).IsOn,Codes[21],br);
    		Code20 := IF($.Bit(B,20).IsOn,Codes[20],br);
    		Code19 := IF($.Bit(B,19).IsOn,Codes[19],br);
    		Code18 := IF($.Bit(B,18).IsOn,Codes[18],br);
    		Code17 := IF($.Bit(B,17).IsOn,Codes[17],br);
    		Code16 := IF($.Bit(B,16).IsOn,Codes[16],br);
    		Code15 := IF($.Bit(B,15).IsOn,Codes[15],br);
    		Code14 := IF($.Bit(B,14).IsOn,Codes[14],br);
    		Code13 := IF($.Bit(B,13).IsOn,Codes[13],br);
    		Code12 := IF($.Bit(B,12).IsOn,Codes[12],br);
    		Code11 := IF($.Bit(B,11).IsOn,Codes[11],br);
    		Code10 := IF($.Bit(B,10).IsOn,Codes[10],br);
    		Code09 := IF($.Bit(B,09).IsOn,Codes[09],br);
    		Code08 := IF($.Bit(B,08).IsOn,Codes[08],br);
    		Code07 := IF($.Bit(B,07).IsOn,Codes[07],br);
    		Code06 := IF($.Bit(B,06).IsOn,Codes[06],br);
    		Code05 := IF($.Bit(B,05).IsOn,Codes[05],br);
    		Code04 := IF($.Bit(B,04).IsOn,Codes[04],br);
    		Code03 := IF($.Bit(B,03).IsOn,Codes[03],br);
    		Code02 := IF($.Bit(B,02).IsOn,Codes[02],br);
    		Code01 := IF($.Bit(B,01).IsOn,Codes[01],br);
    	  RETURN (Code01 + Code02 + Code03 + Code04 + Code05 
    	            + Code06 + Code07 + Code08 + Code09 + Code10 
    	            + Code11 + Code12 + Code13 + Code14 + Code15 
    	            + Code16 + Code17 + Code18 + Code19 + Code20 
    	            + Code21 + Code22 + Code23 + Code24 + Code25 
    	            + Code26 + Code27 + Code28 + Code29 + Code30 
    	            + Code31 + Code32)(Txt <> '');
      END;   
    END;

    This code uses the same "brute force" approach, but instead of building characters for a string, it defines the records that will go into the result record set. That means that instead of using the TRIM function to get rid of blank spaces, we simply append all the result records into a record set that we filter to eliminate the blank records.

    Testing this version is similar to the previous, but we pass the dataset of text values as the second parameter. Multiple bits turned on result in multiple records in the result set.

    IMPORT BlogCode;
    SetStrings := ['','The Voice','American Idol','The X Factor','Law & Order',                  
                   'Lost','Glee','The Daily Show','Revenge','Hart of Dixie',
                   'Walking Dead','True Blood','Sopranos','Game of Thrones',
                   'Downton Abbey','Poirot','Rizzoli & Isles','Suits','Swamp People',
                   'Pawn Stars','Firefly','AC 360','Fox & Friends','Hardball',
                   'Mike & Molly','60 Minutes','The Ellen Show','Elementary','Sherlock',
                   'Here Comes Honey Boo Boo','Doctor Who'];
    ds := DATASET(SetStrings,BlogCode.Bit2String.Layout);
    
    OUTPUT(BlogCode.Bit2String.Func(110011b,ds));		
    OUTPUT(BlogCode.Bit2String.Func(64,ds));		
    OUTPUT(BlogCode.Bit2String.Func(64928,ds));		
    OUTPUT(BlogCode.Bit2String.Func(0100000000000000000000000000000b,ds));

    Bitmapped Dates

    I was teaching an ECL class one day in Alpharetta, and we were having a discussion of the various ways you can store dates in ECL. Joe Keim came up with this solution.

    This function will squeeze a standard YYYYMMDD 8-byte date string into three bytes of storage in an UNSIGNED3 bitmap. This format has the added advantage that it will still be perfectly sortable, just as a YYYYMMDD string would be.

    EXPORT UNSIGNED3 Date2Bits(STRING8 YYYYMMDD) := FUNCTION
      YY := (UNSIGNED3)YYYYMMDD[1..4] << 9;
    	MM := (UNSIGNED3)YYYYMMDD[5..6] << 5;
    	DD := (UNSIGNED3)YYYYMMDD[7..8];
      RETURN YY | MM | DD;
    END;

    This Date2Bits function takes the first four characters of the string (YYYY) and casts that value into an UNSIGNED3 that is then shifted left 9 bits. The next two characters of the string (MM) are also cast into an UNSIGNED3 that is then shifted left 5 bits. The last two characters of the string (DD) are simply cast into an UNSIGNED3.

    Those three UNSIGNED3 values are then ORed together using the Bitwise OR operator (|) to create the resulting bit map, where:

    YYYY is in first 15 bits
    MM is in next 4 bits
    DD is in last 5 bits
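
    As a quick sanity check, here is a hedged worked example using the date from the test code further below ('19500827'); the arithmetic is mine, not from the original post:

    // Worked example for '19500827':
    //   YY = 1950 << 9 = 998400
    //   MM =    8 << 5 =    256
    //   DD =   27      =     27
    //   YY | MM | DD   = 998683   (the bit ranges do not overlap, so the OR
    //                              is just the sum, and fits in an UNSIGNED3)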

    You can then use this Bits2Date function to reconstruct the YYYYMMDD string from the bitmap:

    EXPORT STRING8 Bits2Date(UNSIGNED3 d) := FUNCTION
      STRING4 YY := INTFORMAT((d & 111111111111111000000000b >> 9),4,0);
      STRING2 MM := INTFORMAT((d & 000000000000000111100000b >> 5),2,1);
      STRING2 DD := INTFORMAT((d & 000000000000000000011111b),2,1);
      RETURN YY + MM + DD;
    END;

    This function uses the Bitwise AND operator (&) against a binary mask, which is then shifted right 9 bits to produce the integer value for the YYYY. The INTFORMAT function then right-justifies that integer value into a 4-byte string with leading blanks. The MM and DD values are treated the same way, except their STRING2 results are formatted with leading zeroes. The final YYYYMMDD string is a simple concatenation of the three.

    You can test these functions with code like this in a builder window:

    IMPORT BlogCode;
    d1 := '19500827';
    
    UNSIGNED3 date1 := BlogCode.Date2Bits(d1);
    date2 := BlogCode.Bits2Date(date1);
    
    ds2 := DATASET([{'InDate',d1},
                    {'OutBitVal',Date1},
                    {'Re-Date',Date2}],
                   {string10 txt,string10 val});
    							 
    OUTPUT(ds2,NAMED('Bits2Date_' + d1));

    Viewing the Bitmap

    Sometimes you just want to see what the bitmap actually looks like. You can, of course, just output the integer value, but a bitmap is best viewed in binary representation. So when Peter Vennel needed to create a string representation of a bitmap, here's my code to do that:

    IMPORT $;
    
    EXPORT STRING32 Bit2Str(UNSIGNED4 Bin) := FUNCTION 
    
      STRING1 Bin2Str(UNSIGNED1 Bit) := IF($.Bit(Bin,Bit).IsOn,'1','0');
    
      RETURN Bin2Str(32) + Bin2Str(31) + Bin2Str(30) + Bin2Str(29) + 
             Bin2Str(28) + Bin2Str(27) + Bin2Str(26) + Bin2Str(25) + 
             Bin2Str(24) + Bin2Str(23) + Bin2Str(22) + Bin2Str(21) + 
             Bin2Str(20) + Bin2Str(19) + Bin2Str(18) + Bin2Str(17) + 
             Bin2Str(16) + Bin2Str(15) + Bin2Str(14) + Bin2Str(13) + 
             Bin2Str(12) + Bin2Str(11) + Bin2Str(10) + Bin2Str(9) + 
             Bin2Str(8)  + Bin2Str(7)  + Bin2Str(6)  + Bin2Str(5) + 
             Bin2Str(4)  + Bin2Str(3)  + Bin2Str(2)  + Bin2Str(1); 
    END;

    This simple function just relies on the previously defined Bit.IsOn function to generate either a "1" or "0" character in each bit position. The result is just the concatenation of all 32 characters, showing you exactly what the bitmap looks like.

    Test with this code in a builder window and you'll see the result:

    IMPORT BlogCode;
    
    val := 532860;
    
    OUTPUT(BlogCode.Bit2Str(6));   //00000000000000000000000000000110
    OUTPUT(BlogCode.Bit2Str(val)); //00000000000010000010000101111100

    Convert to Hexadecimal

    I got an email from Jarvis Robinson, who needed a way to display text and the corresponding Hexadecimal values side by side. So I wrote this String2HexString function to do that:

    EXPORT String2HexString(STRING DataIn) := FUNCTION
    
      STRING2 Str2Hex(STRING1 StrIn) := FUNCTION
        STRING1 HexVal(UNSIGNED1 val) := 
    		      CHOOSE(val,'1','2','3','4','5','6','7','8',
    				 '9','A','B','C','D','E','F','0');
        UNSIGNED1 Char1 := (>UNSIGNED1<)StrIn >> 4;
        UNSIGNED1 Char2 := ((>UNSIGNED1<)StrIn & 00001111b);
        RETURN HexVal(Char1) + HexVal(Char2);
      END;
    	
      Rec := {STRING Hex{MAXLENGTH(1024)}}; 
      ds := DATASET(LENGTH(TRIM(DataIn)),
                    TRANSFORM(Rec,
      	            SELF.Hex := Str2Hex(DataIn[COUNTER])));
      HexOut := ROLLUP(ds,TRUE,TRANSFORM(Rec,SELF.Hex := LEFT.Hex + RIGHT.Hex));
      RETURN DATASET([{DataIn,HexOut[1].Hex}],
                     {STRING Txt{MAXLENGTH(1024)},
                      STRING Hex{MAXLENGTH(1024)}});
    END;

    This function starts with the Str2Hex function nested within, which will produce two hexadecimal characters for a single string character. It has its own HexVal function that returns the hex character for a given value. The Char1 definition uses the (>UNSIGNED1<) type transfer shorthand syntax to treat the StrIn character as an UNSIGNED1 value, so that the Bitshift right operator (>>) will work (all bitwise operations can only be done on integer types). This simply moves the left 4 bits to the righthand nibble, making the UNSIGNED1 a value from zero through 15. The Char2 definition also uses the (>UNSIGNED1<) type transfer shorthand syntax to treat the StrIn character as an UNSIGNED1 value, so that the Bitwise AND operator (&) will work. This simply removes the left 4 bits, making the UNSIGNED1 a value from zero through 15. The return value concatenates the two hex characters into the STRING2 result.

    The key to this function being able to work with any size string (up to 512 bytes, as written, but you can modify that if needed) is the DATASET definition. This form of DATASET creates a new dataset with as many records as there are characters in the DataIn parameter. More importantly, the inline TRANSFORM populates each record with the hexadecimal equivalent of each character in DataIn.

    That leaves only the ROLLUP to composite all those records into a single one with all the Hex values concatenated into a single string. The inline form of DATASET then allows a nicely formatted one-record result to present both the input text and its hex equivalent as the result.

    Testing the code this way will show its usefulness:

    IMPORT BlogCode;
    STRING dat := 'efgh\tBCD'; //x'6566676809424344'
    
    OUTPUT(BlogCode.String2HexString(dat));

    Note that the input string contains a tab (\t) character, which can be difficult to see in text mode. But looking at the result in hex mode, you can clearly see the 09 hex representation of the tab.

    That's enough for this article. We'll continue with more useful code examples in subsequent articles.
