I well remember my introduction to big data. We were at a customer site and they were taking us on a tour of their facility when we walked into the ‘sneakernet’. The sneakernet was a wonder to behold. Along the far wall was a gigantic diagram showing the dataflow of their enterprise. Between us and the wall was an array of forty cubes, much like those in many enterprise offices, until you noticed the Post-It notes. Not just one or two Post-It notes; every cube had dozens of them arrayed on boards that had clearly been designed to hold them. The diagram on the wall was also covered in Post-It notes. Even more interesting, about half a dozen people were literally walking from cube to cube, depositing Post-It notes and removing them. I was baffled until the sneakernet, which was considered to be the heart of the data factory, was explained to me.
Today when most people discuss ‘Big Data’ they tend to focus upon Volume. Something is a Big Data problem if it can be measured in Petabytes; everything else is not really Big Data. However, this large company that was entirely focused upon data did not have a major volume problem. They did have a few tens of terabytes and back in 2000 that was a bit of a Volume issue – but it wasn’t the major issue. The major issues they had were the other four Vs: Variety, Variability, Velocity and Veracity.
The company was collecting data from thousands of different places; theoretically it was all in the same format although, as one wry analyst observed, “we have several hundred different interpretations of our standard format.” Once you add variety to a data problem the code complexity increases, the need for QA increases and the chances of requiring code modifications as part of the daily workflow increase.
Variety on its own would be bad; it is even worse when compounded by variability. A process that has been running flawlessly for years can suddenly break because a data source changes. This may be a deliberate change that can be scheduled, a deliberate change that the supplier only mentions after the fact, or simply a mistake on the supplier's part. Whatever the reason, variability can mean that a downstream process fails because of un-trapped variability upstream.
Working with variety and variability is merely painful; once you add velocity the pain is exacerbated. If dozens of files are turning up in any given hour and there are very tight SLAs for the data to be integrated and available, then all the foregoing has to happen and happen fast. In this situation the pain of finding an error downstream is compounded by the need to re-run some of the dataflow to fix the error while re-running as little of it as possible.
The pain goes from exacerbated to excruciating when you add in veracity: the absolute need for the data to be correct. In a typical ‘web log tracking’ or ‘analytics’ system the cost of an error here or there is not extreme. One has to fix it as soon as possible; but as soon as possible is soon enough. In a situation where the data at a detailed level is crucial to the financial or physical wellbeing of an individual, you simply have to fix any problem now.
The sneakernet was an expensive, if brutally effective, solution to this problem. If code had to be changed then individuals plotted out which processes had to be re-run, and the sneaker people took the flow orders encoded on Post-It notes to the job executors who sat in their cubes. If data arrived, had to be re-worked or failed some QA test, then again the sneakers would spring into action. The process cost manpower; but it allowed the company to run.
So – what does any of this have to do with ECL? Sneakernet was the driving force behind one of my favorite ECL features: PERSIST. PERSIST was recently described by one of my colleagues as ‘incredibly simple’, which is true, but that is a good thing, not a bad one. PERSIST is a qualifier that can be added to any attribute and it marks a watershed in the data process.
When an attribute with a PERSIST qualifier is executed, the result at that point is saved to disk along with a checksum covering all of the data that went into the attribute and all of the code used to produce the result. When the attribute is used again, the system first checks whether any of the inputs (code or data) have changed: if they have, the attribute is recomputed; if not, the stored value is used.
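To make the check concrete, here is a minimal sketch of the idea in Python. It is not how ECL implements PERSIST: the `persist` helper, the in-memory `_store` dictionary (standing in for the files on disk) and the way the code is fingerprinted are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical stand-in for the files PERSIST writes to disk:
# name -> (fingerprint, stored result)
_store = {}


def _fingerprint(func, data):
    """Checksum the code and the input data together."""
    code = func.__code__
    h = hashlib.sha256()
    # Bytecode, constants and referenced names: a crude proxy for "the code".
    h.update(code.co_code)
    h.update(repr((code.co_consts, code.co_names)).encode())
    # The data must be JSON-serialisable in this sketch.
    h.update(json.dumps(data, sort_keys=True).encode())
    return h.hexdigest()


def persist(name, func, data):
    """Return (result, recomputed) for func(data), reusing the stored value
    when neither the code nor the data has changed."""
    fp = _fingerprint(func, data)
    cached = _store.get(name)
    if cached is not None and cached[0] == fp:
        return cached[1], False
    result = func(data)
    _store[name] = (fp, result)
    return result, True


def total_by_source(rows):
    # Example computation: sum the amounts for each source.
    totals = {}
    for source, amount in rows:
        totals[source] = totals.get(source, 0) + amount
    return totals


rows = [["a", 1], ["b", 2], ["a", 3]]
first = persist("totals", total_by_source, rows)               # computed
second = persist("totals", total_by_source, rows)              # stored value reused
third = persist("totals", total_by_source, rows + [["b", 5]])  # data changed
```

The second call returns the stored value without running `total_by_source`; the third recomputes because the input data differs. Passing a different function under the same name would force a recompute in the same way.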
Some of our large processes, which contain hundreds of graphs, each of which contains the equivalent of dozens of map-reduces, will have dozens of PERSISTs within them. Whenever we want the result of a graph, the system checks all the PERSISTs and recomputes only the bare minimum of the graph required to get the result. Further, PERSISTed attributes can be shared between multiple graphs. Therefore, if one job has already built the PERSIST for a given source, then when the next job needs the input from that source it will already exist.
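The same idea extends to a whole graph. The sketch below is again illustrative Python rather than anything ECL does internally: the `Graph` class, the node names and the shared in-memory `cache` are hypothetical. Each node's fingerprint covers its own code plus the fingerprints of everything upstream, so a change in one feed invalidates only the nodes that depend on it.

```python
import hashlib


class Graph:
    def __init__(self):
        self.nodes = {}  # name -> (func, [dependency names])
        self.cache = {}  # name -> (fingerprint, result), shared by every job
        self.ran = []    # names actually recomputed, in order

    def add(self, name, func, deps=()):
        self.nodes[name] = (func, list(deps))

    def get(self, name, sources):
        """Return (fingerprint, result) for a node, recomputing only if stale."""
        func, deps = self.nodes[name]
        upstream = [self.get(dep, sources) for dep in deps]
        code = func.__code__
        h = hashlib.sha256(code.co_code)
        h.update(repr((code.co_consts, code.co_names)).encode())
        if deps:
            # Derived node: depends on the fingerprints of its inputs.
            for dep_fp, _ in upstream:
                h.update(dep_fp.encode())
            args = [result for _, result in upstream]
        else:
            # Source node: depends on the raw data it reads.
            h.update(repr(sources[name]).encode())
            args = [sources[name]]
        fp = h.hexdigest()
        cached = self.cache.get(name)
        if cached is not None and cached[0] == fp:
            return cached
        self.ran.append(name)
        self.cache[name] = (fp, func(*args))
        return self.cache[name]


g = Graph()
g.add("feed_a", lambda rows: sorted(rows))
g.add("feed_b", lambda rows: sorted(rows))
g.add("merged", lambda a, b: a + b, deps=["feed_a", "feed_b"])
g.add("report", lambda m: sum(m), deps=["merged"])
g.add("audit", lambda m: len(m), deps=["merged"])

sources = {"feed_a": [3, 1], "feed_b": [2]}


def run(job):
    """Run one job and report which nodes had to be recomputed."""
    g.ran.clear()
    _, result = g.get(job, sources)
    return result, list(g.ran)


first = run("report")       # builds feed_a, feed_b, merged, report
second = run("audit")       # reuses everything upstream; only audit runs
sources["feed_b"] = [2, 5]  # one feed changes
third = run("report")       # feed_a is reused; feed_b, merged, report re-run
```

In this toy version the `audit` job benefits from work the `report` job already did, and a change to `feed_b` leaves `feed_a` untouched, which is the job the sneaker people used to do by hand.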
In short, in our system the ECL Agent wears the sneakers; it interacts with our metadata systems and the compiled code. If your only data problem is Volume then this may be overkill. But if any or all of the other four Vs begin to bite, check out PERSIST.