
When people first see ECL they usually spot a few programming languages lurking in the syntax: Prolog for the declarative definitions, Pascal for ‘:=’ and the modular syntax, C for the math expressions, and SQL for the independence between logical and physical data mapping. Most people miss the language that drives two of ECL’s more interesting features: SNOBOL4.

This is not particularly surprising; SNOBOL4 was released in 1967 and has since withered to nothingness, superseded by languages with a more modern flavor. There are many good reasons for SNOBOL’s demise; it had many weaknesses and idiosyncrasies. But SNOBOL did one thing very well: pattern matching.

As a compiler writer I have always had a strong interest in string and pattern matching, and SNOBOL was a major step forward from traditional regular expression models. There were two obvious improvements over regular expressions that are easy to explain. Firstly, the syntax is readable: patterns are built up over a multitude of lines before eventually being used, so there is no need to compress their meaning into an arcane collection of punctuation marks. This doesn’t really change what you can do, but it helps productivity, reliability and maintainability a great deal.

The second change was rather more radical: the syntax and features allowed for the processing of any context-free grammar. That probably means nothing to anyone who had a life during college, but to me it was important. It meant that a far broader range of things could be searched for. As an example: with SNOBOL4 (or ECL) you can search for a string that is 17 characters long and is composed of ‘a’s followed by ‘b’s followed by ‘c’s. You cannot do that with a single regular expression (at least not without enumerating every possible combination of run lengths). Put simply, SNOBOL could be used for a broader range of work.
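To make the example concrete, here is the same check sketched in Python rather than ECL (the function name is mine, and I have assumed each letter appears at least once): the run structure is an ordinary regular expression, while the overall length constraint is exactly the kind of side condition that has to live outside a pure regex.

```python
import re

def is_a_b_c_17(s: str) -> bool:
    """One or more 'a's, then 'b's, then 'c's -- 17 characters in total."""
    # The run structure is regular; the total-length constraint is the
    # side condition a single regex cannot express without enumeration.
    return len(s) == 17 and re.fullmatch(r"a+b+c+", s) is not None

# is_a_b_c_17("aaaaabbbbbbcccccc") -> True (5 + 6 + 6 = 17 characters)
# is_a_b_c_17("aaaaaaaaaaaaaaaaa") -> False (no 'b's or 'c's)
```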

Being easier to use than regular expressions and more powerful was nice, but what really ignited my imagination is that you could intersperse code with your patterns. Again, I can hear readers yawning from here, but this is game-changing. Traditional pattern matching follows a ‘fire, forget and then panic’ model: you create your pattern, let it run, and then try to figure out what it did. This can be fairly disastrous: if you determine programmatically, after the event, that a sub-part of a complex expression was an invalid match, then you may have missed a true match that the system would otherwise have found.

So, if that really is so bad, why are regular expressions popular? Because you can make them very, very efficient. A regular expression can be turned into a deterministic finite automaton (DFA): a very tight, compact piece of code that can search for the entire regular expression almost as fast as a single-character search in a normal string library. Complex pattern matching for free? Almost.
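To illustrate why a DFA is so cheap to run, here is a minimal hand-built one in Python (the pattern a+b+ and the table layout are my own illustration, not how any particular engine stores its tables): each input character costs a single table lookup, and there is no backtracking.

```python
# A hand-built DFA for the regular expression a+b+.
# States: 0 = start, 1 = seen one or more 'a's, 2 = seen 'b's (accepting).
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'a'): 1,
    (1, 'b'): 2,
    (2, 'b'): 2,
}
ACCEPTING = {2}

def dfa_match(s: str) -> bool:
    """Run the DFA over s in a single left-to-right pass."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch), -1)  # -1 is the dead state
        if state == -1:
            return False
    return state in ACCEPTING

# dfa_match("aaabb") -> True ; dfa_match("bba") -> False
```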

SNOBOL4 took the approach that functionality mattered more than performance; between its general context-free grammar and its ‘code in the pattern’ approach it could process almost anything – but very, very slowly.

When we designed the ECL parsing capability, taking the easy-to-read approach was a no-brainer. Adding the extra features to produce a context-free parser was a must-do, but we didn’t want to lose the speed. Instead we developed a fragmented deterministic finite automaton approach: a pattern consists primarily of one or more DFAs, with little pieces of compiled code in between to implement the extra capabilities. The result is that you only pay a penalty when you use features a true regular expression couldn’t provide – and only for the part of the pattern that uses them. The best of both worlds!
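A rough Python sketch of the fragmented idea (my own illustration, not the actual ECL implementation): a pattern is a list of fragments, where regex strings stand in for the compiled DFAs and callables are the little pieces of code in between. A code hook can reject a candidate mid-match, and the matcher then backtracks into the preceding fragment instead of giving up.

```python
import re

def match_fragments(fragments, text):
    """Match a list of fragments against text. String fragments are regex
    pieces (standing in for DFAs); callable fragments are code hooks that
    inspect the text matched by the previous fragment."""
    def go(i, pos, last):
        if i == len(fragments):
            return pos == len(text)
        frag = fragments[i]
        if callable(frag):
            # Code between the DFAs: may veto and force backtracking.
            return frag(last) and go(i + 1, pos, last)
        # Try candidate match lengths, longest first; backtrack into
        # shorter matches if the rest of the pattern fails.
        for end in range(len(text), pos - 1, -1):
            if re.fullmatch(frag, text[pos:end]):
                if go(i + 1, end, text[pos:end]):
                    return True
        return False
    return go(0, 0, "")

# Digits whose count is even, followed by letters:
pattern = [r"[0-9]+", lambda d: len(d) % 2 == 0, r"[a-z]+"]
# match_fragments(pattern, "1234xyz") -> True
# match_fragments(pattern, "123xyz")  -> False (hook rejects "123",
#   the matcher backtracks to "12", but "3xyz" then fails [a-z]+)
```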

Best of all, fragmented DFAs allow for one of my favorite features: VALIDATE. VALIDATE takes two parameters: a pattern and an arbitrary ECL function that returns a Boolean result. First and foremost, VALIDATE finds those strings that match the pattern (which it does at full DFA speed). Then, for each string that matches, it passes the string to the ECL function, which gets to bid on whether or not the pattern really matched. Better yet, the result of VALIDATE is itself a pattern – so it can be part of a larger pattern, and if it fails, the enclosing pattern will back-track accordingly.

Want to write a pattern to find credit-card numbers in emails? Write a pattern to find the correct number of digits, write an ECL function to verify the checksum, and then combine the two in a VALIDATE!
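In Python terms (an illustration of the division of labour, not ECL’s actual syntax; the helper names and the fixed 16-digit length are my assumptions), the idea looks like this: a fast pattern proposes candidates, and a Luhn checksum function vetoes the false positives.

```python
import re

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right,
    subtract 9 from doubles above 9, and require a total divisible by 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str):
    """VALIDATE-style search: the pattern proposes 16-digit runs at full
    speed, then the checksum function bids on each candidate."""
    candidates = re.findall(r"(?<!\d)\d{16}(?!\d)", text)
    return [c for c in candidates if luhn_ok(c)]

# "4111111111111111" passes the Luhn check; "1234567890123456" does not.
```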

For those of you with very simple searching requirements, or who enjoy writing expressions that look like a fight scene in a comic book (!!@@??***), we do have a pure dynamic regular expression capability; but for people interested in developing a library of sophisticated pattern-matching routines, VALIDATE really does change the game.