Edit Distances and Optimized Prefix Trees for Big Data

In this blog post I will show you how to use a big-data platform to build a fast prefix tree and why such a prefix tree can enable very fast edit-distance queries. In a follow-up post, I'll offer query examples, performance metrics, and code examples for download.

An edit-distance algorithm, such as the Levenshtein distance algorithm, can help you find imperfect matches. Edit-distance algorithms quantify the difference between two sequences of characters by calculating how many changes (or edits) would be needed to turn one character string into the other. My elementary school teacher used to hand out brain teasers that included questions like, “What are the steps needed to turn this word into that one?” Where was Levenshtein when I needed him? These algorithms work on word pairs like my teacher's as well as on any other data that can be represented as a string of characters. Examples include sentences, vectors through an n-dimensional graph, or even something more industry-specific, like gene sequences in DNA data. Edit-distance algorithms can even be modified to work with binary data.
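
To make this concrete, here is a minimal sketch of the classic dynamic-programming approach to Levenshtein distance (the function name and structure are my own illustration, not part of the solution we build below). It sweeps one row of interim values per character of the first string:

#include <string>
#include <vector>
#include <algorithm>

// Classic dynamic-programming Levenshtein distance.
// prev[j] holds the edit distance between the first i-1 characters of 'a'
// and the first j characters of 'b'; each pass computes the next row.
unsigned LevenshteinDistance(const std::string &a, const std::string &b)
{
   std::vector<unsigned> prev(b.size() + 1), curr(b.size() + 1);
   for (unsigned j = 0; j <= b.size(); j++)
      prev[j] = j;                      // distance from "" to b[0..j)
   for (unsigned i = 1; i <= a.size(); i++) {
      curr[0] = i;                      // distance from a[0..i) to ""
      for (unsigned j = 1; j <= b.size(); j++) {
         unsigned sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
         curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, sub});
      }
      std::swap(prev, curr);
   }
   return prev[b.size()];
}

For example, LevenshteinDistance("cat", "cant") returns 1, because a single insertion turns “cat” into “cant”.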

Because of how these algorithms work, a data engineer would need to recalculate the edit distance between the query string and every entry in a target list each time a new query is performed. This quantifies the “distance” between the query string and each candidate in the list. Candidates would then be selected based on a low edit distance.

Such an approach can work well in some circumstances, but recalculating the edit distance for every entry in a list is computationally intensive. It is too slow for very large lists, too slow for even a moderate volume of queries, and almost always too slow for interactive queries where you need sub-second responses.
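
For illustration only, here is what that brute-force pattern looks like using the LevenshteinDistance sketch above (the names are mine); every query pays for a full pass over the entire list:

#include <string>
#include <vector>

// Brute force: recompute the edit distance against every entry for every
// query. The cost of each lookup grows linearly with the size of the list.
std::vector<std::string> NaiveMatches(const std::vector<std::string> &list,
                                      const std::string &query,
                                      unsigned maxDist)
{
   std::vector<std::string> hits;
   for (const std::string &candidate : list)
      if (LevenshteinDistance(candidate, query) <= maxDist)
         hits.push_back(candidate);
   return hits;
}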

Fortunately, there is a way to organize a list more intelligently to get more bang out of an edit-distance algorithm. Data engineers have been using prefix trees for some time to do just this. Prefix trees break up data into a tree structure and consolidate identical prefixes into the same nodes.

[Figure: prefix_tree.png, an example prefix tree built from a small word list]

The diagram above is an example of a prefix tree. I used the list of words shown alongside the tree to build this sample. In practice you might have a very big prefix tree with many branches and many levels.
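
For reference, a minimal in-memory prefix tree might look like the sketch below (one character per node; the type and function names are my own, and the distributed version we build later consolidates runs of characters into single nodes):

#include <map>
#include <string>

// One node per character; words with identical prefixes share nodes.
struct TrieNode {
   std::map<char, TrieNode*> children;
   bool isWord = false;   // end-cap: a complete word terminates here
};

void Insert(TrieNode *root, const std::string &word)
{
   TrieNode *node = root;
   for (char c : word) {
      if (!node->children.count(c))
         node->children[c] = new TrieNode();
      node = node->children[c];   // "cat" and "cant" share the path "ca"
   }
   node->isWord = true;
}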

Building a prefix tree is half the battle in using edit distances on big data. Once you have your data organized in a prefix tree, you then need to break up your edit-distance algorithm so you can calculate interim values as you walk the prefix tree with your query string. It's not too hard to do, and I will show you an example in my second post.

So conceptually, you have a prefix tree and you have an edit-distance algorithm that produces interim values. The last thing you need for a big-data solution is a maximum acceptable edit distance. For now, let's use an edit distance of 2. With these tools in hand, you can walk your prefix tree with your query string and your edit-distance algorithm, pruning any branch where the minimum interim value is greater than 2. With the right dataset, you will prune many branches and significantly speed up your query.
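
Here is an illustrative sketch of that pruned walk over the in-memory trie sketched above (again, the names are mine, and the maximum distance is passed in). Each node extends the previous row of interim values by one character, and a branch is abandoned as soon as the smallest value in its row exceeds the limit:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Walk the trie depth-first, carrying one row of interim edit-distance
// values per node. If the smallest value in the row exceeds maxDist,
// nothing below this node can match, so the whole branch is pruned.
void Search(const TrieNode *node, char c, const std::vector<unsigned> &prevRow,
            const std::string &query, std::string &prefix,
            unsigned maxDist, std::vector<std::string> &hits)
{
   prefix.push_back(c);
   std::vector<unsigned> row(query.size() + 1);
   row[0] = prevRow[0] + 1;
   for (unsigned j = 1; j <= query.size(); j++) {
      unsigned sub = prevRow[j - 1] + (query[j - 1] == c ? 0 : 1);
      row[j] = std::min({prevRow[j] + 1, row[j - 1] + 1, sub});
   }
   if (node->isWord && row[query.size()] <= maxDist)
      hits.push_back(prefix);                                  // word in range
   if (*std::min_element(row.begin(), row.end()) <= maxDist)   // prune test
      for (const auto &child : node->children)
         Search(child.second, child.first, row, query, prefix, maxDist, hits);
   prefix.pop_back();
}

To start the walk, seed prevRow with 0, 1, 2, and so on up to the length of the query, then call Search once for each child of the root.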

Building a prefix tree and walking it with an edit-distance algorithm is a relatively simple task for lists that fit into memory. There are examples on the Internet for most major programming languages. But what about very large lists? What about a list that's hundreds of millions or billions of entries long? What about lists that span multiple computing nodes and won't fit into memory? Then you need a big-data solution, and that solution might not be quite so obvious.

There is a naive approach to building a horizontally-scalable prefix tree for very large datasets. For this approach, you take a cumulative hash of each of the characters for each of the entries in your list. Take the word “category” for example. You would push each character of the word through a hash algorithm. As you add each character to the algorithm, you would return each character as a node along with its cumulative hash as an identifier and the cumulative hash of the previous character as a parent id. In this way the path to “t” in the word “category” would be identical to the path to “t” in the word “categories”. Because the paths are the same (“cat”), both would share the same hash id and parent hash id. You would do this independently across multiple compute nodes. From there you would de-duplicate the new list you’ve generated to represent the prefix tree.
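
A sketch of that naive record generation might look like the following, using an incremental FNV-1a hash as the cumulative hash (any hash that can be extended one character at a time would do; the function is my own illustration):

#include <cstdint>
#include <iostream>
#include <string>

// Emit one (id, parent_id, character) record per character, where id is the
// cumulative hash of the prefix ending at that character. Identical prefixes
// ("cat" in both "category" and "categories") produce identical ids, so a
// later de-duplication pass collapses them into shared tree nodes.
void EmitNodes(const std::string &word)
{
   uint64_t hash = 14695981039346656037ULL;   // FNV-1a offset basis
   uint64_t parent = 0;                       // zero marks a root node
   for (char c : word) {
      hash = (hash ^ (unsigned char)c) * 1099511628211ULL;   // FNV-1a step
      std::cout << hash << "\t" << parent << "\t" << c << "\n";
      parent = hash;
   }
}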

The downsides to this naive approach are significant. A lot of data is generated, because hash keys are quite large and you need two hash values for each character (a node id and a parent id). There is also no opportunity to consolidate nodes without revisiting all of the branches of the tree across all of the computing nodes of your system. Fortunately, there is a better way. It is possible to create consolidated nodes from the start. It is also possible to assign node ids and parent ids without resorting to hash keys. This makes lookups much faster and eliminates volumes of unneeded hash ids.

In the rest of this blog post I focus on doing just that: building optimized, horizontally scalable prefix trees for big data. I demonstrate how to take an initial dataset, identify nodes, and assign node ids. In a follow-up post, I will show you how to query a prefix tree on a horizontally scalable system using an edit-distance algorithm. From there, you will be able to apply these techniques to a wide variety of data.

Building Prefix Trees

There are just a handful of distributed operations (8 to be precise) needed to efficiently solve most big-data problems (roughly 95% to 98% of them). As we model our big-data solution together to build our prefix tree, we're going to focus on two of those operations: Iterate and Normalize. They are the workhorses at the heart of the novel part of our big-data solution.

Iterate processes the records in a dataset in order, carrying context from each record to the next.

Normalize processes each record in a dataset multiple times, creating multiple output records per input record.

These may be new concepts to you or you might associate these words with different meanings. If so, take a moment to review the definitions of iterate and normalize again.
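
If a rough analogy helps, the sketch below shows the shape of the two operations in C++ (hypothetical helpers of my own, not ECL): iterate is an ordered sweep where each step can see the previous result, and normalize is a one-to-many expansion:

#include <vector>

// Iterate: process records in order, with access to the previous result.
template <typename Rec, typename F>
std::vector<Rec> Iterate(const std::vector<Rec> &ds, F transform)
{
   std::vector<Rec> out;
   Rec prev{};                         // the first record is compared to an empty one
   for (const Rec &next : ds) {
      prev = transform(prev, next);    // context flows from record to record
      out.push_back(prev);
   }
   return out;
}

// Normalize: call the transform count(rec) times per record,
// emitting one output record per call.
template <typename Out, typename Rec, typename CountF, typename F>
std::vector<Out> Normalize(const std::vector<Rec> &ds, CountF count, F transform)
{
   std::vector<Out> out;
   for (const Rec &rec : ds)
      for (unsigned c = 1; c <= count(rec); c++)
         out.push_back(transform(rec, c));   // the counter starts at 1, as in ECL
   return out;
}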

Now let’s take a look at some example data we’ll use to apply these concepts.

Words
cant
cat
categories
category
cave
caved
cent
cept
corded
dog
drop
dropping

Notice that the list is sorted. If you compare any two adjacent words, you can begin to see where prefixes overlap. We need to turn this observation into a set of rules to build our prefix tree. To identify nodes you need to:

  1. Start with the first entry. It gets compared to nothing (this is how iterate initializes). The beginning of the first node is where the prefix overlap ends. In this case, the first letter of the first entry is the beginning of a node.
  2. Move on to the second entry and compare it with the first entry. Carry forward any nodes from the previous step that are present before the beginning of the next node. The beginning of the next node is where the prefix overlap ends.
  3. Move on to the third entry and compare it to the second. Carry forward any nodes from the previous step that are present before the beginning of the next node. The beginning of a new node is where the prefix overlap ends.
  4. Continue this way through the list.
  5. To capture any nodes missed, go through the list again but in reverse order.

Walking Through the Example

  1. The first word in the list is “cant”. This word is compared to nothing. That means the first character is where the overlap ends and is the beginning of the new node. A new node of our prefix tree is identified below with a “1”.
  2. The second word, “cat”, is compared with the first word, “cant”. “ca” overlaps in both words. The “t” in “cat” is the end of that overlap. “t” becomes the beginning of a new node, and any existing nodes in the overlap are brought forward.
  3. The next two words to be compared are “cat” and “categories”. The first three letters, “cat”, overlap in each word. The “e” in “categories” is the end of that overlap and is marked as the next node. Any nodes in the overlap are brought forward.
  4. The process continues through the word list. The list is first processed in the forward direction. Then the list is processed in the reverse direction to pick up nodes that are identified later in the process.
Words         First Pass    Second Pass (In Reverse)
cant          1000          1010
cat           101           101
categories    1011000000    1011000010
category      101100001     101100001
cave          1010          1010
caved         10101         10101
cent          1100          1110
cept          1110          1110
corded        110000        110000
dog           100           110
drop          1100          1100
dropping      11001000      11001000

Coding a Solution

To code the solution, you can use ECL's iterate operation to compare adjacent entries. From there, you can use normalize to break out all of the nodes from each record. Finally, a dedup operation will get rid of duplicates. If these operations don't make sense to you yet, don't worry. We'll go over them in the code example.

The code below simply defines a dataset of words for us to work with and outputs the dataset:

WordLayout := RECORD
   STRING   word;
END;

input_ds:= DATASET([{'cant'},{'cat'},{'categories'},{'category'},{'cave'},{'caved'},
                    {'cent'},{'cept'},{'corded'},{'dog'},{'drop'},{'dropping'}], WordLayout);

output(input_ds, named('input_dataset'));

This next section of code extends the last layout with fields we will need later. We’ll also intelligently set initial values:

NodesLayout := RECORD
   input_ds;
   // The nodes field is where we keep track of when nodes begin.  We're initializing the field to 
   // zero fill with the same number of zeros as characters in each word.
   // The extra '1' at the end is a place holder for an end-cap node representing the word itself.
   STRING    nodes        := REGEXREPLACE('.', input_ds.Word, '0') + '1'; 
   UNSIGNED4 node_cnt     := 0;        // The number of nodes in the word
   UNSIGNED4 compute_node := 0;        // The compute_node this record is on
   DATA      node_ids     := (DATA)''; // Where we pack up the node IDs right before we normalize the prefix tree records
END;
nodes_format_ds := TABLE(input_ds, NodesLayout, UNSORTED, LOCAL);
output(nodes_format_ds, NAMED('reformatted_list'));

Below we are using in-line C++ to create a function that will mark where nodes begin. We are using in-line C++ for maximum portability, but we could have used another language such as Python or Java (the HPCC platform supports a number of embedded languages). This function will be plugged into the transform that is part of our first and second Iterate operations:

STRING MarkNodes(STRING l, STRING l_nodes, STRING n, STRING n_nodes) := BEGINC++
   /// This C++ function is designed for use with ECL's Iterate to take a group of 
   /// strings and return the prefix tree node structures in a separate field.
   /// The function requires a sorted list and two passes (one in each direction) 
   /// l is for the last field, n is for the next field, out is for output
   /// nodes contain the indicator of if we should start a node
   /// An initial value of the node for each field is a string of 0 [zeros]
   /// the same length as the string in question -- an optional 1 can be
   /// added at the end as a word end-cap.
   /// CFK 08/05/2015   
   #option pure
   #body
   unsigned long i = 0;

   char * out = (char*)rtlMalloc(lenN_nodes);
   memcpy(out, n_nodes, lenN_nodes);
   while ( i < lenN){
      /// If we are past the end of the last string, or
      ///   if the characters don't match, then flip the 0 to a 1
      ///   and break out of the loop.  (Test the length first so
      ///   we never read past the end of l.)
      if (i >= lenL || l[i] != n[i]){
         out[i] = '1';
         break;
      }
      if (l_nodes[i] != n_nodes[i]){
         out[i] = '1';
      }
      i += 1;
   }
   __lenResult = lenN_nodes;
   __result = out;
ENDC++;

Here we are defining our transform to calculate where nodes begin. We also count the number of nodes we find (by counting the “1”s). Finally, we keep track of which compute node is doing the processing (useful information for debugging and for later). With the two Iterate operations, we go through the list once in each direction. The second pass carries back any nodes that were identified later in the list during the first pass:

NodesLayout GetPTNodesTransform(NodesLayout L, NodesLayout R) := TRANSFORM
   SELF.nodes        := MarkNodes(L.word, L.nodes, R.word, R.nodes);
   SELF.node_cnt     := stringlib.StringFindCount(SELF.nodes,'1');
   SELF.compute_node := Thorlib.Node();
   SELF := R;
END;

pt_nodes_1_ds := ITERATE(SORT(nodes_format_ds, -word, LOCAL), GetPTNodesTransform(LEFT, RIGHT), LOCAL);
pt_nodes_2_ds := ITERATE(SORT(pt_nodes_1_ds,    word, LOCAL), GetPTNodesTransform(LEFT, RIGHT), LOCAL);

output(pt_nodes_2_ds, named('find_pt_nodes'));

Here again we use in-line C++ along with Iterate to help us assign node IDs. Notice that we carry forward the IDs of nodes shared with the previous record, and we use the last ID assigned in the previous record to tell us what the next new node ID should be.

STRING AssignNodeIDs(STRING l, STRING l_nodes, UNSIGNED4 l_node_cnt, STRING l_node_ids,
                     STRING n, STRING n_nodes, UNSIGNED4 n_node_cnt, UNSIGNED4 compute_node) := BEGINC++
   /// This C++ function is designed for use with ECL's Iterate
   /// It assigns ID's to prefix tree nodes so that the nodes can later be broken out into 
   /// individual records with ECL's Normalize.
   /// CFK 08/05/2015   
   #option pure
   #body      
   unsigned long long last_node_id = 0;
   unsigned long i = 0;
   unsigned long node_cnt = 0;

      // The new value is empty.  
      // Could happen if there are blanks in your word list.
      if (n_node_cnt == 0){
         __lenResult = 0;
         return;
      }
      
      // Initialize "out" with node ID's from the last iteration
      //  But don't go over if there are too many node ID's from the last iteration
      char * out = (char*)rtlMalloc(n_node_cnt * 8);   
      i = (l_node_cnt < n_node_cnt) ? l_node_cnt : n_node_cnt;
      memcpy(out, l_node_ids, i * 8);

      // Find the ID assigned to the last node from the last word.
      //  This will be the starting point to assign new node IDs
      if (l_node_cnt != 0){
         memcpy(&last_node_id, &l_node_ids[(l_node_cnt - 1) * 8], 8);
      }
      else{
         // This is the starting point for node id on each compute node.
         // Setting last_node_id to a value based on the compute node's
         // unique id ensures that each prefix tree built locally on each 
         // compute node has unique non-overlapping IDs.  A signed 64-bit
         // integer has a maximum value of 9,223,372,036,854,775,807, or
         // 2^63 - 1 (we stay within the signed range to be safe).
         // Multiplying compute_node by 10^14 allows for 92,233 compute nodes
         // and at least 72 trillion prefix tree nodes on each compute node.
         // These maximums can be adjusted to fit your needs.
         last_node_id = (compute_node * 100000000000000);
      }	  

      // Find the first location where the two words don't match
      //  And keep track of how many nodes we've seen
      i = 0;
      while (i < lenL && i < lenN){
         if (l[i] != n[i]){
            break;
         }
         if (n_nodes[i] == '1'){
            node_cnt += 1;
         }
         i += 1;
      }
      
      // Start adding new node ids to "out" from where we left off
      while (i < lenN){
         if (n_nodes[i] == '1'){
            last_node_id += 1;
            node_cnt += 1;
            memcpy(&out[(node_cnt-1)*8], &last_node_id, 8);
         }
         i += 1;
      }
      
      // Add the final node id for the end-cap (referenced in comments at the top)
      last_node_id += 1;
      node_cnt +=1;
      memcpy(&out[(node_cnt-1) * 8], &last_node_id, 8);

    __lenResult = n_node_cnt * 8;
    __result    = out;
ENDC++;

NodesLayout GetNodeIdsTransform(NodesLayout L, NodesLayout R) := TRANSFORM
   SELF.compute_node := Thorlib.Node();
   SELF.node_ids    := (DATA) AssignNodeIDs(L.word, L.nodes, L.node_cnt, (STRING) L.node_ids,
                                    R.word, R.nodes, R.node_cnt, SELF.compute_node);
   SELF := R;
END;

node_ids_ds := ITERATE(SORT(pt_nodes_2_ds,  word, LOCAL), GetNodeIdsTransform(LEFT, RIGHT), LOCAL);

output(node_ids_ds, named('get_node_ids'));

Here we use Normalize to break the nodes out into individual records. We use a function to assign our node IDs and parent node IDs to individual records. We will need the parent IDs later to query our prefix tree. We use another function to get the node values. Finally, we identify whether the node represents a full word (what we called an end-cap in the code documentation). This end-cap will simplify querying our prefix tree.

PTLayout := RECORD
   UNSIGNED  id;               // Primary Key
   UNSIGNED  parent_id;        // The parent for this node.  Zero is a root node
   STRING    node;             // Contains the payload for this node
   BOOLEAN   is_word;          // Indicates if this node is a word (an end-cap with no children)
   UNSIGNED4 compute_node;     // The compute-node ID this record was processed on
END;

UNSIGNED GetID(STRING node_ids, UNSIGNED4 node_to_retrieve) := BEGINC++
   /// Function returns a numeric node id from a group of node ids
   /// CFK 08/05/2015   
   #option pure
   #body   
   unsigned long long node_id = 0;
   if (node_to_retrieve != 0){
      memcpy(&node_id, &node_ids[(node_to_retrieve - 1) * 8], 8);
   }
   return node_id;
ENDC++;

STRING GetNode(STRING word, STRING nodes, UNSIGNED4 node_to_retrieve):= BEGINC++
   /// Function returns the substring associated with a node
   /// CFK 08/05/2015
   #option pure
   #body   
   unsigned int i = 0;
   unsigned int j = 0;
   unsigned int k = 0;
   
   // find the start (i) of the node in question
   while(i < lenNodes){
      if(nodes[i] == '1'){
         j +=1;
      }
      if (j==node_to_retrieve){
         break;
      }
      i += 1 ;
   }
   
   j = i + 1;
   
   // find the end (j) of the node in question
   while(j < lenNodes){
      if (nodes[j] == '1'){
         break;
      }
      j += 1;
   }
   
   k = j - i;
   char * out = (char*)rtlMalloc(k);
   memcpy(&out[0], &word[i], k);

   __lenResult = k;
   __result   = out;
   
ENDC++;

PTLayout NormalizePTTransform(NodesLayout L, UNSIGNED4 C):= TRANSFORM
   SELF.id           := GetID((STRING)L.node_ids, C);
   SELF.parent_id    := GetID((STRING)L.node_ids, C-1);
   SELF.node         := IF(C=L.node_cnt, L.word, GetNode(L.word, L.nodes, C));
   SELF.is_word      := C = L.node_cnt;
   SELF.compute_node := Thorlib.Node();
END; 

/// Break each word record into multiple node records
normalized_pt_ds := NORMALIZE(node_ids_ds, LEFT.node_cnt, NormalizePTTransform(LEFT,COUNTER));

output(normalized_pt_ds, named('normalize_pt_ds'));

Finally, we de-duplicate our results.

final_pt_ds := DEDUP(SORT(normalized_pt_ds, compute_node, id, LOCAL), compute_node, id, KEEP 1, LEFT, LOCAL);

output(final_pt_ds, named('final_pt_ds'));

Assumptions About Your Data and Performance

The example code above makes two assumptions about the data in question. It assumes that the data is de-duplicated. It also assumes that the data is distributed so that, as much as possible, adjacent records sit on the same computing node. This keeps duplicate tree nodes to a minimum while taking advantage of the LOCAL keyword liberally used in the code above.

The LOCAL keyword tells the HPCC that it can perform certain tasks on each compute node independently, without waiting for or coordinating with other compute nodes.

Because of the LOCAL keyword, there may be a very small number of duplicate tree nodes in your final prefix tree. These would only occur at the boundary points between two compute nodes. This is handled in the code above by incorporating compute-node IDs into the node IDs of the prefix tree. The code will still work if you take out all of the LOCAL keywords, but leaving them in significantly speeds up processing for very large datasets. Ultimately, any duplicates come out in the wash when you search your prefix tree. I'll cover searching your prefix tree in my next post, where I show how to query a prefix tree on a horizontally scalable system using an edit-distance algorithm, offer performance metrics, and provide code examples for download.

 

Update (January 7, 2017)


I’ve created a set of function macros that can do the heavy lifting for you.  Check them out at https://github.com/Charles-Kaminski/PrefixTree

-Charles