Skip to main content

Prefix Trees can be an important addition to a big-data toolbox. In two prior posts I showed how to combine a prefix tree and an edit-distance algorithm on a big-data platform for a significant performance boost. In this post, I show how to further improve performance by layering on additional pruning strategies. A reasonable expectation here is an additional performance improvement of 10% to 50% based on the real-world data and the pruning strategies you use for your data.

Leopard

Most of the power of using a prefix tree with an edit-distance algorithm is in the natural branch-pruning that automatically takes place. But the earlier you can prune a branch in a prefix tree, the greater the savings. This means improved performance and lower overall latency.

Many pruning strategies will be data-specific since each dataset can present its own unique set of opportunities. Different algorithms paired with a prefix tree will also present new opportunities. A good big-data platform will have little difficulty incorporating data-specific or algorithm-specific strategies here.

In order to explain this strategy, the post below assumes a common situation -- that there is a degree of variation in the length of the words in a target dataset. We can put that intelligence directly into a prefix tree and use it to prune branches we know we don't want. In this case it will be because a branch only has words that are too long or too short. Don't worry if this doesn't apply to your dataset. The thought process is the same even if the pruning criteria are different.

The performance gains here won't be as dramatic as in prior posts, but they are still well worth the effort. As mentioned above, a reasonable expectation here is an additional performance improvement of 10% to 50%.

Diving Into the Code

The post below represents changes to code from prior blog posts. I will walk you through each change step by step, but I suggest you start here and herewith those two prior blog posts. This post won't make as much sense without that prior context. As for the code changes below, you can find the modified code on GitHub in the verbosely-named folder "PruneUsingMaxMinChildWordLength." Be sure to read the ReadMe.MD in the main directory.

Integrating a Max/Min Pruning Strategy

We are going to need a place to store both a max and a min value. In this case, max and min represent the maximum and minimum word length of all the children underneath a node in the prefix tree. If a node in the prefix tree has a max of 7 and a min of 4 then that means all children of that node represent words that have a length of 4, 5, 6, or 7. If we have a query word of length 11 and a maximum edit distance of 2, we know that there is no way any word in that branch will be a candidate. This is because the shortest candidate we would accept would be 9 characters long (or 11 - 2). We can prune the whole branch immediately without further analysis. Because both max and min are a part of the ECL language, we're going to name our fields _max and _min. You can see these new fields below in the modifed record definition for a prefix tree.

PTLayout := RECORD
  UNSIGNED  id;       // Primary Key
  UNSIGNED  parent_id;// The parent for this node.  Zero is a root node
  STRING    node;     // Contains the payload for this node
  UNSIGNED1 _max;     // The max length of any child word for this node
  UNSIGNED1 _min;     // The min length of any child word for this node
  BOOLEAN   is_word;  // Indicates if this node is a word (an end-cap with no children)
  UNSIGNED4 machine;  // The Thor machine this prefix tree node is on
END;

When we "NORMALIZE" our dataset, we are breaking each word into its various nodes of the prefix tree. Here we want to preserve the length of the word each node comes from. I do this in the TRANSFORM used for the NORMALIZE. See the new _max and _min fields being used below.

PTLayout NormalizePTTransform(NodesLayout L, UNSIGNED4 C):= TRANSFORM
  SELF.id        := GetID((STRING)L.node_ids, C);
  SELF.parent_id := GetID((STRING)L.node_ids, C-1);
  SELF.node      := IF(C=L.node_cnt, L.word, GetNode(L.word, L.nodes, C));
  SELF._max      := LENGTH(L.word);
  SELF._min      := LENGTH(L.word);
  SELF.is_word   := IF(C=L.node_cnt, True, False);
  SELF.machine   := Thorlib.Node();
END;

Next, we need to "ROLLUP" these values while calculating _max and _min. I was using DEDUP before to get rid of duplicate nodes across multiple words. Here I use ROLLUP instead. ROLLUP will not only dedup values, but it will also preserve the data I need to keep. In this case, I'm comparing _max's and _min's in the ROLLUP's transform.

sorted_normalized_pt_ds := SORT(normalized_pt_ds, id, LOCAL);

PTLayout RollupPTTransform(PTLayout L, PTLayout R):=TRANSFORM
  SELF._max := MAX(L._max, R._max);
  SELF._min := MIN(L._min, R._min);
  SELF      := R;
END;

final_pt_ds := Rollup(sorted_normalized_pt_ds, LEFT.id = RIGHT.id, 
                      RollupPTTransform(LEFT, RIGHT), LOCAL);
//final_pt_ds := DEDUP(SORT(normalized_pt_ds, id, LOCAL),machine, id, KEEP 1, LEFT, LOCAL);

The changes to the prefix tree are done. Now when a prefix tree is generated, each node of the prefix will have a _max field and a _min field that represents the maximum and minimum word lengths for all the words under that node of the tree.

I could stop here and simply add the new pruning criteria to the LOOP below, but this would hobble any future efforts to debug what's happening with each loop iteration. So from here I add the _max and _min fields to the query dataset and pass in values from the prefix tree with each loop iteration.

QueryPTLayout := RECORD
  input_ds;
  STRING    state                := '';
  DATA      state_data           := (DATA)'ab';
  UNSIGNED  node_id              := 0;
  STRING    node                 := '';
  UNSIGNED1 _max                 := 255;
  UNSIGNED1 _min                 := 0;
  BOOLEAN   is_word              := False;
  STRING    cumulative_nodes     := '';
  UNSIGNED1 cumulative_node_size := 0;
  UNSIGNED1 current_distance     := 0;
  UNSIGNED1 final_distance       := 0;
END;

The new fields, _max and _min, need to be added to the transform used in the loop/join so that as we walk the tree, we have the latest _max and _min values from the prefix tree.

QueryPTLayout QueryPTTransform(QueryPTLayout L, PTLayout R) := TRANSFORM
  SELF.word                 := L.word;    // The query word
  SELF.state                := IF(R.is_word, L.state, CalculateLevenshteinVector(L.word, R.node, L.state));
  SELF.state_data           := (DATA)SELF.state; // The state data reformatted for easier debugging
  SELF.node_id              := R.id;      // The current node ID of the prefix tree.  
                                          // This will be joined against parent id to find the next node
  SELF.node                 := R.node;    // The text of the node
  SELF._max                 := R._max;    // The minimum word size for all the children of this node
  SELF._min                 := R._min;    // The maximum word size for all the children of this node
  SELF.is_word              := R.is_word; // If true, this is an end-cap word in the tree and has no children
  SELF.cumulative_node_size := IF(R.is_word, LENGTH(R.node), LENGTH(R.node) + L.cumulative_node_size);
  SELF.cumulative_nodes     := IF(R.is_word, R.node, L.cumulative_nodes + R.node);
  SELF.current_distance     := IF(R.is_word, L.current_distance, GetMinDistance(SELF.state));
  SELF.final_distance       := IF(R.is_word, GetFinalDistance(SELF.state), L.final_distance);	
END;

Finally, two lines of logic need to be added to the join embedded in the loop. This new logic is what prunes branches based on _max and _min. The range of values encompased by _max and _min need to be expanded by MAX_DISTANCE, or the maximum edit distance this query.

looped_ds := LOOP(query_ds, LEFT.is_word = False, EXISTS(ROWS(LEFT)) = True,
                  JOIN(ROWS(LEFT), final_pt_ds, 
                       LEFT.node_id = RIGHT.parent_id AND 
                       LEFT.current_distance <= MAX_DISTANCE AND
                       LENGTH(LEFT.word) <= (RIGHT._max + MAX_DISTANCE) AND
                       LENGTH(LEFT.word) >= (RIGHT._min - MAX_DISTANCE),
                       QueryPTTransform(LEFT,RIGHT), INNER));                              

OUTPUT(looped_ds(looped_ds.final_distance <= MAX_DISTANCE), NAMED('Final'));

That's it. We started this blog post with a fast way to query a very large data-set using an edit-distance algorithm and now we've significantly improved real-world performance by layering an additional pruning strategy on top. From here, additional data-specific pruning strategies can be added to further enhance performance. Check out the code examples for this post on GitHub. There you will find the three examples from prior posts (in-line dataset, Thor file, and Roxie query with a Thor data build) already modified with the max/min pruning strategy.

-Charles

Comments

Update

Permalink Submitted by ckaminski on Sat, 01/07/2017 - 4:14pm

I've created a set of function macros that can do the heavy lifting for you.  Check them out at https://github.com/Charles-Kaminski/PrefixTree

-Charles