Prefix trees can be an important addition to a big-data toolbox. In two prior posts I showed how to combine a prefix tree and an edit-distance algorithm on a big-data platform for a significant performance boost. In this post, I show how to further improve performance by layering on additional pruning strategies. A reasonable expectation is an additional performance improvement of 10% to 50%, depending on your real-world data and the pruning strategies you choose.
Most of the power of using a prefix tree with an edit-distance algorithm is in the natural branch-pruning that automatically takes place. But the earlier you can prune a branch in a prefix tree, the greater the savings. This means improved performance and lower overall latency.
Many pruning strategies will be data-specific since each dataset can present its own unique set of opportunities. Different algorithms paired with a prefix tree will also present new opportunities. A good big-data platform will have little difficulty incorporating data-specific or algorithm-specific strategies here.
In order to explain this strategy, the post below assumes a common situation -- that there is a degree of variation in the length of the words in a target dataset. We can put that intelligence directly into a prefix tree and use it to prune branches we know we don't want. In this case, we prune a branch because it contains only words that are too long or too short to match. Don't worry if this doesn't apply to your dataset. The thought process is the same even if the pruning criteria are different.
The performance gains here won't be as dramatic as in prior posts, but they are still well worth the effort. As mentioned above, a reasonable expectation here is an additional performance improvement of 10% to 50%.
Diving Into the Code
The post below represents changes to code from prior blog posts. I will walk you through each change step by step, but I suggest you start with those two prior blog posts. This post won't make as much sense without that prior context. As for the code changes below, you can find the modified code on GitHub in the verbosely-named folder "PruneUsingMaxMinChildWordLength." Be sure to read the ReadMe.MD in the main directory.
Integrating a Max/Min Pruning Strategy
We are going to need a place to store both a max and a min value. Here, max and min represent the maximum and minimum word length across all the children underneath a node in the prefix tree. If a node in the prefix tree has a max of 7 and a min of 4, then all children of that node represent words with a length of 4, 5, 6, or 7. If we have a query word of length 11 and a maximum edit distance of 2, we know there is no way any word in that branch can be a candidate, because the shortest candidate we would accept is 9 characters long (11 - 2). We can prune the whole branch immediately without further analysis. Because MAX and MIN are already part of the ECL language, we're going to name our fields _max and _min. You can see these new fields below in the modified record definition for a prefix tree.
PTLayout := RECORD
    UNSIGNED  id;         // Primary Key
    UNSIGNED  parent_id;  // The parent for this node. Zero is a root node
    STRING    node;       // Contains the payload for this node
    UNSIGNED1 _max;       // The max length of any child word for this node
    UNSIGNED1 _min;       // The min length of any child word for this node
    BOOLEAN   is_word;    // Indicates if this node is a word (an end-cap with no children)
    UNSIGNED4 machine;    // The Thor machine this prefix tree node is on
END;
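For readers who want to see the pruning arithmetic outside of ECL, here is a minimal Python sketch of the length-window test described above. The function name and parameters are hypothetical, not part of the ECL code; they simply restate the max/min reasoning.

```python
def can_prune(query_len, max_distance, node_min, node_max):
    """Return True when no word under this node can possibly be within
    max_distance edits of the query word, judged by length alone."""
    shortest_candidate = query_len - max_distance
    longest_candidate = query_len + max_distance
    # Prune when the node's whole [node_min, node_max] window falls
    # outside the acceptable candidate lengths.
    return node_max < shortest_candidate or node_min > longest_candidate

# The example from the text: node covers word lengths 4..7,
# query word of length 11, maximum edit distance of 2.
print(can_prune(11, 2, 4, 7))  # True -- the whole branch is skipped
print(can_prune(8, 2, 4, 7))   # False -- some child could still match
```

Note that a branch survives even if only one end of its window overlaps the acceptable range; the edit-distance computation still does the fine-grained filtering.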
When we "NORMALIZE" our dataset, we are breaking each word into its various nodes of the prefix tree. Here we want to preserve the length of the word each node comes from. I do this in the TRANSFORM used for the NORMALIZE. See the new _max and _min fields being used below.
PTLayout NormalizePTTransform(NodesLayout L, UNSIGNED4 C) := TRANSFORM
    SELF.id        := GetID((STRING)L.node_ids, C);
    SELF.parent_id := GetID((STRING)L.node_ids, C-1);
    SELF.node      := IF(C=L.node_cnt, L.word, GetNode(L.word, L.nodes, C));
    SELF._max      := LENGTH(L.word);
    SELF._min      := LENGTH(L.word);
    SELF.is_word   := IF(C=L.node_cnt, True, False);
    SELF.machine   := Thorlib.Node();
END;
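To make the normalization step concrete, here is a small Python sketch (illustrative only, not the ECL): it emits one record per node of a word, seeds _max and _min with the word's length, and makes the last node the end-cap carrying the whole word. The id scheme (joined node paths) is a hypothetical stand-in for the GetID helper.

```python
def normalize_word(word, node_texts):
    """Emit one prefix-tree record per node of `word`.  Every record
    starts with _max = _min = len(word); a later rollup widens these
    windows when several words share a node."""
    records = []
    for i, text in enumerate(node_texts):
        last = i == len(node_texts) - 1
        records.append({
            "id": "/".join(node_texts[:i + 1]),        # hypothetical id
            "parent_id": "/".join(node_texts[:i]) or "root",
            "node": word if last else text,            # end-cap holds the full word
            "_max": len(word),
            "_min": len(word),
            "is_word": last,
        })
    return records
```

For example, `normalize_word("CAT", ["CA", "T"])` yields two records, both with a length window of exactly 3, and only the second marked as a word.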
Next, we need to "ROLLUP" these values while calculating _max and _min. I was using DEDUP before to get rid of duplicate nodes across multiple words. Here I use ROLLUP instead. ROLLUP not only dedups rows, it also lets me preserve the data I need to keep. In this case, I compare the _max and _min values in the ROLLUP's TRANSFORM.
sorted_normalized_pt_ds := SORT(normalized_pt_ds, id, LOCAL);

PTLayout RollupPTTransform(PTLayout L, PTLayout R) := TRANSFORM
    SELF._max := MAX(L._max, R._max);
    SELF._min := MIN(L._min, R._min);
    SELF      := R;
END;

final_pt_ds := ROLLUP(sorted_normalized_pt_ds,
                      LEFT.id = RIGHT.id,
                      RollupPTTransform(LEFT, RIGHT),
                      LOCAL);

// final_pt_ds := DEDUP(SORT(normalized_pt_ds, id, LOCAL), machine, id, KEEP 1, LEFT, LOCAL);
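The rollup's job can also be sketched in Python (an illustrative merge by id, not the ECL semantics of rolling up adjacent sorted rows): duplicate node ids collapse into one record whose [_min, _max] window is widened to cover every contributing word.

```python
def rollup_nodes(records):
    """Collapse records sharing a node id, keeping one record per id
    while widening its word-length window to span all contributors."""
    merged = {}
    for rec in records:
        kept = merged.get(rec["id"])
        if kept is None:
            merged[rec["id"]] = dict(rec)
        else:
            kept["_max"] = max(kept["_max"], rec["_max"])
            kept["_min"] = min(kept["_min"], rec["_min"])
    return list(merged.values())
```

So a node shared by a 4-letter word and a 7-letter word ends up with _min = 4 and _max = 7, exactly the window the earlier pruning example relies on.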
The changes to the prefix tree are done. Now when a prefix tree is generated, each node of the prefix will have a _max field and a _min field that represents the maximum and minimum word lengths for all the words under that node of the tree.
I could stop here and simply add the new pruning criteria to the LOOP below, but this would hobble any future efforts to debug what's happening with each loop iteration. So from here I add the _max and _min fields to the query dataset and pass in values from the prefix tree with each loop iteration.
QueryPTLayout := RECORD
    input_ds;
    STRING    state := '';
    DATA      state_data := (DATA)'ab';
    UNSIGNED  node_id := 0;
    STRING    node := '';
    UNSIGNED1 _max := 255;
    UNSIGNED1 _min := 0;
    BOOLEAN   is_word := False;
    STRING    cumulative_nodes := '';
    UNSIGNED1 cumulative_node_size := 0;
    UNSIGNED1 current_distance := 0;
    UNSIGNED1 final_distance := 0;
END;
The new fields, _max and _min, need to be added to the transform used in the loop/join so that as we walk the tree, we have the latest _max and _min values from the prefix tree.
QueryPTLayout QueryPTTransform(QueryPTLayout L, PTLayout R) := TRANSFORM
    SELF.word       := L.word;    // The query word
    SELF.state      := IF(R.is_word, L.state, CalculateLevenshteinVector(L.word, R.node, L.state));
    SELF.state_data := (DATA)SELF.state;  // The state data reformatted for easier debugging
    SELF.node_id    := R.id;      // The current node ID of the prefix tree.
                                  // This will be joined against parent id to find the next node
    SELF.node       := R.node;    // The text of the node
    SELF._max       := R._max;    // The maximum word size for all the children of this node
    SELF._min       := R._min;    // The minimum word size for all the children of this node
    SELF.is_word    := R.is_word; // If true, this is an end-cap word in the tree and has no children
    SELF.cumulative_node_size := IF(R.is_word, LENGTH(R.node), LENGTH(R.node) + L.cumulative_node_size);
    SELF.cumulative_nodes     := IF(R.is_word, R.node, L.cumulative_nodes + R.node);
    SELF.current_distance     := IF(R.is_word, L.current_distance, GetMinDistance(SELF.state));
    SELF.final_distance       := IF(R.is_word, GetFinalDistance(SELF.state), L.final_distance);
END;
Finally, two lines of logic need to be added to the join embedded in the loop. This new logic is what prunes branches based on _max and _min. The range of values encompassed by _max and _min needs to be expanded by MAX_DISTANCE, the maximum edit distance allowed for this query.
looped_ds := LOOP(query_ds,
                  LEFT.is_word = False,
                  EXISTS(ROWS(LEFT)) = True,
                  JOIN(ROWS(LEFT), final_pt_ds,
                       LEFT.node_id = RIGHT.parent_id
                       AND LEFT.current_distance <= MAX_DISTANCE
                       AND LENGTH(LEFT.word) <= (RIGHT._max + MAX_DISTANCE)
                       AND LENGTH(LEFT.word) >= (RIGHT._min - MAX_DISTANCE),
                       QueryPTTransform(LEFT, RIGHT),
                       INNER));

OUTPUT(looped_ds(looped_ds.final_distance <= MAX_DISTANCE), NAMED('Final'));
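Putting the pieces together, one iteration of the loop's join can be sketched in Python (hypothetical structures; the Levenshtein state update is omitted to keep the focus on the length-window pruning):

```python
MAX_DISTANCE = 2  # maximum edit distance accepted for this query

def expand_frontier(frontier, children_by_parent):
    """One loop iteration: join each live query row to its child nodes,
    skipping any branch whose word-length window [_min, _max] cannot
    contain a word within MAX_DISTANCE edits of the query word."""
    next_rows = []
    for row in frontier:
        for child in children_by_parent.get(row["node_id"], []):
            qlen = len(row["word"])
            if qlen > child["_max"] + MAX_DISTANCE:
                continue  # every word under this child is too short to match
            if qlen < child["_min"] - MAX_DISTANCE:
                continue  # every word under this child is too long to match
            next_rows.append({"word": row["word"], "node_id": child["id"]})
    return next_rows
```

With an 11-character query word, a child whose window is [4, 7] is dropped before any edit-distance work, while a child whose window is [10, 12] is expanded into the next iteration.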
That's it. We started this blog post with a fast way to query a very large dataset using an edit-distance algorithm, and now we've significantly improved real-world performance by layering an additional pruning strategy on top. From here, additional data-specific pruning strategies can be added to further enhance performance. Check out the code examples for this post on GitHub. There you will find the three examples from prior posts (in-line dataset, Thor file, and Roxie query with a Thor data build) already modified with the max/min pruning strategy.
Submitted by ckaminski on Sat, 01/07/2017 - 4:14pm
I've created a set of function macros that can do the heavy lifting for you. Check them out at https://github.com/Charles-Kaminski/PrefixTree