Bloom filters – What are they and how do they work?
A Bloom filter, named after its inventor Burton Howard Bloom, is a data structure that can be used to perform a cheap test for the potential presence of a particular value, in a way that is much faster than looking up the value in an index, requiring much less storage than the index would. Note the “potential” there. The Bloom filter can tell you for certain if a value is not present, but it cannot say for certain that a value is present, only that it may be present. In other words, using a Bloom filter to determine whether a value is present may give false positives, but never false negatives.
How do Bloom Filters work?
There are more details in the Wikipedia article (see the link above), but basically a Bloom filter works by hashing each value in the set you are searching, storing a bit in a table to indicate that hash value (modulo the table size) has been seen. When looking up a value, if the bit corresponding to the hash of the value you are searching for is set, then the search value may be present (or another value that hashed to the same bit position is). If the bit is clear, then the search value is definitely not present. In order to decrease the likelihood of false positives, each value is hashed multiple times (using different seeds) and therefore multiple bits are set for each value in the table. When looking up a value, if any of the hashes generated for your search value lead to an unset bit in the table, then the search value must be absent. If all of them lead to a hit, then there is a strong probability that the search value is present.
Why would you use a Bloom filter?
Looking up a value in an index is a very common operation in typical ECL code, especially in ROXIE queries, but it’s also relatively slow, requiring a lot of data to be loaded from disk and processed. Using a Bloom filter to see whether a value exists in the index before taking the time to perform the index lookup can therefore provide a significant performance advantage. Because a Bloom filter never gives a false negative, if the Bloom filter indicates that a value is not in the index, there’s no need to do the lookup.
One situation where bloom filters can make a large difference is when you are using super keys to support incremental updates to an index. Here you have a main index containing the bulk of the data and multiple smaller indexes which contain incremental updates. Unfortunately, finding matches in the index requires that each of the indexes in the superkey are searched, so the time taken can be proportional to the number of indexes in the superkey. Bloom filters allow many of the incremental index parts to be skipped, since the incremental updates only include a small subset of the values, significantly reducing the cost of using incremental keys.
How can you use a Bloom filter in ECL?
There has been a Bloom filter bundle available for ECL for many years, but it’s not easy to use in conjunction with an index lookup. But in HPCC Systems 7.0.0, Bloom filter lookup technology has been built into the index lookup code so that it will be applied automatically. The BLOOM keyword can be used when building an index to specify that a Bloom filter (or filters) should be constructed while building the index. No special syntax is required when using an index. If the appropriate Bloom filter information is present in the index, it will be used automatically.
Multiple Bloom filters and trailing field Bloom filters
Note that I said:
‘The BLOOM keyword can be used when building an index to specify that a Bloom filter (or filters) should be constructed while building the index.’
A single index can have multiple associated Bloom filters used for different fields (or combinations of fields) in the index. An index search will be checked against any applicable bloom filters before looking in the full index. Which Bloom filters are applicable will depend on the filter conditions used for the index search. If an exact match condition has been specified for a field and a Bloom filter for that field was constructed when the index was built, then that Bloom filter will be checked.
Specifying multiple fields when adding a Bloom filter to an index allows more specific combinations to be checked. For example, if you had a bloom filter on Firstname and another on Lastname, and you were searching an index of Premiership football players for “Richard Chapman”, then using separate Bloom filters would say “Yes, there is a match for Richard” and “Yes, there is a match for Chapman”, and so would not save any lookup in the index. But a Bloom filter constructed on the combined Firstname, Lastname pair would most likely return “No match”. However, the Bloom filter used in the latter lookup would be significantly larger.
Bloom filters are not restricted to leading fields of the index, although they are restricted to keyed fields. You cannot specify a payload field as part of a Bloom filter. Bloom filters on trailing fields may be useful in conjunction with WILD filters, to reduce the pain of a wildcarded filter lookup. But remember that when the Bloom filter does indicate a match (or a potential match), the wildcarded index lookup will still occur, and will be just as painful as it always was.
Overhead and times not to use one
Of course, nothing comes for free in computer science and with a Bloom filter there is a tradeoff to be made between the size of the table required to store the bits, the number of different hashes to compute for each value and the probability of giving a false positive. Fortunately, there is a simple formula that can be used to compute the required table size and number of hashes, given the number of distinct values to store and the desired probability of a positive answer being incorrect. Even more fortunately, the ECL index build activity will apply this formula for you automatically. You just supply the probability value and the build process determines the rest (it can determine the number of distinct values as it is building the index). A larger value for the probability will give a smaller table and faster checks, but may mean that more full index lookups are performed unnecessarily. A smaller value for the probability will conversely mean that the Bloom lookup is more expensive (in both space and time) but that fewer unnecessary index checks are done.
Even though a Bloom filter check is very cheap compared to a full index lookup, there are times when it makes no sense to do one. For example, when using a value that you know is always present in the index, or where almost all values that are searched for are present. In these cases, you can just say BLOOM(FALSE) on the index build and no bloom filters will be created.
At present, in the HPCC Systems 7.0.0 beta release, we default to adding a Bloom filter for the first field in every index. Whether this is a sensible default remains to be seen (but it does ensure that the code gets a good workout!). To limit the storage usage from high cardinality fields, we will not create a bloom filter if the cardinality of the field (or field combination) for a Bloom filter exceeds 100000 per index part. This limit can be overridden by specifying LIMIT(n) on the Bloom attribute of the index build.
A concrete example
Suppose you have an index used to lookup from a customerID to retrieve information about recent orders:
rec := RECORD unsigned customerID; unsigned date; unsigned orderID; END; myindex := INDEX(mydata, {customerId, date}, {orderId}, ‘keys::orderIndex’); And you want to build indexes every hour so that recent info can be searched.
There are typically two options in ECL for updating these indexes.
- Rebuild the whole index every hour. This may take a while and/or require a large Thor cluster, but has the advantage that the lookup will be fast as only one location will need searching (and thus reading from disk).
- Build partial indexes of recent data, rolling them into the main index periodically and adding them to a superkey so that any lookups in the superkey will search all the partial keys. The advantage is that the partial indexes are much faster to build, but the disadvantage is that searching multiple subkeys takes longer than looking up in just one. Typically it takes N times as long where N is the number of keys in the superkey, as there are N times as many locations in the indexes that need to be read (this also means that the index cache will be filled at N times the rate).
However by using a BLOOM attribute when building the partial indexes:
BUILDINDEX(newdata, { customerId, date}, {orderId}, ‘keys::partialkey’, BLOOM(customerId));
The lookups in the partial keys (where a match will be unlikely) will be largely avoided (and so will the associated cache churn). It may not be worth having a BLOOM filter specified on the main rolled-up index, since in this application you would expect most lookups to find a match there.
If you were looking up by customerId,date rather than just customerId, you can specify a bloom filter on both fields:
BUILDINDEX(newdata, { customerId, date}, {orderId}, ‘keys::partialkey’, BLOOM(customerId, date));