Bloom Filter - Introduction

Bloom Filter has become a hot topic in interviewing.

A bloom filter is a space-efficient probabilistic data structure designed to tell you an element either possibly in set or definitely not in set. That’s saying, false positive matches are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

How to store a value

An empty Bloom Filter is an bit array of m bits, all set to 0.

To store a value, there must also be either

  • k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

    Typically, k is a constant, much smaller than m (k << m), which is proportional to the number of elements to be added; the precise choice of k and the constant of proportionality of m are determined by the intended false positive rate of the filter. Feed the value to each of the k hash functions to get k array positions.

  • or one hash function which hashes bits of the value and returns k array positions.

    • there’re several ways to achieve this, we’ll talk about this later

To add an element, Set the bits at all these positions to 1.

How to query a value

To query for an element (test whether it is in the set), get the k array positions by either

  • feeding it to each of the k hash functions
  • or that single hash function which returns k positions.

If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted.

If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom Filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.

Space and Time Advantage

Space

While risking false positives, Bloom Filter has a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of those require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes).

However, Bloom Filter does not store the data items at all, and a separate solution must be provided for the actual storage. Linked structures incur an additional linear space overhead for pointers. A Bloom Filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.

However, if the number of potential values is small and many of them can be in the set, the Bloom Filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom Filters with k = 1

Time

Bloom Filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set.


No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom Filters. In a hardware implementation, however, the Bloom Filter shines because its k lookups are independent and can be parallelized.

To understand its space efficiency, it is instructive to compare the general Bloom Filter with its special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (when k > 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.

Application

I’ll discuss the implementation details of these Bloom Filters later.

Hadoop

Hadoop Github Mirror

Cassandra


Reference: