Understanding Bloom Filters and Building a Custom One

Introduction

Let us first consider that we are performing operations on strings. To efficiently do so, we use data structures like tries, suffix trees, etc. However, the problem is that we need to store the whole string to compute the desired answers, which leads us to a complexity of O(Σni) where ni is the length of the ith string.

To optimize, we can store only the hash of the string and perform operations on it. However, if you have a large amount of data, then opting for multiple hash functions is advisable to avoid collisions caused by a fixed bucket range. This approach reduces collisions, but at a larger scale, more hash functions are needed, which may lead to space inefficiency. Therefore, Bloom filters are often preferred.

To understand more, check out my blog on HashMap here.

Understanding Bloom Filters

Let's use a simple example to understand Bloom Filters, which are probabilistic data structures that can return values as false positives and false negatives.

False Positive: If the Bloom filter indicates that two strings are the same, then they may or may not be the same.
False Negative: If the Bloom filter indicates that two strings are not the same, then they are definitely not the same.

Here is the algorithm:

First, we create an array of length n (where n varies based on the application size) and set all bits to zero.
Next, let's take a string s1 as "Apple" and pass it to a hash function. This hash function should be fast, and its values should be evenly and randomly distributed. Collisions are acceptable if they are rare; hence, we use the Murmur hash function.
The hash function will return the index values when the string is passed through it, and then we set those indexes to 1.
Now, we pass the string s2, "Banana," to compare it with string s1. We use the same hash function on s2 and check whether the returned indexes are set in the array or not. If they are set, then the strings may be the same; if not set, then they are definitely not the same.
When 70% to 80% (depending on the application) of the array is filled, we perform a reset operation, increase the size of the array, and repeat the hashing process (as the hash function depends on the array size).

As one of the indexes is not set of the banana of which the hash function is returned thus both the strings are not equal to each other.

Multi-Layer Bloom Filters

One frequently used use case of Bloom filters is to store data in the cache that is searched repetitively. However, in reality, there are many one-hit wonders on the web. Therefore, we use multilayer Bloom filters. We put the data in the cache only when it has been searched k times.

In this case, let's consider k=3. Thus, we maintain k arrays of the same size and will set each layer every ith time it is searched, where k ranges from 1 to k as shown in the figure below.

Probability of False Positive

The only caveat of Bloom Filters is false positives, which means the Bloom filter may return true even if the string is not present. This inefficiency can be attributed to the hash functions. Below is the proof of the probability of occurrence of a false positive.

A Bloom filter can be represented as F = {0,1}^m. For each word, hashing involves setting k indexes in F to 1, such as F[ h_i(w) ] = 1 where h_i is the i-th hash function in { h_1, ..., h_k }. We repeat this process for all words in the dictionary or input list.

Initially, we set all values of F to 0, so F = {0}^m. Then, after applying one hash function like h_1(w) = index, F[ index ] = 1.

The probability of finding a bit set to one is literally one in m since there is only a single 1 in F:

p1 = 1/m.

Therefore, the probability of finding a bit set to zero is the inverse:

p0 = 1 - p1 = 1 - (1/m).

After applying all k hash functions to the first word, the probability of finding a bit set to zero becomes p' = (1-1/m)^k. Then, after hashing all words or until iteration i of the list of words, the probability of finding a bit set to zero will be p = (1 - (1/m))^(k*i).

Now, the probability of getting one for a random word w, p(F[h_i(w)]=1), is the inverse of p, which is

q = 1 - p = 1 - (1 - (1/m))^(k*i).

To calculate the probability of a false positive (an error), we want the probability of getting k consecutive 1 values in F. However, this calculation assumes that each hash function generates a different result, avoiding collisions. If we assume all other h_i(w) are distinct, then

p' = 1 - ((k-1)/m).

However, this assumption may not hold in practice, especially if three or more hashes are the same, leading to inconsistencies in the formula.

Intuitively, when we don't know the exact number of ones in our filter, simulating the probability of drawing a zero can be viewed as not drawing one for all possible values of h_i(w). Therefore, drawing a zero in this scenario is like not drawing one for each hashed index, leading to a cumulative probability calculation.

However, interpreting this formula probabilistically can be challenging, especially when considering multiple events happening in succession. The conclusions and the reasoning behind the above proof is as follows:

In each of these equations, increasing the value of k (the number of hash functions) will make the probability of a false positive less likely. However, it is not computationally efficient to have an enormous value for k.

To minimize this equation, we must choose the best k. We do this because we assume that the programmer has already chosen an m based on their space constraints and has some idea of what their potential n will be. Therefore, the k value that minimizes this equation is given by:

k = (m/n) * ln(2).

Code

package main

import (
    "fmt"
    "hash"
    "math/rand"
    "github.com/google/uuid"
    "github.com/spaolacci/murmur3"
)

var hashfns []hash.Hash32

func init() {
    hashfns = make([]hash.Hash32, 0)
    for i := 0; i < 100; i++ {
        hashfns = append(hashfns, murmur3.New32WithSeed(rand.Uint32()))
    }
}

func murmurhash(key string, size int32, hashFnIdx int) int32 {
    hashfns[hashFnIdx].Write([]byte(key))
    result := hashfns[hashFnIdx].Sum32() % uint32(size)
    hashfns[hashFnIdx].Reset()
    return int32(result)
}

type BloomFilter struct {
    filter []uint8
    size   int32
}

func NewBloomFilter(size int32) *BloomFilter {
    return &BloomFilter{
        filter: make([]uint8, size),
        size:   size,
    }
}

func (b *BloomFilter) Add(key string, numHashfns int) {
    for i := 0; i < numHashfns; i++ {
        idx := murmurhash(key, b.size, i)
        aIdx := idx / 8
        bIdx := idx % 8
        b.filter[aIdx] = b.filter[aIdx] | (1 << bIdx)
    }
}

func (b *BloomFilter) Print() {
    fmt.Println(b.filter)
}

func (b *BloomFilter) Exists(key string, numHashfns int) (string, int32, bool){
    for i := 0; i < numHashfns; i++ {
        idx := murmurhash(key, b.size, i)
        aIdx := idx / 8
        bIdx := idx % 8
        exist := b.filter[aIdx]&(1<<bIdx) > 0
        if !exist {
            return key, idx, false
        }
    }
    return key, 0, true
}

func main() {
    dataset := make([]string, 0)
    dataset_exists := make(map[string]bool)
    dataset_notexists := make(map[string]bool)

    for i := 0; i < 500; i++ {
        u := uuid.New()
        dataset = append(dataset, u.String())
        dataset_exists[u.String()] = true
    }

    for i := 0; i < 500; i++ {
        u := uuid.New()
        dataset = append(dataset, u.String())
        dataset_notexists[u.String()] = false
    }

    for i := 1; i < len(hashfns); i++ {
        bloom := NewBloomFilter(int32(10000))

        for key, _ := range dataset_exists {
            bloom.Add(key, i)
        }

        falsePositive := 0
        for _, key := range dataset {
            _, _, exists := bloom.Exists(key, i)
            if exists {
                if _, ok := dataset_notexists[key]; ok {
                    falsePositive++
                }
            }
        }
        fmt.Println((float64(falsePositive) / float64(len(dataset))))
    }
}

This code implements a Bloom filter with multiple hash functions to check for the existence of items in a dataset. It calculates the false positive rate for items that should not be in the dataset but are erroneously identified as present. I chose Go because of its built-in Murmur hash code, rather than writing the hash function from scratch, and the choice of language and hash function may vary according to the situation.

If we are using a bloom filter with m bits and k hash function, insertion and search will both take O(k) time. In both cases, we just need to run the input through all of the hash functions. Then we just check the output bits.

To truly understand the space complexity of the bloom filter, you have to first choose your parameters. You could make a bloom filter with k\=1 and it would just be a hash table that ignores collisions. However, you would have a very large m if you wanted to keep your false positive rate low. The space of the actual data structure (what holds the data) is simply O(m)

Applications

Akamai uses Bloom filters for caching, as 75% of the requests are one-hit wonders.
Big data systems like Apache Cassandra and HBase use Bloom filters for quick data retrieval.
Bloom filters are used to check whether a password is weak or not.
Meta uses Bloom filters for optimizing data retrieval.

Final Thoughts

Bloom filters excel in memory efficiency and approximate membership queries but come with a risk of false positives. Hash maps offer exact key-value retrieval with zero false positives but may consume more memory, especially for large datasets. The choice between the two depends on the specific requirements of your application.

Resources

Wikipedia page on Bloom Filters: Wikipedia page on Bloom Filters
Brilliant page on Bloom filters implemented using hashlib in Python: Brilliant page on Bloom Filters
Twitter Space Recording of Arpit Bhayani on Bloom Filters: Twitter Space Recording of Arpit Bhayani on Bloom Filters
Meta Blog on Ribbon Filter: Practically Smaller than Bloom and XOR

Understanding Bloom Filters and Building a Custom One

Introduction

Understanding Bloom Filters

Multi-Layer Bloom Filters

Probability of False Positive

Code

Applications

Final Thoughts

Resources

Comments

More from this blog

Understanding webRTC from it's first Principles

Cryptography and Digital Signatures

Tackling the Thundering Herd Problem and Designing Idempotent API Endpoints for Payments at Stripe with Rate Limiting

Implementing a Custom HashMap in Java

Command Palette

Introduction

Understanding Bloom Filters

Multi-Layer Bloom Filters

Probability of False Positive

Code

Applications

Final Thoughts

Resources

Comments

More from this blog