API

  • bf.add(x): adds x to the data structure.
  • x in bf: tests whether x is in the data structure.

Why not use Set or Dict?

  • Bloom filters are more space efficient: they can take less memory than the keys themselves. For example, 1M 8-byte keys need at least 8 MB, while a Bloom filter with 10 bits per element needs only 1.25 MB.

Cons

  • can’t store associated data.
  • does not support deletions.
  • It is a probabilistic data structure: x in bf may return false positives, but never false negatives.

Applications

  • Spell checkers (40 years ago): add the dictionary to the filter. If a word is in the Bloom filter, it is highly likely to be a correctly spelled word.
  • Lists of forbidden passwords, e.g., passwords that are too common.
  • Modern applications: routers, which see a lot of incoming packets and need fast, memory-efficient lookups.

How does it work?

  • We have a data set S of size |S|.
  • We keep an array of n bits.
  • That gives $b = \frac{n}{|S|}$ bits per element.
  • We use k hash functions. To add an element, set the bit at each of its k hashed positions; to test membership, check that all k of those bits are set.
import numpy as np

class BitArray:
    # Packs n bits into a uint8 buffer, one byte per 8 bits.
    def __init__(self, n):
        # (n >> 3) + 1 bytes is enough to hold n bits.
        self.array = np.zeros((n >> 3) + 1, dtype=np.uint8)

    def set(self, i):
        # Split i into a byte index and a bit offset within that byte.
        i, j = divmod(i, 8)
        self.array[i] |= (1 << j)

    def get(self, i):
        i, j = divmod(i, 8)
        return bool(self.array[i] & (1 << j))
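Here is a quick sanity check of the bit array; the indices below are arbitrary.

ba = BitArray(16)
ba.set(10)
assert ba.get(10)
assert not ba.get(3)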
import xxhash

def get_hashes_for_str(x, k):
    # Double hashing (the Kirsch–Mitzenmacher trick): only h1 and h2 are
    # computed afresh; the remaining hash functions are just linear
    # combinations of h1 and h2.
    
    assert k > 1
    
    h1 = xxhash.xxh64_intdigest(x, seed=42)
    h2 = xxhash.xxh64_intdigest(x, seed=84)
    hashes = [h1, h2]
    
    for i in range(2, k):
        hashes.append(h1 + i*h2)
    return hashes

class BloomFilter:
    def __init__(self, n, k):
        self.array = BitArray(n)
        self.k = k
        self.n = n
        
    def add(self, key):
        for h in get_hashes_for_str(key, self.k):
            self.array.set(h % self.n)
            
    def __contains__(self, key):
        for h in get_hashes_for_str(key, self.k):
            if not self.array.get(h % self.n):
                return False
        return True
text = "correct incorrect".split()
bm = BloomFilter(10, 5)

for word in text:
    bm.add(word)
for word in text:
    assert word in bm
assert "notcorrect" not in bm

Analysis

There is a trade-off between error rate and space required. The data structure is useful when there is a sweet spot on the trade-off curve.

Assumption: The hash functions are independent and they distribute the data uniformly.

Out of n bits, focus on a particular bit. What is the probability that it has been set?

It is 1 minus the probability that the bit was not set by any of the k hash functions of any of the |S| elements.

$p = 1 - (1 - \frac{1}{n})^{k|S|}$.
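As a quick numeric sanity check of this formula, the sketch below uses arbitrary illustrative values for n, |S|, and k.

# Throw k*|S| darts at n bits; the fraction of set bits should be
# close to the formula's prediction (~0.39 for these values).
n, s, k = 1000, 100, 5
print(1 - (1 - 1/n)**(k*s))

rng = np.random.default_rng(0)
bits = np.zeros(n, dtype=bool)
bits[rng.integers(0, n, size=k*s)] = True
print(bits.mean())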

As can be seen in the plot below, $1+x$ can be approximated by $e^x$ when $x$ is close to zero. For all $x$, $e^x \ge 1+x$, so the approximation overestimates $1+x$.

import matplotlib.pyplot as plt
x = np.linspace(-2, 2)
plt.plot(x, 1+x, label="$1+x$")
plt.plot(x, np.exp(x), label="$e^x$")
plt.legend();

[Figure: $1+x$ and $e^x$ plotted over $[-2, 2]$; they agree near $x = 0$, and $e^x$ lies above $1+x$ everywhere.]

Thus, $p \approx 1 - e^{\frac{-k|S|}{n}} = 1 - e^{\frac{-k}{b}}$. Remember, $b = \frac{n}{|S|}$ is the number of bits per element.

As $b \to \infty$, $p \to 0$.

A false positive occurs when all k probed bits happen to be set, each with probability p. Thus, the probability of a false positive is $\epsilon = p^k = (1-e^{\frac{-k}{b}})^k$.

How to set k?

Set k optimally: fix b, then choose k to minimize the error. Using calculus, the optimum is $k \approx (\ln 2) b$.
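We can check this numerically; the sketch below picks an arbitrary b and grid-searches over k.

# eps(k) = (1 - e^(-k/b))^k should be minimized near k = (ln 2) * b.
b = 10
ks = np.linspace(0.1, 20, 2000)
eps = (1 - np.exp(-ks/b))**ks
print(ks[np.argmin(eps)], np.log(2)*b)  # both come out ≈ 6.93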

When we plug this back into $p$, we get $\epsilon \approx (\frac{1}{2})^{(\ln 2) b}$. Notice that the error rate decreases exponentially in b.

Using a little algebra we can get $b \approx 1.44 \log_2{\frac{1}{\epsilon}}$. (Hint: $1.44 \approx \frac{1}{\ln 2}$.)
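For example, sizing for a 1% false-positive rate (the target is chosen purely for illustration):

# b ≈ 1.44 * log2(1/0.01) ≈ 9.6 bits per element; k ≈ (ln 2) * b ≈ 7.
eps = 0.01
b = 1.44 * np.log2(1/eps)
k = round(np.log(2) * b)
print(b, k)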

How does theory differ from practice?

import string
import random
import numpy as np

letters = np.array(list(string.ascii_letters + string.digits))

def random_str():
    # Random alphanumeric string of length 4 to 8.
    return ''.join(np.random.choice(letters, np.random.randint(4, 9)))

Generate 1M keys and insert them into the Bloom filter. As a sanity check, I will verify that there are no false negatives. Then I will generate keys at random, skip any that are actually in the set, and measure how often the Bloom filter gives a false positive. Recall that the error rate as a function of b = n/1M is $\approx (\frac{1}{2})^{(\ln 2) b}$.

actually_there = set()
set_size = int(1e6)
while len(actually_there) != set_size:
    actually_there.add(random_str())
from tqdm.auto import tqdm
bs = [4, 8, 12, 16]
fpve_rate = []
theory = []

for b in bs:
    theory.append(.5**(np.log(2)*b))  # predicted error rate for this b
    # n = b*set_size bits, i.e. b bits per element; k set near the optimum.
    bm = BloomFilter(b*set_size, k=int(np.log(2)*b))
    for key in tqdm(actually_there, "Inserting"):
        bm.add(key)
    
    for key in tqdm(actually_there, "Sanity Checking"):
        assert key in bm
        
    fpve, total = 0, 0
    while total < 10_000:
        key = random_str()
        if key in actually_there: continue
        if key in bm:
            fpve += 1
        total += 1
    
    fpve_rate.append(fpve/total)
bs_lin = np.linspace(bs[0], bs[-1])
plt.plot(bs_lin, .5**(np.log(2)*bs_lin), label="Theory", c='b') 
plt.plot(bs, theory, 'o', c='b')
plt.plot(bs, fpve_rate, '-x', label="Practice", c='orange')
plt.legend();
plt.xlabel("bits per element")

[Figure: theoretical vs. measured false-positive rate as a function of bits per element; practice closely tracks theory.]
fpve_rate
[0.1551, 0.0211, 0.0024, 0.0003]

At b = 12, the false-positive rate is already below 1%.