Probabilistic data structures — Bloom Filter
If you are reading this current article ,most probably Medium will not suggest this same article to you again . But it’s not a great idea to store all the list of articles read by you and searching whether a to be recommended article is already present in the list for all the users . Instead Medium uses a data structure called “Bloom Filter” to achieve the same . Let’s delve deep into what a bloom filter is .
Disclaimer: If you know where bloom filters are used or if you are in for implementation rather than theory, please feel free to skip the next 7paragraphs.
Imagine you’re dealing with a huge list of 10 million items, and you’re often checking whether something new belongs to that list. Now, this process could get pretty slow if you have to scan the entire list every time.
Here’s the challenge: you want it fast, but you’re okay with a slight risk. You’re fine if, once in a while, the system mistakenly claims something is on the list when it’s not (probability of this mistaken claim can be controlled). However, what’s crucial is that it should never miss something that’s actually on the list.
So, you’re facing a bit of a dilemma. How do you speed things up without compromising too much accuracy? Well, here’s the clever trick: there’s something called a Bloom filter. It’s like a magic shortcut that allows you to check really quickly, even though there’s a tiny chance it might make a mistake. This trade-off, sacrificing a bit of accuracy for speed, makes your whole process much faster and more efficient. The secret here is a clever little thing we call a Bloom filter.
Now I can read your mind, when and where such accuracy tradeoffs are acceptable.
Facebook uses bloom filter to prevent what is popularly known as one hit wonder problem,
One hit wonder problem
Imagine using a social media platform with a search feature that optimizes your experience by caching frequently searched keywords.
Here’s the process: If you search for an actor like “Sabapathy” more than once, it gets stored in the cache for quick access. However, for topics like “education” searched only once, the platform avoids caching to minimize storage usage.
When you search for something new, like ‘career growth,’ ‘education’ the platform checks the Bloom filter. It might occasionally suggest that ‘career growth’ is in your previous searches, leading to caching. Conversely, it might correctly say ‘education’ is not present, avoiding caching because ‘education’ wasn’t in your searches.
In essence, the platform makes a trade-off: it occasionally caches a keyword searched only once but ensures not to miss recurring keywords searched more than once. The Bloom filter provides a quick check for keywords in your search history, even with a slight chance of false positives. This trade-off maintains a fast and efficient search experience while responding to your recurring interests.
Implementation
A bloom filter essentially needs ‘k’ hash functions and a bloom filter of size ‘m’, a bloom filter is nothing but a Boolean array (or Bit set) .
Initial Bloom filter
Insertion in Bloom filter:
For our bloom filter let’ s chose m = 15 and k = 3 (3 hash functions).
Say we want to insert a value to the list:
We will hash the value using 3 hash functions and modulo them with the size of the bloom filter (m = 15)
Here k1, k2, k3 are hash functions modulo 15 so that they return values in the range 0–14
Our choice is
k1(value) = value.hashCode() * murmurHash3(value) % 15, (k1('Mahesh') = 0)
k2(value) = value.hashCode() * 2 * murmurHash3(value) % 15 (k1('Mahesh') = 7)
k3(value) = value.hashCode() * 3 * murmurHash3(value) % 15 (k1('Mahesh') = 14)
Based on the values they return we set the bits in that particular index to 1,in the above image where we wanted to insert “Mahesh” to the list , we calculated k1,k2, and k3 for the value to be inserted are 0,6 and 14 respectively and the corresponding bits are set to 1 .
Checking membership in bloom filter:
In order to check the presence of an element we calculate the k1,k2 and k3 values of the element and if all the 3 corresponding bits are 1 then the element may probably present. In the above case k1,k2,k3 values are 0,6,and 14 respectively and all the corresponding bits are 1 . Hence the value may probably present.
In the above example , k2(“Rahu”) = 6 and the corresponding bit is not 1, hence “Rahu” is definitely not present in the list.
False Positives:
Let’s assume k1, k2, k3 values for both “Mahesh” and “Naveen” are the same. In this case isContains(“Naveen”) returns true, but we have not added “Naveen” to the list, such results are the false positive results. So , the summary
isContains(value) = true , can not be always trusted.Its probably correct
isContains(value) = false, can be trusted 100 %
Analysis:
Next question that might be coming to your mind is
What will be the probability of false positives, can we control the probability of the false positives?
In our above caching example, we can only allow certain limit of false positives or else, majority of the onetime searched elements will start occupying the cache space. So, we need to choose the number of hash functions and size of the bloom filter based on our desired probability.
k -number of hash functions.
m — size of the bloom filters.
n — estimated size of the elements.
If we insert 10 elements and check for the membership of any value in a m=15 and k = 3 bloom filter,
i.e., the result “isContains(value) = true” is correct only 35.4% of the time.
If we insert 100 elements in the same bloom filter,
i.e., the result “isContains(value) = true” is correct only 1% of the time.
So, we need to choose the value of m according to our desired probability, if we can accept only 20% of the false positive values and our audience size may be up to 10000. By substituting p=0.2
So, in order to insert the estimated count of 10000 with the false positive probability restricted to 20 % we need to use a bloom filter of size 33562 and 3 hash functions.
The problem with the bloom filter is the elements cannot be deleted once inserted, hence we have count bloom filter to overcome the same.
Thank you.
package com.learning.filters;
import java.util.ArrayList;
import java.util.Arrays;
public class BloomFilter{
private int m; // size of the bloom filter
private int k ; // number of hash functions
private int n = 0; // total count of the elements inserted in the set
private boolean[] bloomFilter;
public BloomFilter(int m, int k) {
this.m = m;
this.k = k;
this.bloomFilter = new boolean[m];
}
public static int murmur3(String key, int seed) {
byte[] data = key.getBytes();
int length = data.length;
int m = 0x5bd1e995;
int r = 24;
int h = seed ^ length;
int currentIndex = 0;
while (length >= 4) {
int k = data[currentIndex++] & 0xFF;
k |= (data[currentIndex++] & 0xFF) << 8;
k |= (data[currentIndex++] & 0xFF) << 16;
k |= (data[currentIndex++] & 0xFF) << 24;
k *= m;
k ^= k >>> r;
k *= m;
h *= m;
h ^= k;
length -= 4;
}
switch (length) {
case 3:
h ^= (data[currentIndex++] & 0xFF) << 16;
case 2:
h ^= (data[currentIndex++] & 0xFF) << 8;
case 1:
h ^= data[currentIndex] & 0xFF;
h *= m;
}
h ^= h >>> 13;
h *= m;
h ^= h >>> 15;
return h;
}
public void setAllBitsToZero() {
this.n = 0;
for(boolean i : this.bloomFilter) {
i = false;
}
}
public ArrayList<Integer> getBitArrayIndices(String name) {
ArrayList<Integer> indices = new ArrayList<Integer>();
int seed = 42;
int i = 0;
for(i = 1; i <= this.k; i++) {
indices.add(Math.abs((name.hashCode() * i * murmur3(name, seed) % this.m)));
}
return indices;
}
public void add(String name) {
for (Integer i : getBitArrayIndices(name)) {
this.bloomFilter[i] = true;
}
this.n+=1;
}
public boolean contains(String name) {
for (Integer i : getBitArrayIndices(name)) {
if (!this.bloomFilter[i]) {
return false;
}
}
return true;
}
public static void main(String args []) {
BloomFilter bloomFilter = new BloomFilter(10,3);
bloomFilter.add("Mahesh");
System.out.println("BloomFilter value"+ Arrays.toString(bloomFilter.bloomFilter));
System.out.println("Element exists"+bloomFilter.contains("Mahesh"));
}
}
import mmh3
import math
class BloomFilter(object) :
def __init__(self , m, k):
self.m = m # size of bloom filter
self.k = k # number of hash functions
self.n = 0 # total count of elements inserted in the set
self.bloomFilter = [0 for i in range(self.m)]
def _setAllBitsToZero(self):
self.n = 0
for i in self.bloomFilter:
self.bloomFilter[i] = 0
def getBitArrayIndices(self, item):
'''
hashes the key for k defined,
returns a list of the indices which have to be set
'''
indexList = []
for i in range(1, self.k + 1):
indexList.append((hash(item) + i * mmh3.hash(item)) % self.m)
return indexList
def add(self, item):
'''
Insert an item in the filter
'''
for i in self.getBitArrayIndices(item):
self.bloomFilter[i] = 1
self.n += 1
def contains(self, key):
'''
returns whether item exists in the set or not
'''
for i in self.getBitArrayIndices(key):
if self.bloomFilter[i] != 1:
return False
return True
def generateStats(self):
'''
Calculates the statistics of a filter
Probabilty of False positives , predicted False positive rate n, m ,k
'''
n = float(self.n)
m = float(self.m)
k = float(self.k)
probability_fp = math.pow((1.0 - math.exp(-(k*n)/m)), k)
print("Number of elements entered in filter:", n)
print("Number of bits in filter:", m)
print("Number of hash functions used:", k)
print("Predicted Probabilty of false positives:" , probability_fp)
print("Predicted false positive rate:", probability_fp * 100.0, "%")
def reset(self):
'''
Resets the filter and clears old values and statistics
'''
self._setAllBitsToZero()