Assignment 7: duplicate detection
Deadline: Dec. 22, 2022, 14:00 CET
In this assignment, we will work with hash functions for duplicate detection. In particular, we will use Bloom filter for memory efficient approximate duplicate detection.
A Bloom filter is a bit array, with all 0s initially. When an object is added to the filter, we run k (independent) hash functions to hash the input object, and set the bits in the Bloom filter whose indices correspond to the k hash values obtained. When checking if an object is in the filter, we check if all bits corresponding to each hash function are set to one. If so, we state that the object is probably in the filter. Otherwise the object is not in the filter.
For example, assume we have two hash functions h1 and h2, both returning 32 bit hash codes, and we are using a Bloom filter of size=101, For a given object s we obtain h1(s) and h2(s), take their module 101, and update the bits corresponding to these two indices to 1.
A Bloom filter allows checking set membership (e.g., objects seen before) without keeping the objects themselves. In fact, it is also more space-efficient in comparison to keeping the set of hash values of the objects, without causing many more false positives. The size of the filter and the number of hash function determines the number of collisions. For a filter size m that will be used for n items, the optimum number of hash functions is ln 2 * m / n.
In this assignment, we will use a Bloom filter to detect duplicate words. Detecting duplicate words, sentences, documents is a common problem, and for large corpora (with billion of items), if deduplication is needed, Bloom filter can provide a memory efficient deduplication method with some cost of false positives (discarding some non-duplicate elements).
Exercises
6.1 Implement a cyclic-shift hash function
Implement the cyclic-shift hash function defined in the template. You are free to implement any valid hash function, but you should try to avoid collisions as much as possible.
6.2 Bloom filter for eliminating duplicates from a word list
Implement the BloomFilter
class in the template,
which should use the cyclic_hash
function implemented
in the previous exercise.
You can obtain different hash functions/codes by altering
the amount of shift, or the initial hash value (or both).
Your bloom filter should never allow duplicates,
and the false-positive rate should be less than 0.1
for the provided wordlist with duplicates.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- add the honor code (I/We pledge that this code represents my/our own work)
- commit all your changes
- push it to your GitHub repository