Bloom Filter Python

http://bitworking.org/news/380/bloom-filter-resources

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed (though this can be addressed with a counting filter). The more elements that are added to the set, the larger the probability of false positives.

http://www.google.com.hk/ggblog/googlechinablog/2007/07/bloom-filter_7469.html

在日常生活中，包括在设计计算机软件时，我们经常要判断一个元素是否在一个集合中。比如在字处理软件中，需要检查一个英语单词是否拼写正确（也就是要判断它是否在已知的字典中）；在 FBI，一个嫌疑人的名字是否已经在嫌疑名单上；在网络爬虫里，一个网址是否被访问过等等。

最直接的方法就是将集合中全部的元素存在计算机中，遇到一个新元素时，将它和集合中的元素直接比较即可。一般来讲，计算机中的集合是用哈希表（hash table）来存储的。它的好处是快速准确，缺点是费存储空间。布隆过滤器只需要哈希表 1/8 到 1/4 的大小就能解决同样的问题。

为什么(原文没说, 我的理解), 因为hash如果要work就要避免冲突, 要避免冲突就需要很大的bucket空间(bit). 而Bloom的优点时允许冲突, 但他通过增加hash函数的数量, 来减小同时冲突的概率, 所以可以用更小的空间.

而且hash table的实现往往用的是指针array, 用于指向集合元素, 而bloom的实现用的是bitarray, 因为你不需要得到这个集合元素, 只是知道他有没有.

pybloom 1.0.2

http://pypi.python.org/pypi/pybloom/1.0.2

>>> b = BloomFilter(capacity=100000, error_rate=0.001)
>>> b.add("test")
False
>>> "test" in b
True