Bloom Filter

1。布隆过滤器原理

如何判断一个元素是否在集合中？

一般想法是将集合所有元素保存起来，然后通过查找确定。Hashtable比较合适，O(1)的平均时间复杂度，但Hashtable一个比较大的问题是空间效率，一般存储n个元素，需要2n大小空间。

而Bloom Filter实现同样的功能，使用更小的空间。

原理：

1）一个非常大的bit数组（len=m），2）一组哈希函数（size=k）

bit数组置0，将元素通过k个哈希函数计算，将bit数组相应offset置位为1。查找时，亦将元素通过k个哈希函数，分别检查offset bit位，若任一个为0，则该元素一定不存在集合中。

Bloom Filter只能用来判断不存在的情况。判断存在有false-positive，即如果检查全部是1，也不能确定是否存在。

2。应用

1）防缓存击穿，2）爬虫URL去重，3）垃圾邮件地址过滤

3。Go语言实现

实现2个函数：

type Interface interface {
Add(item []byte) // 元素加入bloom filter
Test(item []byte) bool // 检查元素是否存在
}

定义Bloom Filter数据结构：

// BloomFilter probabilistic data structure definition
type BloomFilter struct {
    bitset []bool      // 简单起见，使用bool数组代替bitset
      k      uint         // hash函数个数
      n      uint         // bloom filter中已存在元素个数
      m      uint         // bitset长度
      hashFuncs[]hash.Hash64           // hash函数数组
}

Add函数：

func (bf *BloomFilter) Add(item []byte) {
hashes := bf.hashValues(item)
i := uint(0)

     for {
            if i >= bf.k {
                  break
            }

position := uint(hashes[i]) % bf.m
bf.bitset[uint(position)] = true

i+= 1
}

bf.n += 1
}

Test函数：

func (bf *BloomFilter) Test(item []byte) (exists bool) {
      hashes := bf.hashValues(item)
      i := uint(0)
      exists = true

      for {
            if i >= bf.k {
                  break
            }

            position := uint(hashes[i]) % bf.m
            if !bf.bitset[uint(position)] {
                  exists = false
                  break
            }

i+= 1
}

return
}

4。bitset版本

上面版本使用的是伪的bitset，提高内存效率可以使用真正的bitset代替bitset []bool

https://github.com/willf/bitset

bitset []bool -> bitset.BitSet即可

另外支持分布式的Bloom Filter可以使用redis的bitmap实现。

————

前几天有个需求，检查每个用户是否第一次访问某个页面，使用了redis的Hashtable，field存shopid，真的蠢了，应该用bitmap