simhash

1,SimHash

https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html

64位Hash为什么海明距离选3?

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf

上链接右上precision-recall 曲线图,3是最平衡点,即不会错判太多重复,也不会漏掉很多。

SimHash第一步需抽关键词(feature),并有权重(weight,这个一般是基于统计的?没有词库的默认就用1了)

Feature算法,选 slide windows方法,golang代码如下:

func (t *OverlappingStringTokeniser) Tokenise(input string) []string {
  var chunks []string
  inputLen := len(input)
  for position := 0; position < inputLen-int(t.chunkSize); position += int(t.chunkSize - t.overlapSize) {
    chunks = append(chunks, input[position:position+int(t.chunkSize)])
  }
  return chunks
}

1)简单

2)不比其实算法效果差

计算海明距离:

// Compare calculates the Hamming distance between two 64-bit integers
//
// Currently, this is calculated using the Kernighan method [1]. Other methods
// exist which may be more efficient and are worth exploring at some point
//
// [1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
func Compare(a uint64, b uint64) uint8 {
	v := a ^ b
	var c uint8
	for c = 0; v != 0; c++ {
		v &= v - 1
	}
	return c
}

  

原文地址:https://www.cnblogs.com/gm-201705/p/8458859.html