About Locality Sensitive Hashing

Locality Sensitive Hashing

Nearest Neighbor

Given a set $P$ of $n$ points in $\mathbb{R}^d$

Want to build a data structure to answer nearest neighbor queries
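
For reference, the baseline is a linear scan over $P$; a minimal sketch (the function name and NumPy setup are illustrative, not from the original notes):

```python
import numpy as np

def nearest_neighbor_brute_force(P, q):
    """Answer a nearest neighbor query by scanning all n points: O(nd) time, no preprocessing."""
    P = np.asarray(P, dtype=float)
    dists = np.linalg.norm(P - np.asarray(q, dtype=float), axis=1)
    return P[np.argmin(dists)]
```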

Voronoi Diagram

Build a Voronoi diagram & a point location data structure

A Voronoi diagram, also called a Thiessen polygon diagram or a Dirichlet tessellation, is a partition into contiguous polygons formed by the perpendicular bisectors of the segments connecting neighboring points.
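
A minimal 2D sketch, assuming SciPy is available: `scipy.spatial.Voronoi` builds the diagram, and since locating the cell containing a query point is the same as finding its nearest site, a k-d tree query stands in for the point-location structure here.

```python
import numpy as np
from scipy.spatial import Voronoi, KDTree

rng = np.random.default_rng(0)
P = rng.random((100, 2))        # n = 100 points in the plane

vor = Voronoi(P)                # the Voronoi diagram (cells, vertices, ridges)
tree = KDTree(P)                # stands in for a point-location structure over the cells

q = rng.random(2)
dist, idx = tree.query(q)       # q lies in the cell of its nearest site
print("nearest neighbor of q:", P[idx], "at distance", dist)
```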


Curse of dimensionality

In $\mathbb{R}^2$ the Voronoi diagram has size $O(n)$

A query takes $O(\log n)$ time

In $\mathbb{R}^d$ the complexity of the diagram is $O(n^{\lceil d/2 \rceil})$

Other techniques also scale badly with the dimension

Locality Sensitive Hashing

We will use a family of hash functions such that close points tend to hash to the same bucket.

Put all points of $P$ into their buckets; ideally, the query point $q$ finds its nearest neighbor in its own bucket
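
A generic sketch of that bucketing scheme, with the hash function and the distance left as parameters (the helper names are mine, not from the original notes):

```python
from collections import defaultdict

def build_buckets(points, h):
    """Hash every point of P to its bucket under one LSH function h."""
    buckets = defaultdict(list)
    for p in points:
        buckets[h(p)].append(p)
    return buckets

def query_bucket(buckets, h, q, dist):
    """Search only q's bucket; return the closest candidate found there, or None if it is empty."""
    candidates = buckets.get(h(q), [])
    return min(candidates, key=lambda p: dist(p, q)) if candidates else None
```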

Def (Charikar):

A family $H$ of functions is locality sensitive with respect to a similarity function $0 \le sim(p,q) \le 1$ if

$$\Pr[h(p) = h(q)] = sim(p,q)$$

Example – Hamming Similarity

Think of the points as strings of $m$ bits and consider the similarity $sim(p,q) = 1 - ham(p,q)/m$

$H = \{\, h_i(p) = \text{the } i\text{-th bit of } p \,\}$ is locality sensitive with respect to

$sim(p,q) = 1 - ham(p,q)/m$:

a uniformly random bit position agrees on $p$ and $q$ unless it is one of the $ham(p,q)$ differing bits, so $\Pr[h(p) = h(q)] = 1 - ham(p,q)/m$

while $1 - sim(p,q) = ham(p,q)/m$, so indeed $\Pr[h(p) = h(q)] = sim(p,q)$
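
A sketch of this family for bit strings, checking the collision probability empirically (variable names and the sample strings are mine):

```python
import random

def hamming_sim(p, q):
    """sim(p, q) = 1 - ham(p, q)/m for two m-bit strings."""
    return 1 - sum(a != b for a, b in zip(p, q)) / len(p)

def sample_h(m):
    """Draw h_i uniformly from H = { h_i(p) = the i-th bit of p }."""
    i = random.randrange(m)
    return lambda p: p[i]

p, q = "10110100", "10010110"
trials = 100_000
hits = 0
for _ in range(trials):
    h = sample_h(len(p))           # one random function, applied to both points
    hits += (h(p) == h(q))
print(hits / trials, "should be close to", hamming_sim(p, q))
```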

Example – Jaccard

Think of $p$ and $q$ as sets

$sim(p,q) = jaccard(p,q) = |p \cap q| \,/\, |p \cup q|$

$H = \{\, h_\pi(p) = \min_{x \in p} \pi(x) \mid \pi \text{ a permutation of the items} \,\}$

$\Pr[h_\pi(p) = h_\pi(q)] = jaccard(p,q)$

We need to pick $\pi$ from a min-wise independent family of permutations
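
A MinHash sketch of this, using fully random permutations of a small universe in place of a true min-wise independent family (random permutations are the usual stand-in in examples; real systems typically use universal hash functions):

```python
import random

def sample_minhash(universe, seed):
    """Draw h_pi(p) = min over x in p of pi(x), for a random permutation pi of the universe."""
    rng = random.Random(seed)
    order = list(universe)
    rng.shuffle(order)
    rank = {x: r for r, x in enumerate(order)}
    return lambda s: min(rank[x] for x in s)

def jaccard(p, q):
    return len(p & q) / len(p | q)

universe = range(200)
p = set(range(0, 120))
q = set(range(60, 180))            # |p & q| = 60, |p | q| = 180, so jaccard = 1/3

trials = 20_000
hits = 0
for t in range(trials):
    h = sample_minhash(universe, seed=t)
    hits += (h(p) == h(q))
print(hits / trials, "should be close to", jaccard(p, q))
```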

Map to $\{0,1\}$

Draw a function $b$ mapping hash values to $\{0,1\}$ from a pairwise independent family $B$

So: $h(p) \neq h(q) \Rightarrow \Pr[b(h(p)) = b(h(q))] = 1/2$

$H' = \{\, b(h(\cdot)) \mid h \in H,\ b \in B \,\}$
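
A sketch of this reduction, using the common affine construction $b(x) = ((ax + c) \bmod P) \bmod 2$ over a large prime as an (approximately) pairwise independent family; the prime and the assumption that hash values are integers below it are mine:

```python
import random

PRIME = 2_147_483_647              # assumes hash values are integers in [0, PRIME)

def sample_b():
    """Draw b from B: b(x) = ((a*x + c) mod PRIME) mod 2, approximately pairwise independent."""
    a = random.randrange(1, PRIME)
    c = random.randrange(0, PRIME)
    return lambda x: ((a * x + c) % PRIME) % 2

def compose(h, b):
    """A member of H' = { b(h(.)) : h in H, b in B } -- an LSH function with values in {0, 1}."""
    return lambda p: b(h(p))
```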

Another example (“simhash”)

$$H = \{\, h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector} \,\}$$
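
A sketch of drawing one such function with NumPy; a standard fact (not stated above) is that for this family $\Pr[h_r(p) = h_r(q)] = 1 - \theta(p,q)/\pi$, where $\theta(p,q)$ is the angle between the two vectors.

```python
import numpy as np

def sample_simhash(d, rng=None):
    """Draw h_r(p) = 1 if r·p > 0 else 0, for a random unit vector r in R^d."""
    rng = rng or np.random.default_rng()
    r = rng.normal(size=d)         # a normalized Gaussian vector is a uniformly random direction
    r /= np.linalg.norm(r)
    return lambda p: 1 if float(np.dot(r, p)) > 0 else 0
```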

How do we really use it?

Reduce the number of false positives by concatenating hash functions to get new hash functions (“signatures”)

$$sig(p) = h_1(p)\, h_2(p)\, h_3(p)\, h_4(p) \ldots = 00101010\ldots$$

Very close documents are hashed to the same bucket or to “close” buckets ($ham(sig(p), sig(q))$ is small)
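
A sketch of building such signatures by concatenating $k$ independently drawn 0/1-valued functions from the family and using the resulting bit string as the bucket key (helper names are mine):

```python
from collections import defaultdict

def signature(p, hash_functions):
    """sig(p) = h1(p) h2(p) ... hk(p), written as a bit string such as '00101010'."""
    return "".join(str(h(p)) for h in hash_functions)

def build_signature_buckets(points, hash_functions):
    """Bucket every point by its signature; near-duplicates tend to share a bucket."""
    buckets = defaultdict(list)
    for p in points:
        buckets[signature(p, hash_functions)].append(p)
    return buckets

def signature_hamming(s1, s2):
    """ham(sig(p), sig(q)): small for very close documents, so 'close' buckets can also be probed."""
    return sum(a != b for a, b in zip(s1, s2))
```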

See papers on removing near-duplicates…

A theoretical result on NN

Locality Sensitive Hashing

Thm: If there exists a family H of hash functions such that

$\Pr[h(p) = h(q)] = sim(p,q)$

then $d(p,q) = 1 - sim(p,q)$ satisfies the triangle inequality
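
One way to see this (a standard argument, not spelled out above): $d(p,q) = 1 - sim(p,q) = \Pr[h(p) \neq h(q)]$, and whenever $h(p) \neq h(r)$, at least one of $h(p) \neq h(q)$ or $h(q) \neq h(r)$ must hold, so by the union bound

$$\Pr[h(p) \neq h(r)] \le \Pr[h(p) \neq h(q)] + \Pr[h(q) \neq h(r)],$$

which is exactly $d(p,r) \le d(p,q) + d(q,r)$.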

Alternative Def (Indyk-Motwani):

A family $H$ of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if

  • $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$

  • $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$

If $d(p,q) = 1 - sim(p,q)$, then this holds with $p_1 = 1 - r_1$ and $p_2 = 1 - r_2$ for any $r_1, r_2$, since $\Pr[h(p) = h(q)] = sim(p,q) = 1 - d(p,q)$

(r,ε)-neighbor problem

  • If there is a neighbor $p$ such that $d(p,q) \le r$, return some $p'$ s.t. $d(p',q) \le (1+\varepsilon)r$.

  • If there is no $p$ s.t. $d(p,q) \le (1+\varepsilon)r$, return nothing.

(The first condition is the real requirement: if we satisfy only the first, we can satisfy the second by filtering out answers that are too far.)

  • Let's construct a data structure that succeeds with constant probability
  • Focus on the Hamming distance first

NN using locality sensitive hashing

Take an $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family

If there is a neighbor at distance $r$, we catch it with probability $p_1$, so to catch it with constant probability we use $1/p_1$ functions.

But we also get false positives in our $1/p_1$ buckets. How many? About $n p_2 / p_1$ in expectation.
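
A sketch of that plan for the Hamming case: draw $\lceil 1/p_1 \rceil$ single-bit functions independently, build one bucket table per function, and filter candidates so the answer solves the $(r,\varepsilon)$-neighbor problem (parameter choices and names are illustrative):

```python
import math
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def build_lsh_index(points, m, r):
    """Build ceil(1/p1) tables, each keyed by one randomly chosen bit, where p1 = 1 - r/m."""
    p1 = 1 - r / m
    L = math.ceil(1 / p1)
    tables = []
    for _ in range(L):
        i = random.randrange(m)                  # h_i(p) = the i-th bit of p
        table = defaultdict(list)
        for p in points:
            table[p[i]].append(p)
        tables.append((i, table))
    return tables

def query(tables, q, r, eps):
    """Return some p' with ham(p', q) <= (1+eps)r if one shares a bucket with q, else None."""
    for i, table in tables:
        for p in table.get(q[i], []):
            if hamming(p, q) <= (1 + eps) * r:   # filter out the false positives
                return p
    return None
```

The buckets examined contain, in expectation, about $n p_2 / p_1$ points farther than $(1+\varepsilon)r$, which is exactly the false-positive cost quoted above.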

Note 1: In the future, when describing a point set, I could start by drawing its Voronoi diagram

Note 2: I did not fully understand the min-wise part

Note 3: “concatenating” means joining together

Note 4: “Alternative Def” means an alternative definition

Original post: https://www.cnblogs.com/xiaofeisnote/p/13443669.html