resemblance with the jaccard coefficient

## << back to other nerdy projects

## part 1: resemblance with the jaccard coefficient

## part 2: fastmap projection using jaccard distances

## part 3: the simhash algorithm

## part 4: a sketching algorithm

## huh?

i started working on another rss feed classification technique using a data duplication algorithm to classify articles.

the idea is that an article can be classified by determining which class it is most likely a duplicate of.

however half way through i realised this technique could work against a problem we were seeing at work and changed to start work on that data instead

it's a bit sad i know but data is data and it's still an interesting problem.

i'll use nothing but publicly available data for this, and if it looks promising i might get a chance to work on it further during business hours!all discussed ruby/c++ code is available from http://github.com/matpalm/resemblance

## so what is the actual problem?

given two very similiar business names, address pairs can we decide if they are actually the same company?

let's consider some examples...eg1

Burra Hotel, 5 Market Sq, Burra, SA, 5417

Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280

it's pretty obvious these are not the same company. next!eg2

One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152

One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152

i think these are the same, it's just one is using an abbrev for street.eg3

Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450

Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450

i think these areallthe same.eg 4

Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073

Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073

this pair is interesting.... theymightbe the same, but maybe not...eg 5

Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038

Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038

this pair is also interesting for the same reasons.## shingling

shingling is a way of generating a set that represents a bit of data which can be used for comparisons

eg. the 4 bigram shingles of"the cat sat on the cat"are...

- the cat
- cat sat
- sat on
- on the
(note: this is a set so we only count the shingle "the cat" once)

## the jaccard index

the jaccard index is a simple measure of how similiar two sets are.

it's simply the ratio of the size of the intersection of the sets and the size of the union of the sets.eg. if

J(A,B)is jaccard index between setsAandB

andA = {1,2,3}, B = {2,3,4}, C = {4,5,6},

thenJ(A,B) = 2/4 = 0.5,

andJ(A,C) = 0/6 = 0,

andJ(B,C) = 1/5 = 0.2

so the most "similiar" sets areAandBand the least similiar areAandC

(note alsoJ(A,A) = J(B,B) = J(C,C) = 1)## putting it all together

so given two business name/addresses we can build a shingling set for each and use the jaccard index to decide how similiar they are.

we'll use bigrams for building our sets but lets use character bigrams, not word bigrams.

this is since the documents are quite small and we want to include puncutation in the comparisons...lets run through our above examples again...

eg 1

Burra Hotel, 5 Market Sq, Burra, SA, 5417

is represented by the set of 2 character-gram shingles

{" 5", " B", " H", " M", " S", ", ", "17", "41", "5 ", "54", "A,", "Bu", "Ho", "Ma", "SA",

"Sq", "a ", "a,", "ar", "el", "et", "ke", "l,", "ot", "q,", "ra", "rk", "rr", "t ", "te", "ur"}Camping Country Superstore, 401 Pacific Hwy, Belmont North, NSW, 2280

is represented by the set of 2 character-gram shingles

{" 2", " 4", " B", " C", " H", " N", " P", " S", ", ", "01", "1 ", "22", "28", "40", "80",

"Be", "Ca", "Co", "Hw", "NS", "No", "Pa", "SW", "Su", "W,", "ac", "am", "c ", "ci", "e,", "el", "er", "fi",

"g ", "h,", "ic", "if", "in", "lm", "mo", "mp", "ng", "nt", "on", "or", "ou", "pe", "pi", "re", "rs", "rt",

"ry", "st", "t ", "th", "to", "tr", "un", "up", "wy", "y ", "y,"}

they have an intersection size of 6 shingles and a union size of 87 shingles, hence a jaccard index of 6/87 = 0.068

eg 2

One Stop Bakery, 1304 High St Rd, Wantirna, VIC, 3152 and

One Stop Bakery, 1304 High Street Rd, Wantirna South, VIC, 3152

have an intersection size of 46 shingles and a union size of 57 shingles, hence a jaccard index of 46/57 = 0.807eg 3

a) Park Beach Interiors, Showroom Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

b) Park Beach Interiors, Showroom Park Beach Plaza Pacific Highway, Coffs Harbour, NSW, 2450

c) Park Beach Interiors, Park Beach Plaza Pacific Hwy, Coffs Harbour, NSW, 2450

d) Park Beach Interiors, 26 Park Beach Plaza, Pacific Hwy, Coffs Harbour, NSW, 2450

have indexes J(ab)=0.888, J(ac)=0.861, J(ad)=0.808, J(bc)=0.760, J(bd)=0.716, J(cd)=0.932eg 4

Weaver Interiors, 955 Pacific Hwy, Pymble, NSW, 2073 and

Weaver Interiors, 997 Pacific Hwy, Pymble, NSW, 2073

have an intersection size of 43 shingles and a union size of 49 shingles, hence a jaccard index of 43/49 = 0.877eg 5

Gibbon Hamor Commercial Interiors, 233 Johnston St, Annandale, NSW, 2038 and

Gibbon Hamor Development Planners, 233 Johnston St, Annandale, NSW, 2038

have an intersection size of 49 shingles and a union size of 76 shingles, hence a jaccard index of 49/76 = 0.644## conclusion

though there is no obvious magic cutoff point it seems to give pretty good values.

it would find some obvious duplicates, though would require a bit of human double checking to make sure.here's a histogram of the frequency of resemblance values from the comparison of all pairs of 2000 name addresses

(a total of 1,999,000 comparisons and notice the y scale is logarithmic)

## algorithmic discussion

## order n squared sucks

the jaccard coefficient is, unfortunately, not transistive

(ie if we know J(A,B) and J(B,C) it tells use nothing about J(A,C)naively then to determine the pair with the highest similarity requires we compare

every element with.

every other element

this is O(n^{2}) and O(n^{2}) sucks since we are looking at (n(n-1))/2 comparisons, joy!lets examine some of the ruby runtimes

num records comparisons time 50 1,225 0.2s 100 4,950 0.9s 250 31,125 5.6s 500 124,750 24s 750 280,875 52s 2000 1,999,000 6m 57s and just say i ran this over a subset of the full data, say, 1,000,000 records

it would be 499,999,500,000 comparisons

and at about 300,000 per minute we'll be here till christmas (2011)( luckily the actual data allows me to do something which reduces the runtime to be O(n) but i'm not going to talk about it out of work)

## bit level optimisation in c++

i decided to reimplement this in c++ and go the whole hog by using a bit level representation of the data to wring everything out of the machine.

the big question is: how to optimise the jaccard index calculation? it's where the time is spent.

consider the shingle sets for "cat" and "mat", ie

{"ca","at"}and{"ma","at"}

we can convert shingles to ints by taking all the unique ones and mapping them to ints from a sequence starting at 0

ie{ "ca" => 0, "at" => 1, "ma" => 2}

giving us the two equivalent shingle sets{0,1}and{2,1}

finally we can use the values in these sets to set bits in a nibble

giving us the two nibbles0011(setting bits 0 and 1) and0110(setting bits 2 and 1)now consider the bit representations and the results of the bitwise operators | and &

0011(equivalent to{"ca","at"})

0110(equivalent to{"ma","at"})

& 0010=> and'ing the bits strings gives us their intersection!

| 0111=> or'ing the bits strings gives us their union!the number of bits set in x0010 (size of intersection) is 1 and

the number of bits set in x0111 (size of union) is 3

so the jaccard index of "cat" and "mat" is 1/3

note: we can count the number of bits set with a crazy bit of c like

inline int count_number_bits_set(long l) {

unsigned int c;

for(c=0;l;c++)

l &= l-1;

return c;

}

(thanks to brian kernighan for that one)

using this method we can calculate the union or intersection of a 4 byte long (ie 32 set elements) in a single | or &!

bamm!finally we can use the awesome openmp library ( available as part of gcc since 4.2 )

with two additional lines of code (both pragma statements) we can give hints to the compiler where the code can be multithreaded

num records comparisons ruby time c++ time c++ openmp time 50 1,225 0.29s 0.008s 0.013s 100 4,950 0.97s 0.01s 0.013s 250 31,125 5.5s 0.04s 0.04s 500 124,750 22s 0.12s 0.09s 1000 499,500 1m 30s 0.37s 0.2s 2000 1,999,000 6m 34s 1.2s 0.5s 4000 7,998,000 ? 7.4s 1.8s 8000 31,996,000 ? 21s 6.2s 16000 127,992,000 ? ? 26s so the ruby code is getting about 5,000 a second

the single threaded c++ implementation is getting about 1,500,000 a second

and the c++ implementation using openmp on a quad core box (utilising about 350% cpu) is getting about 5,000,000 a second

this is a speed up of about 1,000 times

booya! that's more like it!now lets consider the jaccard distance after which we'll consider the simhash algorithm as a way of avoiding all that O(n

^{2}) nastiness.