For the background of Map/Reduce implementation on Hadoop. I have a previous post that covers the details.

### Large Scale Similarity Computation

Lets say there are N items (N is billions) and we want to find all those that are similar to each other. (similarity is defined by a distance function). The goal is to output a similarity matrix. (Notice that this matrix is very sparse and most of the cells are infinite)One naive way is to compute the similarity of each possible pairs of items, hence an O(N^2) problem which is huge. Can we reduce the order of complexity ?

### Location Sensitive Hashing

**First idea:**Find a hashing function such that similar items (say distance is less than some predefined threshold) will be hashed to the same bucket.

Lets say if we pick the hash function such that Probability(H(a) == H(b)) is proportional to similarity between a and b. And then we only perform detail comparison on items that falls into the same bucket.

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison.

```
x <- seq(0, 1, 0.01)
y <- x
plot(x, y, xlab="similarity", ylab="prob of detail compare")
```

Lets say we are interested in comparing all pairs of items whose similarity is above 0.3, we have a problem here because we have probability 0.7 = 1 - 0.3 of missing them (as they are not landing on the same bucket). We want a mechanism that is highly selective; probability of performing a detail comparison should be close to one when similarity is above 0.3 and close to zero when similarity is below 0.3.

**Second idea:**Lets use 100 hash functions and 2 items that has 30 or more matches of such hash functions will be selected for detail comparison.

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison.

```
# Probability of having more than "threshold" matches out
# of "no_of_hash" with a range of varying similarities
prob_select <- function(threshold, similarity, no_of_hash) {
sum <- rep(0, length(similarity))
for (k in 0:floor(no_of_hash * threshold)) {
sum <- sum + dbinom(k, no_of_hash, similarity)
}
return(1 - sum)
}
x <- seq(0, 1, 0.01)
y <- prob_select(0.3, x, 100)
plot(x, y, main="black: 100 hashes, Red: 1000 hashes",
xlab="similarity", ylab="prob of detail compare")
lines(x, y)
y <- prob_select(0.3, x, 1000)
lines(x, y, col="red")
```

The graph looks much better this time, the chance of being selected for detail comparison jumps from zero to one sharply when the similarity crosses 0.3

To compare the items that are similar, we first compute 100 hashes (based on 100 different hash functions) for each item and output all combination 30 hashes as a key. Then we perform pairwise comparison for all items that has same key.

But look at the combination of 30 out of 100, it is 100!/(30! * 70!) = 2.93 * 10^25, which is impractically huge. Even the graph is a nice, we cannot use this mechanism in practice.

Third idea: Lets use 100 hash function and break them into b groups of r each (ie: b*r = 100). Further let assume b = 20, and r = 5. In other words, we have 20 groups and Group1 has hash1 to hash5, Group2 has hash6 to hash10 ... etc. Now, we call itemA's group1 matches itemB's group1 if all their hash1 to hash5 are equal to each other. Now, we'll perform a detail comparison of itemA and itemB if any of the groups are equal to each other.

Probability of being selected is 1 - (1-s^r)^b

The idea can be visualized as follows

Notice that in this model, finding r and b based on s is a bit trial and error. Here we try 20 by 5, 33 by 3, 10 by 10.

```
prob_select2 <- function(similarity, row_per_grp, no_of_grp) {
return(1 - (1 - similarity^row_per_grp)^no_of_grp)
}
x <- seq(0, 1, 0.01)
y <- prob_select2(x, 5, 20)
plot(x, y,
main="black:20 by 5, red:10 by 10, blue:33 by 3",
xlab="similarity", ylab="prob of detail compare")
lines(x, y)
y <- prob_select2(x, 10, 10)
lines(x, y, col="red")
y <- prob_select2(x, 3, 33)
lines(x, y, col="blue")
```

From the graph, we see the blue curve fits better to select the similarity at 0.3. So lets use 33 by 3.

Notice that the ideal graph should be a step function where probability jumps from 0 to 1 when similarity cross over the similarity threshold that we are interested to capture (ie: we want to put all pairs whose similarity bigger than this threshold to be in a same bucket and all pairs whose similarity less that this threshold to be in a different bucket). Unfortunately, our curve is a S-curve, not a step function. This means there will be false positive and false negative. False position lies on the left side of the similarity threshold where we have a small chance to put them into the same bucket, which will cost up some extra work to compare them later and throw them away. On the other hand, false negative lies on the right side where we have a small chance of putting items that are very similar into different buckets and not considering them in the detail comparison. Depends on whether we need to catch all the similar items above the threshold, we may need shift the S curve left or right by tuning the r and b parameters.

To perform the detail comparison, we can use a parallel Map/Reduce implementation

### Map Reduce Implementation

Here we have two round of Map/Reduce. In the first round, map function will compute all the groupKeys for each item and emit the groupKey with the item. All the items that has the groupKey matches will land on the same reducer, which creates all the possible pairs of items (these are candidates for pairwise comparison).However, we don't want to perform the detail comparison in the first round as there may be many duplicates for item pairs that matches more than one group. Therefore we want to perform another round of Map/reduce to remove the duplicates.

The first round proceeds as follows ...

After that, the second round proceeds as follows ...

By combining Location Sensitive Hashing and Map/Reduce, we can perform large scale similarity calculation in an effective manner.