Locality Sensitive Hashing using Euclidean Distance

It’s quite similar to Locality Sensitive Hashing (LSH) for Cosine Similarity which we covered earlier. I will be referring to the same here, so it’s better if you go through the same before proceeding.

The difference lies in the way we compute hash value. As we have seen we can divide the region using planes. In each region we can have data-points.

Follow these steps (refer to diagram)

LSH for Euclidean
LSH for Euclidean

1. Divide the plane into small parts.
2. Project each data-point on the planes.
3. For each datapoint take the distance along each plane and use it to calculate the hash value.

Rest of the procedure is similar to cosine similarity process. Like finding the nearest neighbor.

Locality Sensitive Hashing using Cosine Similarity

The problem we are trying to solve is to predict the class of a new data point, given a dataset with pre-classified data points.

Two key ideas we will use here areĀ  k-NN algorithm and LSH. If you don’t know about these concepts then I will suggest you to check them out first.

What is Cosine Similarity?
At a high level cosine similarity can tell us how similar two points are. To do this we compute the vector representation for the two points and then find the angle between the two vectors.

The similarity between vectors a and b can be given by cosine of the angle between them.

We can use this concept to calculate the hash value for a data point.

Now that we know cosine similarity we can use this to calculate LSH values for data points. To do this we divide the space using hyperplanes.

Refer to the image below to understand each of the points explained next.

Hashing using Cosine Similarity
Hashing using Cosine Similarity

For simplicity consider a 2-D space with X-Y axis. We can divide this into 4 regions by using 2 planes / lines L1 and L2.

So a data point “A” will reside in one of these regions. For each plane we can find in which direction the point “A” lies, by using the concept of normal vector.

This way we can find the value for each plane. For each plane the value will be either +1 or -1. We can use this to calculate Hash Key.

Once we have the hash table in place we can use this to determine the key for a new data-point. And then find the nearest neighbors.

Say the new point lands in the bucket with key =1. Then we know it’s near to the points A,B. Next apply k-NN to find it’s classification.