Locality Sensitive Hashing using Euclidean Distance

This is quite similar to Locality Sensitive Hashing (LSH) for Cosine Similarity, which we covered earlier. I will refer to that discussion here, so it is worth going through it before proceeding.

The difference lies in the way we compute the hash value. As we have seen, we can divide the space into regions using planes, and each region can contain data points.

Follow these steps (refer to the diagram):

[Figure: LSH for Euclidean distance]

1. Divide the plane into small parts.
2. Project each data point onto the planes.
3. For each data point, take the distance along each plane and use it to calculate the hash value (a small sketch of this follows below).
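A minimal sketch of this hashing idea in Python (the random projection lines and the bucket width below are illustrative assumptions, not fixed values):

```python
import numpy as np

def euclidean_lsh_hash(point, random_lines, bucket_width=1.0):
    """Project the point onto each random line and record which
    segment (bucket) of width `bucket_width` the projection falls in."""
    point = np.asarray(point, dtype=float)
    hash_values = []
    for line in random_lines:
        projection = np.dot(point, line)                      # distance along this line
        hash_values.append(int(np.floor(projection / bucket_width)))
    return tuple(hash_values)                                 # combined hash key

# Example: 2-D points hashed with 3 random projection lines
rng = np.random.default_rng(0)
lines = [rng.normal(size=2) for _ in range(3)]
print(euclidean_lsh_hash([1.5, 2.0], lines))
```

Points whose projections fall into the same segments on every line get the same key, so nearby points tend to share a bucket.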

The rest of the procedure, such as finding the nearest neighbours, is the same as in the cosine similarity case.

Locality Sensitive Hashing using Cosine Similarity

The problem we are trying to solve is to predict the class of a new data point, given a dataset with pre-classified data points.

Two key ideas we will use here are the k-NN algorithm and LSH. If you are not familiar with these concepts, I suggest checking them out first.

What is Cosine Similarity?
At a high level, cosine similarity tells us how similar two points are. To do this we compute the vector representations of the two points and then find the angle between the two vectors.

The similarity between vectors a and b is given by the cosine of the angle between them.
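Concretely, for two vectors a and b this is:

cos(θ) = (a · b) / (‖a‖ × ‖b‖)

A value close to 1 means the vectors point in nearly the same direction, while a value close to -1 means they point in opposite directions.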

We can use this concept to calculate the hash value for a data point.

Now that we understand cosine similarity, we can use it to calculate LSH values for data points. To do this, we divide the space using hyperplanes.

Refer to the image below to understand each of the points explained next.

[Figure: Hashing using Cosine Similarity]

For simplicity, consider a 2-D space with X and Y axes. We can divide it into 4 regions using 2 planes (lines) L1 and L2.

A data point “A” will reside in one of these regions. For each plane, we can find on which side point “A” lies by using the plane’s normal vector: if the dot product of A with the normal is positive, A lies on the positive side of the plane, otherwise on the negative side.

This gives us a value for each plane, either +1 or -1, and we can combine these values to calculate the hash key.
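A minimal sketch of this hashing step (the random normal vectors and the way the per-plane signs are packed into a key are illustrative assumptions):

```python
import numpy as np

def cosine_lsh_key(point, plane_normals):
    """For each hyperplane, check on which side of it the point lies
    (sign of the dot product with the plane's normal) and combine the
    results into a bucket key."""
    point = np.asarray(point, dtype=float)
    bits = []
    for normal in plane_normals:
        bits.append("1" if np.dot(point, normal) >= 0 else "0")   # +1 side -> "1", -1 side -> "0"
    return "".join(bits)                                           # e.g. "10" for planes L1, L2

# Example: a 2-D space divided by two planes (lines) L1 and L2
rng = np.random.default_rng(1)
normals = [rng.normal(size=2) for _ in range(2)]
print(cosine_lsh_key([2.0, 3.0], normals))
```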

Once the hash table is in place, we can compute the key for a new data point in the same way and then look up its nearest neighbours.

Say the new point lands in the bucket with key = 1. Then we know it is near points A and B. Next, apply k-NN to those neighbours to find its classification.

What is a k-d tree

A k-d tree is a binary-tree-based data structure. It can be used for k-dimensional data.

What do we mean by k-dimensional?
You may have heard of 2-D and 3-D spaces. Similarly, we can have spaces of higher dimensions; k-dimensional simply means each data point has k attributes or coordinates.

Why do we need k-d tree?
Using a k-d tree we can partition a k-dimensional space into regions, which allows for efficient searching. It is used in computer graphics and nearest neighbour searches.

How does it work?
Let us consider a 2-D dataset. A point in it can be represented as <X, Y>.

Constructing a k-d tree

We follow these steps to construct the tree.

1. Select a dimension and project all points onto it. Example: project all points onto the X-axis.
2. Take the mean of the projected values. Let’s call it X_M.
3. Split on the mean: if a point’s X value is less than X_M, it goes to the left sub-tree, otherwise to the right.
4. Repeat steps 1-3 recursively inside each sub-tree, cycling through the dimensions (X, then Y, then X again, and so on).

By constructing this tree we have partitioned the space into smaller regions. Given a new query point, we can traverse the tree to find the region it belongs to.
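A simplified sketch of this construction for 2-D points, splitting on the mean as described above and cycling through the dimensions by depth (the class and function names are illustrative):

```python
class Node:
    def __init__(self, point=None, split_dim=None, split_value=None, left=None, right=None):
        self.point = point              # data point stored at a leaf
        self.split_dim = split_dim      # 0 for X, 1 for Y
        self.split_value = split_value  # the mean along the split dimension (X_M)
        self.left = left
        self.right = right

def build_kd_tree(points, depth=0, k=2):
    if len(points) <= 1:
        return Node(point=points[0]) if points else None
    dim = depth % k                                   # alternate X, Y, X, Y, ...
    mean = sum(p[dim] for p in points) / len(points)  # step 2: the mean X_M
    left = [p for p in points if p[dim] < mean]       # step 3: points below X_M go left
    right = [p for p in points if p[dim] >= mean]
    if not left or not right:                         # all points on one side: stop splitting
        return Node(point=points[0])
    return Node(split_dim=dim, split_value=mean,
                left=build_kd_tree(left, depth + 1, k),
                right=build_kd_tree(right, depth + 1, k))

tree = build_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```

Traversing from the root, a query point goes left or right at each node by comparing its coordinate with the stored split value, which quickly narrows the search down to a small region.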

What is Locality Sensitive Hashing

It is a beautiful technique for finding similar points, that is, nearest neighbours. It uses the concept of hashing to do this.

You may be familiar with how a hash table is constructed. If not, please follow this link for a quick refresher on hashing concepts. It is a very efficient data structure that allows us to perform insert and lookup operations in O(1) average time.

How does it work?

We apply a hash function to each available data point. This gives us the bucket or the key where the point can be placed.

We try to maximize collisions, so that similar points go to the same bucket.

Now if we take an unlabelled query point and apply the hash function, it will map to the same bucket where similar points already exist.

This makes it easy to find the nearest neighbours of the query point, and we can then apply k-NN to predict its class.
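Putting this together, a minimal sketch of such a bucket table (here `hash_fn` stands in for whatever LSH hash is used, for example the `cosine_lsh_key` sketched earlier):

```python
from collections import defaultdict

def build_lsh_table(labelled_points, hash_fn):
    """Place every labelled point into the bucket given by its hash key."""
    table = defaultdict(list)
    for point, label in labelled_points:
        table[hash_fn(point)].append((point, label))
    return table

def candidate_neighbours(table, query, hash_fn):
    """Hash the query point and return the points that share its bucket."""
    return table.get(hash_fn(query), [])
```

The candidates returned for a query are then fed to k-NN for the final vote.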

What is the Weighted k-NN Algorithm?

We have seen the basic k-NN algorithm. You may want to read that first to better understand this concept.

Using k-NN we try to predict the class of a query point “Q” (an unclassified data point). To do this, we find its neighbours, i.e. points that are similar or close to the query point, and then take a majority vote.

But we miss something here. Suppose we have 5 neighbours. Shouldn’t the points that are closer to Q be given more importance?

That is exactly what we do here. We assign weights to the data points based on their distance from Q. One way to calculate the weight is to take the reciprocal of the distance, i.e. weight = 1 / distance.

Consider the following neighbours of Q. Here we have added a new column, Weight = 1 / Distance.

| Point | Label    | Distance | Weight |
|-------|----------|----------|--------|
| A     | Positive | 0.1      | 10     |
| B     | Negative | 0.2      | 5      |
| C     | Positive | 0.4      | 2.5    |
| D     | Negative | 0.5      | 2      |

Now that we have weights assigned to each neighbour we can use weighted k-NN to predict the Label for Q. To do this we calculate positive and negative sums.

Positive Sum
We calculate the sum of weights for the positively labelled points, i.e. A and C: 10 + 2.5 = 12.5

Negative Sum
We calculate the sum of weights for the negatively labelled points, i.e. B and D: 5 + 2 = 7

Since Positive Sum > Negative Sum we can say that for Q the label will be Positive.
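The same calculation as a small Python sketch (the neighbour list is simply the table above):

```python
neighbours = [            # (label, distance) pairs from the table above
    ("Positive", 0.1),
    ("Negative", 0.2),
    ("Positive", 0.4),
    ("Negative", 0.5),
]

scores = {}
for label, distance in neighbours:
    scores[label] = scores.get(label, 0.0) + 1.0 / distance   # weight = 1 / distance

print(scores)                        # {'Positive': 12.5, 'Negative': 7.0}
print(max(scores, key=scores.get))   # Positive
```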

Time Based Splitting of Data

When building a machine learning model we split the given dataset into training and test data. We can either split randomly or use time-based splitting.

For time based splitting we need a timestamp as one of the attributes / features.

For example, an e-commerce website can have reviews for various products, and these reviews carry timestamps. In such scenarios it is better to use a time-based strategy.

To do this we first sort the reviews by timestamp and then do the split.

  1. Sort data by time.
  2. Split – Training (80%) and Testing(20%)

This approach can give a more realistic measure of accuracy, since the model is tested on the most recent data, which is closest to what it will see when deployed.
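A minimal sketch of such a split (assuming each record carries a timestamp field; the field name and data below are illustrative):

```python
def time_based_split(records, train_fraction=0.8):
    """Sort records by timestamp, then use the oldest 80% for training
    and the most recent 20% for testing."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

reviews = [
    {"timestamp": "2019-03-01", "text": "Great product"},
    {"timestamp": "2018-11-15", "text": "Not worth it"},
    {"timestamp": "2020-01-20", "text": "Works as described"},
]
train, test = time_based_split(reviews)   # train gets the 2 oldest reviews
```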

What is the k-nearest neighbors algorithm?

In k-NN we have a labelled dataset and we want to use it to predict the class of an unlabelled query point. Stay with me if this is not clear yet.

A query point is simply an object of the same kind that has not yet been classified.

To understand k-NN let us consider a simple dataset of Bollywood Movies.

| Name            | Lead Actor | Genre    | Director   | Production  | Box Office |
|-----------------|------------|----------|------------|-------------|------------|
| Dangal          | Aamir Khan | Sports   | Nitish     | Aamir Khan  | Hit        |
| Badhai Ho       | Ayushman   | Comedy   | Amit       | Amit Sharma | Hit        |
| Namaste England | Arjun      | Romantic | Vipul Shah | Pen India   | Flop       |
In this dataset the box-office result is the label: based on the attribute values, a movie is classified as a Hit or a Flop. Of course there will be many other factors, but for ease of understanding we have taken only a few features.

k-NN helps us predict if a new movie will be a Hit or a Flop, using this data.

A given movie in this dataset can be represented as a vector. Vectors are used to represent points in space; a vector has a magnitude and a direction. For more, please refer to this article.

Converting to vectors enables us to use mathematical tools to solve such problems.

For example, we can represent the movie Dangal as a vector <1, 2, 3, 4, 5>, where each value corresponds to a column: 1 encodes the Lead Actor for Dangal, 2 the Genre, and so on.

This helps us compare two points easily, since comparing two vectors is straightforward.

How does k-NN work?

At a high level it’s a two step process:

  1. We look at the given dataset and find “k” data points that are similar to the query point. We will discuss similarity shortly.
  2. From this list of similar data points we do a majority vote.

So we need to define two things: similarity, and the k value (the number of neighbours that vote). We will look at both shortly.

k-NN Algorithm

For k-NN we divide the dataset into two parts: a training set and a test set. The division is random; for example, 80% training and 20% test.

The algorithm runs in two phases:

  1. Training Phase – In k-NN we don’t use the training data to build an explicit model. We simply store this data, and it is later used to find the nearest neighbours of a query point.
  2. Test Phase – The test data is used to find an optimal k value. To select k we experiment with different values; generally an odd k is chosen so that the majority vote cannot tie, for example k = 1, 3, 5, 7, and so on. For each k we measure accuracy on the test data, where accuracy is the ratio of correctly predicted labels to the total number of labels. To do this, we take each test point as input, predict its label, and validate it against the true label. (A small sketch of this k selection follows.)
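One way to carry out this k selection, as a rough sketch (the `knn_predict` helper is assumed to be any k-NN classifier, such as the one sketched later in this article):

```python
def choose_k(train_data, test_data, k_values=(1, 3, 5, 7)):
    """Try each candidate (odd) k and keep the one with the best test accuracy."""
    best_k, best_accuracy = None, -1.0
    for k in k_values:
        correct = 0
        for point, true_label in test_data:
            if knn_predict(train_data, point, k) == true_label:   # assumed k-NN classifier
                correct += 1
        accuracy = correct / len(test_data)                        # correct labels / total labels
        if accuracy > best_accuracy:
            best_k, best_accuracy = k, accuracy
    return best_k, best_accuracy
```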

I mentioned “Similarity” earlier. Let’s see what it is.

Similarity refers to the closeness between two objects. There are various ways to compute the distance between two points, and this distance helps us figure out whether the objects are similar. The one we use here is the Euclidean distance.

The Euclidean distance between points p and q is the length of the line segment connecting them.
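In coordinates, for n-dimensional points p = (p1, ..., pn) and q = (q1, ..., qn) this is:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²)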

Other ways to measure distance include the Manhattan distance, cosine similarity, Hamming distance, etc. To keep this article short and focused on k-NN, we will not get into the details of these measures.

Applying k-NN

Coming back to our movies dataset, let’s see how to apply what we have learnt so far.

A new movie starring Aamir Khan is coming and we would like to predict if it will be a Hit or not.

We follow these steps:

  • Use the training and test datasets to find a good k value. Then find movies that are similar to the new movie using the attributes, such as Genre, Director, etc. We use the Euclidean distance for this.
  • Do a majority vote. For example, if k is 3 and 2 of the similar movies were Hits, then this movie is also predicted to be a Hit (a compact sketch of this follows).
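A compact sketch of these two steps (the numeric movie vectors are illustrative encodings, not real data):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_data, query, k=3):
    """Find the k nearest labelled points and return the majority label."""
    by_distance = sorted(train_data, key=lambda item: euclidean(item[0], query))
    top_k_labels = [label for _, label in by_distance[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Illustrative encoded movies: (lead actor, genre, director, production) -> label
movies = [
    ((1, 2, 3, 4), "Hit"),
    ((1, 5, 6, 4), "Hit"),
    ((7, 8, 9, 10), "Flop"),
]
print(knn_predict(movies, (1, 2, 4, 4), k=3))   # -> Hit
```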

Limitations of k-NN

The space and time complexity of k-NN is O(N*D), where N is the number of data points and D is the number of dimensions or attributes. This means that for every new movie, it takes that much space and time just to predict its outcome, which is not good.

I hope this article gave you a basic idea about k-NN.

Related Concepts

  • Overfitting
  • Underfitting
  • Cross Validation – The data is divided into 3 parts: train, cross-validation, and test. We use the cross-validation data to find the value of k, and the test data is used only for the final evaluation. This gives a more reliable result.