The problem we are trying to solve is predicting the class of a new data point, given a dataset of pre-classified data points.
The two key ideas we will use here are the k-NN (k-nearest neighbors) algorithm and LSH (locality-sensitive hashing). If you are not familiar with these concepts, I suggest reading up on them first.
What is Cosine Similarity?
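At a high level, cosine similarity tells us how similar two points are. To compute it, we represent the two points as vectors and then find the angle between those vectors.
The similarity between vectors a and b is given by the cosine of the angle between them: cos(θ) = (a · b) / (‖a‖ ‖b‖). A value of 1 means the vectors point in the same direction, 0 means they are perpendicular, and -1 means they point in opposite directions.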
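To make the definition concrete, here is a minimal sketch of cosine similarity in plain Python. The function name cosine_similarity and the choice to raise an error for zero vectors are my own assumptions for illustration, not something prescribed by the method.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (sequences of numbers)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector has no direction, so we refuse to compare it.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Note that only the direction of the vectors matters here, not their length: scaling either vector by a positive number leaves the result unchanged.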
We can use this concept to calculate the hash value for a data point.
Now that we know what cosine similarity is, we can use it to calculate LSH values for data points. To do this, we divide the space using hyperplanes.
Refer to the image below to understand each of the points explained next.
For simplicity, consider a 2-D space with X and Y axes. We can divide it into 4 regions using 2 planes (lines, in 2-D) L1 and L2.
A data point “A” will reside in one of these regions. For each plane, we can find which side of the plane “A” lies on by using the plane's normal vector: the sign of the dot product between “A” and the normal tells us the side.
This gives us one value per plane, either +1 or -1. Combining these values gives the hash key for the point.
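The steps above can be sketched as follows. This is only an illustration under a few assumptions of mine: the planes pass through the origin, the normal vectors in normals are made-up values standing in for L1 and L2, and the +1/-1 values are combined into a key by treating +1 as bit 1 and -1 as bit 0 and reading the bits as a binary number. The names side_of_plane and hash_key are hypothetical.

```python
def side_of_plane(point, normal):
    """Return +1 if the point lies on the side the normal points to, else -1."""
    dot = sum(p * n for p, n in zip(point, normal))
    return 1 if dot >= 0 else -1

def hash_key(point, normals):
    """Combine the per-plane signs into one integer key (assumed encoding)."""
    key = 0
    for normal in normals:
        bit = 1 if side_of_plane(point, normal) == 1 else 0
        key = (key << 1) | bit  # append this plane's bit to the key
    return key

# Hypothetical normal vectors for the two lines L1 and L2.
normals = [(1.0, 1.0), (-1.0, 1.0)]

key_for_a = hash_key((-2.0, 0.0), normals)
```

With 2 planes there are 4 possible keys (0 to 3), one per region. In general, n planes give up to 2^n buckets.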
Once the hash table is in place, we can use it to determine the key for a new data point and then find its nearest neighbors.
Say the new point lands in the bucket with key = 1. Then we know it is near the points A and B. Next, apply k-NN within that bucket to find its classification.
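Putting it together, here is a rough end-to-end sketch: build the hash table, look up the bucket for a new point, and run k-NN only inside that bucket. Everything specific here is an assumption for illustration: the sample points and labels are made up, the normals are hypothetical, neighbors are ranked by cosine similarity, and the class is chosen by majority vote. The function names build_table and classify are my own.

```python
import math
from collections import Counter, defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hash_key(point, normals):
    """One bit per plane: 1 if the point is on the normal's side, else 0."""
    key = 0
    for normal in normals:
        key = (key << 1) | (1 if dot(point, normal) >= 0 else 0)
    return key

def build_table(points, labels, normals):
    """Group the pre-classified points into buckets by hash key."""
    table = defaultdict(list)
    for point, label in zip(points, labels):
        table[hash_key(point, normals)].append((point, label))
    return table

def classify(query, table, normals, k=3):
    """Run k-NN only over the points sharing the query's bucket."""
    bucket = table.get(hash_key(query, normals), [])
    if not bucket:
        return None  # assumption: no prediction when the bucket is empty
    nearest = sorted(
        bucket,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,  # highest similarity first
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: three points fall in bucket 1, two in bucket 2.
normals = [(1.0, 1.0), (-1.0, 1.0)]
points = [(-3.0, 0.5), (-2.0, -0.5), (-4.0, 1.0), (3.0, 0.5), (2.0, -1.0)]
labels = ["red", "red", "blue", "green", "green"]

table = build_table(points, labels, normals)
prediction = classify((-2.5, 0.2), table, normals, k=3)
```

The benefit is that k-NN compares the new point only against the points in one bucket instead of the whole dataset. The trade-off is that true neighbors sitting just across a plane are missed, which is why practical setups often use several hash tables with different random planes.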