In k-NN we have a labeled dataset and we want to use it to predict an unlabelled query point. Stay with me if you don’t get this.
Query point is a similar object that has not yet been classified.
To understand k-NN let us consider a simple dataset of Bollywood Movies.
|Name||Lead Actor||Genre||Director||Production||Box Office|
|Dangal||Aamir Khan||Sports||Nitish||Aamir Khan||Hit|
|Badhai Ho||Ayushman||Comedy||Amit||Amit Sharma||Hit|
|Namaste England||Arjun||Romantic||Vipul Shah||Pen India||Flop|
In this dataset we have box-office as the label. Based on the attribute values a movie is classified as a Hit or a Flop. Of course there will be lot of other factors but for ease of understanding we have taken less features.
k-NN helps us predict if a new movie will be a Hit or a Flop, using this data.
A given movie in this dataset can be represented as a Vector. Vectors are used to represent a point in space. A vector has a magnitude and direction. For more please refer to this article.
Converting to vector enables us to use the mathematical approach to solve such problems.
For example we can represent the move Dangal as a Vector <1, 2, 3, 4, 5> where each value corresponds to a column. 1 represents Lead Actor for Dangal, 2 – Genre etc.
This helps us to compare two points easily. Since comparing two Vectors is easy.
How k-NN works?
At a high level it’s a two step process:
- We look at the given dataset and find “k” data points that are similar to the query point. We will discuss similarity shortly.
- From this list of similar data points we do a majority vote.
So we need to define two things – similarity and k value (number of neighbors which can vote). We will look at these two shortly.
For k-NN we divide the dataset into two parts. Training and Test dataset. The division is random. An example can be 80% training and 20% test.
It runs in two phases
- Training Phase – In k-NN we don’t use the training data to come up with a model. We feed this data to algorithm. This data is used to find the nearest neighbors.
- Test Phase – The Test data is used to find an optimal k value. For selecting k we need to experiment with different k values. Generally an odd k value is selected to help with majority voting. Like we can experiment with the following k values – 1,3,5,7 etc. For each k value we find accuracy using the test data. Accuracy is defined as the ratio of number of correct labels to total number of labels used. For this we provide as input a given data point find the corresponding label and then validate.
I mentioned “Similarity” earlier. Let’s see what it is.
Similarity refers to the closeness between two objects. There are various ways to compute distance between any two given points. Distance between two points can help us figure out if the objects are similar. The one we use here is Euclidean distance.
The Euclidean distance between points p and q is the length of the line segment connecting them.
Other ways to calculate distance includes Manhattan Distance, Cosine Similarity, Hamming Distance etc. To keep this article short and focussed on k-NN we are not going to get into the details of these distances.
Coming back to our Movies dataset let’s see how to apply what we learnt so far.
A new movie starring Aamir Khan is coming and we would like to predict if it will be a Hit or not.
We follow these steps:
- Use the train data set and test data set to find a k-value. Find movies that are similar to the new movie using the attributes. Like Genre, Director etc. We use Euclidean distance for this.
- Do a majority vote. Like if k value is 3 and 2 similar movies have been Hit then this will also be Hit.
Limitations of k-NN
The space and time complexity for k-NN is O(N*D) where N is the number of data points with D-dimensions or attributes. This means given a new movie it will take this much space and time to predict it’s outcome. Which is not good.
I hope this article gave you a basic idea about k-NN.
- Cross Validation – Data is divided into 3 parts. Train, Cross Validate and Test. We use cross validate data to find the value of k. And test data is used only for testing purpose. This gives a better outcome.