A k-d tree is a binary-tree-based data structure used to organise k-dimensional data.
What do we mean by k-dimensional?
You may have heard of 2-D and 3-D space. Similarly, data can live in higher-dimensional spaces; k is simply the number of dimensions.
Why do we need k-d tree?
Using a k-d tree we can partition a k-dimensional space into regions, which allows for efficient searching. It is used in computer graphics and in nearest-neighbour searches.
How does it work?
Let us consider a 2-D dataset. A point in it can be represented as <X, Y>.
Constructing a k-d tree
We follow these steps to construct the tree.
1. Select a dimension and project all points along it. Example: project all points along the X-axis.
2. Take the mean of the projected X values. Let's call it X_M. (Many implementations use the median instead, which keeps the tree balanced.)
3. Split on this value: if a point's X coordinate is less than X_M it goes to the left subtree, otherwise to the right.
4. Recurse on each subtree, cycling through the dimensions (X, then Y, then X again, and so on).
By constructing this tree we have partitioned the space into smaller regions. Given a new query point, we can traverse the tree to find the region it belongs to.
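The construction steps above can be sketched in Python. This is a minimal illustration, not a production implementation: build_kdtree is a hypothetical helper name, and it uses the median-split variant for balance (the mean-split version described above works the same way).

```python
def build_kdtree(points, depth=0):
    """Recursively partition 2-D points into a nested dict tree."""
    if not points:
        return None
    axis = depth % 2                      # cycle through X (0) and Y (1)
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median split keeps the tree balanced
    return {
        "point": points[mid],             # splitting point for this region
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"])  # the root's point splits the plane along X
```

Traversing this structure with a query point (go left or right at each node depending on the splitting coordinate) quickly narrows the search down to one region.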
We have seen the basic k-NN algorithm. You may want to read that section first to better understand this concept.
Using k-NN we try to predict the class of a query point "Q" (an unclassified data point). To do this we find its neighbours, i.e. points that are similar or close to the query point, and then take a majority vote.
But we are missing something here. Suppose we have 5 neighbours. Shouldn't the points that are closer to Q be given more importance?
That's exactly what weighted k-NN does. We assign a weight to each data point based on its distance from Q. One common choice is the inverse of the distance, i.e. weight = 1/distance.
Consider the following neighbours of Q. Here we have added a new column, Weight = 1/Distance:

Neighbour   Label      Distance   Weight
A           Positive   0.1        10
B           Negative   0.2        5
C           Positive   0.4        2.5
D           Negative   0.5        2
Now that we have weights assigned to each neighbour, we can use weighted k-NN to predict the label for Q. To do this we calculate a positive sum and a negative sum.
First, the sum of weights for the positive labels, A and C: 10 + 2.5 = 12.5.
Then, the sum of weights for the negative labels, B and D: 5 + 2 = 7.
Since the positive sum (12.5) is greater than the negative sum (7), we predict the label Positive for Q.
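The weighted vote above can be written in a few lines of Python, using the neighbour weights from the example:

```python
# Weights and labels for the four example neighbours of Q.
weights = {"A": 10.0, "B": 5.0, "C": 2.5, "D": 2.0}
labels = {"A": "Positive", "B": "Negative", "C": "Positive", "D": "Negative"}

# Sum the weights per class instead of counting one vote per neighbour.
positive_sum = sum(w for n, w in weights.items() if labels[n] == "Positive")
negative_sum = sum(w for n, w in weights.items() if labels[n] == "Negative")

prediction = "Positive" if positive_sum > negative_sum else "Negative"
print(positive_sum, negative_sum, prediction)  # 12.5 7.0 Positive
```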
In k-NN we have a labelled dataset and we want to use it to predict the label of an unlabelled query point. Stay with me if you don't get this yet.
A query point is a similar object that has not yet been classified.
To understand k-NN let us consider a simple dataset of Bollywood Movies.
In this dataset, box-office result is the label: based on its attribute values, a movie is classified as a Hit or a Flop. Of course there would be many other factors, but for ease of understanding we have kept only a few features.
k-NN helps us predict if a new movie will be a Hit or a Flop, using this data.
A given movie in this dataset can be represented as a vector. A vector represents a point in space and has a magnitude and a direction. For more, please refer to this article.
Converting movies to vectors enables us to use a mathematical approach to solve such problems.
For example, we can represent the movie Dangal as the vector <1, 2, 3, 4, 5>, where each value corresponds to a column: 1 represents the Lead Actor for Dangal, 2 the Genre, and so on.
This makes it easy to compare two movies, since comparing two vectors is straightforward.
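To make this concrete, here is a small sketch. The integer codes below are purely illustrative (the article does not specify the actual encoding of each column):

```python
# Hypothetical integer encoding of the movie columns described above.
dangal = (1, 2, 3, 4, 5)      # Lead Actor, Genre, etc., encoded as numbers
new_movie = (1, 4, 2, 3, 5)   # another movie encoded with the same scheme

# Once both movies are vectors, we can compare them component by component.
differences = [abs(a - b) for a, b in zip(dangal, new_movie)]
print(differences)  # [0, 2, 1, 1, 0]
```

Smaller component differences mean the two movies agree on more attributes, which is the intuition behind the distance measures used later.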
How does k-NN work?
At a high level it’s a two step process:
1. Look at the given dataset and find the "k" data points that are most similar to the query point. (We will discuss similarity shortly.)
2. Take a majority vote among this list of similar data points.
So we need to define two things: similarity, and the value of k (the number of neighbours that get to vote). We will look at both shortly.
For k-NN we divide the dataset randomly into two parts: a training set and a test set, for example 80% training and 20% test.
The algorithm runs in two phases.
Training Phase – In k-NN we don't use the training data to build a model. We simply feed this data to the algorithm; it is used later to find the nearest neighbours.
Test Phase – The test data is used to find an optimal k value. To select k we experiment with different values, generally odd ones so that the majority vote cannot tie, e.g. 1, 3, 5, 7, and so on. For each k we measure accuracy on the test data: we take each test point, predict its label, and validate the prediction against the true label. Accuracy is the ratio of the number of correct predictions to the total number of predictions made.
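The two phases can be sketched as follows. The tiny 2-D dataset, the helper names (predict, accuracy), and the candidate k values are all made up for illustration; Euclidean distance is defined inline since it is introduced just below.

```python
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def predict(train, point, k):
    """Majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(predict(train, p, k) == label for p, label in test)
    return correct / len(test)

# Toy "training" and "test" data: (point, label) pairs.
train = [((1, 1), "Hit"), ((1, 2), "Hit"), ((5, 5), "Flop"), ((6, 5), "Flop")]
test = [((1, 1.5), "Hit"), ((5.5, 5), "Flop")]

# Try odd k values and keep the one with the best test accuracy.
best_k = max([1, 3], key=lambda k: accuracy(train, test, k))
```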
I mentioned “Similarity” earlier. Let’s see what it is.
Similarity refers to the closeness between two objects. There are various ways to compute the distance between two points, and that distance tells us how similar the objects are: the smaller the distance, the more similar the points. The one we use here is Euclidean distance.
The Euclidean distance between points p and q is the length of the line segment connecting them.
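In code, this is the square root of the sum of squared coordinate differences (a minimal sketch; the function name is our own):

```python
import math

def euclidean_distance(p, q):
    """Length of the line segment connecting points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0, the classic 3-4-5 triangle
```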
Other distance and similarity measures include Manhattan distance, cosine similarity, Hamming distance, etc. To keep this article short and focussed on k-NN we will not go into their details.
Coming back to our movies dataset, let's see how to apply what we have learnt so far.
A new movie starring Aamir Khan is coming out, and we would like to predict whether it will be a Hit or not.
We follow these steps:
1. Use the training and test datasets to find a k value.
2. Find the k movies most similar to the new movie using its attributes, such as Genre, Director, etc. We use Euclidean distance for this.
3. Take a majority vote. For example, if k is 3 and 2 of the similar movies were Hits, we predict a Hit.
Limitations of k-NN
The space and time complexity of k-NN is O(N*D), where N is the number of data points and D the number of dimensions (attributes). This means that for every new movie we must store and scan the entire dataset to predict its outcome, which is expensive.
I hope this article gave you a basic idea about k-NN.
Cross Validation – The data is divided into three parts: train, cross-validation, and test. We use the cross-validation data to find the value of k, and the test data only for the final evaluation. This gives a more reliable estimate of performance.
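A three-way split can be sketched like this. The 60/20/20 proportions and the placeholder data are illustrative choices, not prescribed by the article:

```python
import random

data = list(range(100))   # placeholder records standing in for the dataset
random.seed(0)            # fixed seed so the shuffle is reproducible
random.shuffle(data)

n = len(data)
train = data[: int(0.6 * n)]              # used to find neighbours
cv = data[int(0.6 * n): int(0.8 * n)]     # used to pick k
test = data[int(0.8 * n):]                # used only for the final accuracy

print(len(train), len(cv), len(test))  # 60 20 20
```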