An introduction to NumPy

What is NumPy?

NumPy provides efficient data structures for working with arrays in Python. These data structures form the core of the NumPy library. NumPy is primarily used for scientific computation. All elements in a NumPy array are of the same type.

Why use NumPy?

The core of NumPy is implemented in C and is therefore very efficient. Using the data structures provided by NumPy can substantially improve performance over plain Python lists. NumPy is a very important part of the scientific Python ecosystem.

How do Python lists differ from NumPy arrays?

A Python list is a heterogeneous collection of elements. In a NumPy array, by contrast, all elements have the same data type and the array has a fixed size. In addition, NumPy provides a large set of functions to work with these data structures.

How to use NumPy?

To get started, import the NumPy library using:
import numpy as np

What is the ndarray class in NumPy?

It is the main class for representing a multidimensional array. In addition to the element values, it also stores metadata such as the data type, shape, and size.
One way to create an ndarray is:
Y = np.array([1, 2, 3, 4, 4, 5])

We can list all of its attributes via the np.ndarray documentation (for example, help(np.ndarray)).

The following attributes are provided as part of ndarray:

  • shape : the dimensions of the array, e.g. (2, 3)
  • size : the total number of elements
  • dtype : the data type of the elements in the array
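
As a quick illustration (a minimal sketch):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)  # (2, 3)
print(a.size)   # 6
print(a.dtype)  # int64 (the exact integer type is platform dependent)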

For numerical work the most important data types are int (for integers), float (for floating-point numbers), and complex (for complex floating-point numbers).
To create an array of float elements we can write
np.array([1, 2, 3], dtype=np.float64)

Note: older code often uses dtype=np.float, but that alias has been removed from recent NumPy versions; use float or np.float64 instead.

To work with the complex data type:
data = np.array([1, 2, 3], dtype=complex)

We can display it by simply evaluating
data

or get the real and imaginary parts using the real and imag attributes:
data.real
data.imag

In NumPy the default memory layout for multidimensional arrays is row-major (C order).

How to create an ndarray?

Using np.array is the most basic way to create an ndarray. In practice, though, there are other requirements, like reading data from a file into an ndarray, which need to be handled differently. The NumPy library provides a rich set of functions for this.

Let’s look at some of the functions

  • np.array : create an array from a Python list. Ex. a = np.array([34, 44, 54]) creates a one-dimensional array.
  • np.zeros : array filled with 0s
  • np.ones : array filled with 1s
  • np.fromfile : read data from a file.
  • np.random.rand : generates an array of random numbers uniformly distributed between 0 and 1. Other distributions are available in the np.random module.
  • np.full : create an array filled with a given value. Ex. a = np.full(5, 2)
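
A minimal sketch of a few of these constructors (shapes and values shown in the comments):

import numpy as np

z = np.zeros((2, 3))      # 2x3 array of 0.0
o = np.ones(4)            # array([1., 1., 1., 1.])
f = np.full(5, 2)         # array([2, 2, 2, 2, 2])
r = np.random.rand(2, 2)  # 2x2 array of uniform random values in [0, 1)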


The NumPy library provides two functions for creating a range of evenly spaced values.

For example, if we need a sequence like 2, 4, 6, there are two ways:

  • np.arange(start, end, increment) : the end value is excluded
  • np.linspace(start, end, no_of_elements) : the end value is included by default

np.logspace can be used to distribute elements logarithmically.
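
For instance, a quick sketch of all three:

import numpy as np

np.arange(2, 8, 2)    # array([2, 4, 6]) - the end value 8 is excluded
np.linspace(2, 6, 3)  # array([2., 4., 6.]) - the end value 6 is included
np.logspace(0, 2, 3)  # array([1., 10., 100.]) - powers of 10 from 10^0 to 10^2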

What is a Meshgrid Array?

For generating multidimensional coordinate grids we can use np.meshgrid.

Ex:
X = np.array([2,3,4])
Y = np.array([5,6,7])
a,b = np.meshgrid(X,Y)

Output:
a = array([[2, 3, 4],
           [2, 3, 4],
           [2, 3, 4]])
b = array([[5, 5, 5],
           [6, 6, 6],
           [7, 7, 7]])

np.empty is also handy if we just want to allocate an array without initializing it, which can save some time. But its initial contents are arbitrary, so unless every element will be overwritten before use, np.zeros is the safer choice.

What is Slicing?

Let’s talk about 1-D arrays first. Slicing can be used to select a range of elements. We can use negative integers to index from the end of the array, e.g. x = a[-2] gives the second-to-last element.

Look at the following slice examples:

  • a[m:n] selects elements in the array starting at m and ending at (n-1).
  • a[m:n:2] selects elements in the array starting at m and ending at (n-1) in increments of 2.
  • a[::-1] selects all elements in reverse order.
  • a[-5:] selects the last 5 elements.

Example :
In : a = np.arange(11)
In : a[1:-1:2]
Out : array([1, 3, 5, 7, 9])


Let’s talk about multi-dimensional arrays. In this case we can apply the slicing operation along each axis. Let’s use a lambda function with np.fromfunction to build a 6*6 array.

In : f = lambda m, n: n + 10 * m
In : a = np.fromfunction(f, (6, 6), dtype=int)
In : a
Out : array([[ 0,  1,  2,  3,  4,  5],
             [10, 11, 12, 13, 14, 15],
             [20, 21, 22, 23, 24, 25],
             [30, 31, 32, 33, 34, 35],
             [40, 41, 42, 43, 44, 45],
             [50, 51, 52, 53, 54, 55]])

Look at the following slice examples:
a[:, 1] gives us the second column, i.e. [1, 11, 21, 31, 41, 51]
a[2:, :2] gives the first two columns of every row from the third row down:
array([[20, 21],
       [30, 31],
       [40, 41],
       [50, 51]])

What is Reshaping and Resizing?

Sometimes rearranging arrays can be helpful, like flattening an N*N matrix into a vector of size N^2.

Some of the functions that NumPy provides to reshape an ndarray are:

  • np.ndarray.flatten : collapse all dimensions into one, returning a new 1-D array (always a copy).
  • np.reshape : reshape an n-dimensional array.
  • np.squeeze : removes axes of length 1.
  • np.ravel : similar to flatten, but returns a view of the original data where possible instead of a copy.
  • np.transpose : reverse (or permute) the axes of an array.

Example

In : data = np.array([[1, 2], [3, 4]])
In : np.reshape(data, (1, 4))  # returns a new array
Out : array([[1, 2, 3, 4]])
In : data.reshape(4)  # also returns a new array (a view); data itself is unchanged
Out : array([1, 2, 3, 4])
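
A small sketch illustrating the copy-versus-view distinction between flatten and ravel:

import numpy as np

data = np.array([[1, 2], [3, 4]])
f = data.flatten()  # always a copy
r = data.ravel()    # a view when possible
r[0] = 99           # modifying the view ...
print(data[0, 0])   # ... changes the original: 99
print(f[0])         # the copy is unaffected: 1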

Generally the NumPy library gives two options: operations that return a new array and operations that modify an existing array in place.

Arithmetic Operations on Matrices

We can perform standard arithmetic operations on matrices.

Ex
In : x = np.array([[1, 2], [3, 4]])
In : y = np.array([[5, 6], [7, 8]])
In : x + y
Out: array([[ 6,  8],
            [10, 12]])

If we multiply a matrix by a scalar, the operation is applied to every element of the matrix. For example, 2 * x doubles each element.

When we apply an arithmetic operation we get a new array. This increases the memory footprint and can degrade performance. It’s better to use in-place operations in such cases.

Like
x = x + y  # uses __add__ and allocates a new array
x += y     # uses __iadd__, which operates in place

Refer to – https://stackoverflow.com/questions/4772987/how-are-python-in-place-operator-functions-different-than-the-standard-operator

Similar behavior is observed for other mathematical operations like trigonometric and logarithmic functions: they are applied element-wise (vectorized) and we get a new array as a result.
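
For example, a quick sketch using NumPy's element-wise functions:

import numpy as np

x = np.array([0.0, np.pi / 2, np.pi])
np.sin(x)      # array([0., 1., 0.]) up to floating-point error
np.exp(x)      # e raised to each element
np.log(x + 1)  # natural log, applied element-wise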

Aggregate Functions

We can perform aggregate operations with NumPy, like taking the sum of all array elements or finding the mean, median, standard deviation, etc. We can also specify the axis along which the operation should be performed.

Like a.sum(axis=0)
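
A short sketch of how the axis argument works:

import numpy as np

a = np.array([[1, 2], [3, 4]])
a.sum()         # 10 - sum of all elements
a.sum(axis=0)   # array([4, 6]) - sum down each column
a.sum(axis=1)   # array([3, 7]) - sum across each row
a.mean(axis=0)  # array([2., 3.])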

Set Operations

NumPy also provides the capability to work with sets. A set stores unique, unordered elements, and NumPy offers a group of functions that treat arrays as sets.
For example, np.unique can be used to get the unique elements:
a = np.unique([1, 2, 3, 3])
We can also perform union, intersection, and other set operations between two arrays, as shown below.
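
A minimal sketch of these set functions:

import numpy as np

x = np.array([1, 2, 3, 3])
y = np.array([3, 4, 5])
np.unique(x)          # array([1, 2, 3])
np.union1d(x, y)      # array([1, 2, 3, 4, 5])
np.intersect1d(x, y)  # array([3])
np.setdiff1d(x, y)    # array([1, 2]) - elements of x not in y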

Summary

NumPy is a core library for computing with Python that provides a foundation for nearly all computational libraries for Python. Familiarity with the NumPy library and its usage patterns is a fundamental skill for using Python for scientific and technical computing.

Locality Sensitive Hashing using Euclidean Distance

It’s quite similar to Locality Sensitive Hashing (LSH) for cosine similarity, which we covered earlier. I will be referring to that article here, so it’s better to go through it before proceeding.

The difference lies in the way we compute the hash value. As we have seen, we can divide the region using planes, and each region can contain data points.

Follow these steps (refer to the diagram below):

[Figure: LSH for Euclidean]

1. Divide the plane into small parts.
2. Project each data point onto the planes.
3. For each data point, take the distance along each plane (where its projection falls) and use it to calculate the hash value.

The rest of the procedure, like finding the nearest neighbours, is similar to the cosine similarity process.
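
A minimal sketch of this idea in the common formulation, where each point is projected onto random lines and the projection is cut into fixed-width buckets. The random directions, offsets, and bucket width w here are illustrative assumptions:

import numpy as np

def euclidean_lsh_key(x, planes, offsets, w):
    # bucket index of the point's projection along each random line
    buckets = np.floor((planes @ x + offsets) / w).astype(int)
    return tuple(buckets)  # the combined bucket indices form the hash key

rng = np.random.default_rng(0)
n_planes, dim, w = 3, 2, 1.0
planes = rng.normal(size=(n_planes, dim))   # one random direction per line
offsets = rng.uniform(0, w, size=n_planes)  # random shift per line

point = np.array([0.5, 1.2])
print(euclidean_lsh_key(point, planes, offsets, w))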

Locality Sensitive Hashing using Cosine Similarity

The problem we are trying to solve is to predict the class of a new data point, given a dataset with pre-classified data points.

Two key ideas we will use here are the k-NN algorithm and LSH. If you don’t know about these concepts, I suggest you check them out first.

What is Cosine Similarity?
At a high level, cosine similarity tells us how similar two points are. To compute it, we take the vector representations of the two points and find the angle between the vectors.

The similarity between vectors a and b is given by the cosine of the angle between them: cos(θ) = (a · b) / (|a| |b|).

We can use this concept to calculate the hash value for a data point.

Now that we know cosine similarity we can use this to calculate LSH values for data points. To do this we divide the space using hyperplanes.

Refer to the image below to understand each of the points explained next.

[Figure: Hashing using Cosine Similarity]

For simplicity, consider a 2-D space with X and Y axes. We can divide it into 4 regions by using 2 planes (lines) L1 and L2.

A data point “A” will then reside in one of these regions. For each plane we can find on which side point “A” lies by using the plane’s normal vector: the sign of the dot product between the point and the normal tells us the side.

This way we get a value for each plane, either +1 or -1, and we can combine these values to calculate the hash key.
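
A minimal sketch of this hashing scheme; the random hyperplane normals are illustrative assumptions:

import numpy as np

def cosine_lsh_key(x, normals):
    signs = normals @ x > 0                 # which side of each plane? (+1 side -> True)
    bits = signs.astype(int)                # True -> 1, False -> 0
    return int("".join(map(str, bits)), 2)  # pack the bits into an integer bucket key

rng = np.random.default_rng(1)
normals = rng.normal(size=(2, 2))  # normals of the two planes L1 and L2

A = np.array([1.0, 2.0])
print(cosine_lsh_key(A, normals))  # bucket key for point A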

Once we have the hash table in place, we can use it to determine the key for a new data point and then find its nearest neighbours.

Say the new point lands in the bucket with key = 1. Then we know it’s near points A and B. Next we apply k-NN to find its classification.

What is a k-d tree?

A k-d tree is a binary-tree-based data structure. It can be used for k-dimensional data.

What do we mean by k-dimensional?
You may have heard of 2-D and 3-D; similarly, we have higher-dimensional spaces.

Why do we need k-d tree?
Using a k-d tree we can partition a k-dimensional space into regions, which allows for efficient searching. It’s used in computer graphics and nearest neighbour searches.

How it works?
Let us consider a 2-D dataset. A point in this can be represented as <X,Y>

Constructing a k-d tree

We follow these steps to construct the tree.

1. Select a dimension and project all points along it. Example: project all points along the X-axis.
2. Take the mean of the projected values along the X-axis. Let’s call it X_M.
3. Split using the mean: if a point’s X value X1 is < X_M, it goes to the left sub-tree, else to the right.
4. Repeat steps 1-3 recursively on each half, cycling through the dimensions level by level.

By constructing this tree we have partitioned the space into smaller regions. Given a new query point we can traverse the tree to find the appropriate region.
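
A minimal sketch of this construction, following the mean-split rule described above; the Node layout and leaf size are illustrative assumptions:

import numpy as np

class Node:
    def __init__(self, split_dim=None, split_value=None, left=None, right=None, points=None):
        self.split_dim = split_dim      # dimension used for the split
        self.split_value = split_value  # mean along that dimension
        self.left, self.right = left, right
        self.points = points            # set only at leaves

def build_kdtree(points, depth=0, leaf_size=2):
    if len(points) <= leaf_size:
        return Node(points=points)
    d = depth % points.shape[1]  # cycle through the dimensions level by level
    mean = points[:, d].mean()   # split at the mean along dimension d
    left_mask = points[:, d] < mean
    if left_mask.all() or not left_mask.any():  # degenerate split: stop here
        return Node(points=points)
    return Node(d, mean,
                build_kdtree(points[left_mask], depth + 1, leaf_size),
                build_kdtree(points[~left_mask], depth + 1, leaf_size))

pts = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0], [4.0, 7.0], [8.0, 1.0]])
tree = build_kdtree(pts)
print(tree.split_dim, tree.split_value)  # root splits on X at the mean of the X values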

What is Locality Sensitive Hashing?

It’s a beautiful technique for finding similar points, i.e. nearest neighbours. It uses the concept of hashing for this.

You may be familiar with how a hash table is constructed; if not, please follow this link for a quick refresher on hashing concepts. A hash table is a very efficient data structure that allows us to perform lookups in O(1) average time.

How it Works?

We apply a hash function to each available data point. This gives us the bucket, or key, where the point will be placed.

Unlike a regular hash table, here we try to maximize collisions, so that similar points go to the same bucket.

Now if we get an unlabelled query point and apply the hash function, it will go to the same bucket where similar points already exist.

This makes it easy to find the nearest neighbours of the query point, and subsequently we can apply k-NN to predict its class.

What is the Weighted k-NN Algorithm?

We have seen the basic k-NN algorithm. You may want to read that first to better understand this concept.

Using k-NN we try to predict the class of a query point “Q” (an unclassified data point). To do this we use the concept of neighbours, points that are similar or close to the query point, and then take a majority vote.

But we miss something here. Suppose we have 5 neighbours. Don’t you think the points that are closer to Q should be given more importance?

That’s what we do here. We assign weights to the data points based on their distance from Q. One way to calculate the weight is to take the reciprocal of the distance, i.e. weight = 1/distance.

Consider the following neighbours of Q. Here we have added a new column Weight = 1/Distance

Point   Label      Distance   Weight
A       Positive   0.1        10
B       Negative   0.2        5
C       Positive   0.4        2.5
D       Negative   0.5        2

Now that we have weights assigned to each neighbour we can use weighted k-NN to predict the Label for Q. To do this we calculate positive and negative sums.

Positive Sum
We calculate sum of weights for positive labels i.e. A,C : 10+2.5 = 12.5

Negative Sum
We calculate sum of weights for negative labels i.e. B,D : 5+2 = 7

Since Positive Sum > Negative Sum we can say that for Q the label will be Positive.
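
A minimal sketch of this weighted vote; the neighbour list mirrors the table above:

# each neighbour of Q as (label, distance)
neighbours = [("Positive", 0.1), ("Negative", 0.2),
              ("Positive", 0.4), ("Negative", 0.5)]

votes = {}
for label, distance in neighbours:
    votes[label] = votes.get(label, 0) + 1 / distance  # weight = 1/distance

print(votes)                      # {'Positive': 12.5, 'Negative': 7.0}
print(max(votes, key=votes.get))  # Positive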

Time Based Splitting of Data

For a machine learning algorithm we split the given dataset into training and test data. We can either split randomly or use time-based splitting.

For time-based splitting we need a timestamp as one of the attributes / features.

For example, an e-commerce website can have reviews for various products, and these reviews carry timestamps. In such scenarios it’s better to use the time-based strategy.

To do this we first sort the reviews by timestamp and then do the split:

  1. Sort data by time.
  2. Split – Training (80%) and Testing(20%)

This approach can give a more realistic accuracy estimate, since the model is tested on the most recent data, which is closest to what it will see when making future predictions.
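
A minimal sketch of such a split; the timestamps and the 80/20 ratio are illustrative:

import numpy as np

timestamps = np.array([1200, 1100, 1400, 1300, 1500])  # e.g. review times
labels = np.array([0, 1, 1, 0, 1])

order = np.argsort(timestamps)  # 1. sort data by time
split = int(0.8 * len(order))   # 2. split point at 80%
train_idx, test_idx = order[:split], order[split:]

y_train, y_test = labels[train_idx], labels[test_idx]
print(len(y_train), len(y_test))  # 4 1 - the test set holds the most recent data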

What is the k-nearest neighbors algorithm?

In k-NN we have a labelled dataset and we want to use it to predict the class of an unlabelled query point. Stay with me if you don’t get this yet.

A query point is a similar object that has not yet been classified.

To understand k-NN let us consider a simple dataset of Bollywood Movies.

Name              Lead Actor   Genre      Director     Production    Box Office
Dangal            Aamir Khan   Sports     Nitish       Aamir Khan    Hit
Badhai Ho         Ayushman     Comedy     Amit         Amit Sharma   Hit
Namaste England   Arjun        Romantic   Vipul Shah   Pen India     Flop

In this dataset we have the box office result as the label: based on the attribute values, a movie is classified as a Hit or a Flop. Of course there will be lots of other factors, but for ease of understanding we have taken only a few features.

k-NN helps us predict if a new movie will be a Hit or a Flop, using this data.

A given movie in this dataset can be represented as a vector. Vectors are used to represent a point in space; a vector has a magnitude and a direction. For more, please refer to this article.

Converting to vector enables us to use the mathematical approach to solve such problems.

For example, we can represent the movie Dangal as a vector <1, 2, 3, 4, 5>, where each value corresponds to a column: 1 encodes the Lead Actor for Dangal, 2 the Genre, and so on.

This helps us compare two points easily, since comparing two vectors is straightforward.

How k-NN works?

At a high level it’s a two step process:

  1. We look at the given dataset and find “k” data points that are similar to the query point. We will discuss similarity shortly.
  2. From this list of similar data points we do a majority vote.

So we need to define two things – similarity and k value (number of neighbors which can vote). We will look at these two shortly.

k-NN Algorithm

For k-NN we divide the dataset into two parts: a training dataset and a test dataset. The division is random; an example split is 80% training and 20% test.

It runs in two phases

  1. Training Phase – In k-NN we don’t use the training data to build an explicit model. We simply feed this data to the algorithm, and it is used to find the nearest neighbours.
  2. Test Phase – The test data is used to find an optimal k value. For this we experiment with different k values; generally an odd k is chosen to avoid ties in the majority vote, e.g. 1, 3, 5, 7. For each k value we measure accuracy on the test data, where accuracy is the ratio of correctly predicted labels to the total number of predictions: we provide a data point as input, find the predicted label, and validate it against the true label.

I mentioned “Similarity” earlier. Let’s see what it is.

Similarity refers to the closeness between two objects. There are various ways to compute the distance between two given points, and this distance helps us figure out whether the objects are similar. The one we use here is Euclidean distance.

The Euclidean distance between points p and q is the length of the line segment connecting them: for p = (p1, ..., pn) and q = (q1, ..., qn), d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2).

Other ways to calculate distance include Manhattan distance, cosine similarity, Hamming distance, etc. To keep this article short and focused on k-NN we are not going to get into the details of these distances.

Applying k-NN

Coming back to our Movies dataset let’s see how to apply what we learnt so far.

A new movie starring Aamir Khan is coming and we would like to predict if it will be a Hit or not.

We follow these steps:

  • Use the training and test datasets to find a k value. Then find movies that are similar to the new movie using the attributes, like Genre, Director, etc. We use Euclidean distance for this.
  • Take a majority vote. For example, if k is 3 and 2 of the similar movies were Hits, then we predict a Hit. A sketch of the whole procedure follows this list.
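
A minimal sketch of k-NN with Euclidean distance; the encoded feature vectors and labels are illustrative assumptions, not real movie data:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# hypothetical encoded movies: [lead_actor, genre, director, production]
X_train = np.array([[1, 1, 1, 1],
                    [2, 2, 2, 2],
                    [3, 3, 3, 3],
                    [1, 2, 1, 1]])
y_train = ["Hit", "Hit", "Flop", "Hit"]

new_movie = np.array([1, 1, 2, 1])  # the upcoming movie, encoded the same way
print(knn_predict(X_train, y_train, new_movie, k=3))  # Hit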

Limitations of k-NN

The space and time complexity of k-NN is O(N*D), where N is the number of data points and D the number of dimensions (attributes). This means that for every new movie it takes that much space and time to predict its outcome, which does not scale well.

I hope this article gave you a basic idea about k-NN.

Related Concepts

  • Overfitting
  • Underfitting
  • Cross Validation – Data is divided into 3 parts: train, cross-validation, and test. We use the cross-validation data to find the value of k, and the test data is used only for final testing. This gives a more reliable estimate of performance on unseen data.