
Python Programming Notes

Below are some basic concepts that will help you get started with programming in Python.

This list is by no means exhaustive, and familiarity with another programming language will be helpful. I recommend keeping the official Python documentation at hand.

Comments can be added using #

Basic data types include numbers, strings, etc.

Strings are immutable in Python, which means you cannot modify a string once it is created. Any operation that appears to change a string returns a new one.

The format method can be used to insert values into a string before printing it

x1 = 'Jack'
print('{0} is new to the city'.format(x1))

Python does not allow special characters such as @, $, and % within identifiers. Python is a case-sensitive programming language.

Python is strongly object-oriented in the sense that everything is an object including numbers, strings and functions.

How to indent

Python does not use braces to group statements; it uses indentation instead. The convention is four spaces per level.
Example:
a = 5
if a == 5:
    print('Indentation works')

Functions

Functions in Python let you create reusable pieces of code. A function contains a block of code that performs a specific task, such as checking whether a number is prime. Let’s see a simple function that returns the sum of two numbers.

def getSum(a, b):
    return a + b

VarArgs parameters
Sometimes you may want to define a function that can take any number of arguments. This can be achieved by prefixing a parameter name with a star (*).

def getSum(*numbers):
    sum = 0
    for x in numbers:
        sum = sum+x
    return sum

print(getSum(3,4,5,6))

The * prefix collects all extra positional arguments into a tuple. With **, the extra keyword arguments are collected into a dictionary.
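As a small illustration of both forms, the sketch below uses a hypothetical function named describe (it is not part of the notes above) that accepts any mix of positional and keyword arguments and reports how Python packed them.

```python
def describe(*args, **kwargs):
    # args is a tuple of the positional values,
    # kwargs is a dict of the keyword values.
    return type(args).__name__, type(kwargs).__name__, args, kwargs

result = describe(3, 4, city="Pune")
print(result)
```

The names args and kwargs are only a convention; any parameter name works as long as it carries the * or ** prefix.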

DocStrings
A docstring lets us provide documentation for a function. A string on the first logical line of a function is the docstring for that function, and it is available through the __doc__ attribute.

def getSum(*numbers):
    """This function returns the sum of the numbers."""
    return sum(numbers)

print(getSum.__doc__)

Modules
If we want to group a set of functions together to make a library, we can create a module by putting the functions in a .py file. A module can also be written in a language like C; once compiled, it can be used by the Python interpreter.

To use a module we need to import it. To import a single function from a module we can use
from math import sqrt
from test_module import getDiff

Packages
Modules can be clubbed together in Packages.

Data Structures in Python
Data Structures are used to efficiently organise, manage and access data.

Python provides four built-in collection data structures: List, Tuple, Dictionary and Set.

List Data Structure
Elements in a List can be of different data types. They are ordered. A List is mutable meaning it can be modified.

myList = [2,3,'a']
print(myList)

To iterate through a List by index, for example to double every element in place, we can use

numbers = [1, 2, 3]
for i in range(len(numbers)):
    numbers[i] = numbers[i] * 2

List Operations
The + operator concatenates lists.
The * operator repeats a list.
The slice operator also works on lists: print(myList[2:3])
We can use append() to add a value to the end of the list.
sort() sorts the values of a list in place.
To add up all elements of a numeric list we can use sum(myList).
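The operations listed above can be tried together in a few lines. This is only a sketch; the variable names are chosen for illustration.

```python
my_list = [2, 3, 'a']

joined = my_list + [4, 5]   # concatenation
repeated = [0, 1] * 2       # repetition
sliced = my_list[2:3]       # slice containing only the third element

numbers = [5, 1, 4]
numbers.append(2)           # add a value at the end
numbers.sort()              # sort in place
total = sum(numbers)        # add up all elements

print(joined, repeated, sliced, numbers, total)
```

Note that sum works here because numbers holds only numeric values; calling it on my_list would fail because of the string element.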

Map, Reduce and Filter
Map applies a function to each element of a list and yields the results, for example mapping each character to its upper case.
Reduce combines all values of a list into a single value, for example by summing them.
Filter selects the subset of elements that satisfy some condition.
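A minimal sketch of the three ideas using the built-in map and filter and functools.reduce from the standard library. The sample data is made up for illustration.

```python
from functools import reduce

letters = ['a', 'b', 'c']
upper = list(map(str.upper, letters))                # map: one output per input

values = [1, 2, 3, 4]
total = reduce(lambda acc, x: acc + x, values)       # reduce: many values to one
evens = list(filter(lambda x: x % 2 == 0, values))   # filter: keep matching items

print(upper, total, evens)
```

In Python 3, map and filter return lazy iterators, which is why the results are wrapped in list here.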

Delete from a List
pop removes the element at a given index and returns it, for example d = myList.pop(1).
We can also use del with a slice, for example del myList[1:2].

Dictionaries Data Structure
A dictionary is a collection of key-value pairs. Since Python 3.7 a dictionary preserves insertion order; in earlier versions the order of items when printing could differ.
To traverse the keys in sorted order, you can use the built-in function sorted.
We cannot use a list as a key because it is mutable, so its hash value could change over time.
In a key-value pair the hash value is computed from the key.

names = {1: "Ram", 2: "Mike"}
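The points about sorted traversal and unhashable keys can be checked with a short sketch. The dictionary contents below are sample data, and the tuple key is shown as the usual immutable alternative to a list.

```python
names = {2: "Mike", 1: "Ram"}

# Traverse the keys in sorted order.
for key in sorted(names):
    print(key, names[key])

# A list cannot be used as a key because it is not hashable.
try:
    bad = {[1, 2]: "list as key"}
except TypeError as err:
    error_message = str(err)
    print(error_message)

# A tuple is immutable and hashable, so it works as a key.
coords = {(1, 2): "tuple as key"}
```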

Tuples are like Lists but are immutable.

t = (32, 34, 23)

Another way to create a Tuple is the tuple() constructor, for example t = tuple([32, 34, 23])
If we now try t[0] = 'g' we get TypeError: 'tuple' object does not support item assignment

Set Data Structure
Python also supports Sets. A set holds unique elements, and the elements are not kept in any particular order.

a = {2,3,4}

Why learn Python?

“Why should I learn Python?” is a valid question for anyone, especially if you come from a Java, C or JavaScript background.

Below I try to answer this question.

  • Python is a simple but powerful language that lets you focus on the problem at hand rather than on syntax.
  • It’s easy to learn. If you are familiar with a language like Java or JavaScript, it will be easy to get up to speed with Python.
  • It provides effective high-level data structures along with object-oriented features.
  • It’s free and open source.
  • It’s portable: the same code runs on most major platforms.
  • Interpreted: Python source does not need to be compiled to a native binary ahead of time. The interpreter compiles the source into an intermediate form called bytecode and then executes that bytecode on its virtual machine.
  • It supports both procedural and object-oriented programming.
  • Python comes with a rich standard library that covers databases, multithreading, regular expressions and more. Many high-quality third-party libraries are available as well.
  • Many developers find that Python makes programming easier for them.

Let’s look at some other factors which favour Python over low level languages like C.

C vs Python

While the best possible runtime performance can be achieved in a low-level language such as C, working in a high-level language such as Python usually reduces the development time and often results in more flexible and extensible code.

You can write the computationally expensive parts of your Python libraries in C.

Today CPU-hours are cheap and are getting cheaper, but man-hours are expensive. This makes a strong case for minimising development time rather than the runtime of a computation by using a high-level programming language and environment such as Python and its scientific computing libraries.

Hence a solution that partially avoids the trade-off between high- and low-level languages is to use a high-level language for interface libraries and low-level languages for implementations.

Python excels at this type of integration. Code written in C can be used for the computationally expensive operations, while Python is used at the high level for interfaces. This is an important reason why Python is a popular language for numerical computing.

Some other features include:

  • No braces needed. Statement grouping is done via indentation.
  • No variable declaration is needed.
  • High-level data types allow you to express complex operations in a single statement.

Python is quickly becoming the language of choice when it comes to Data Analysis and Machine Learning.  

Python for Engineering and Scientific Applications

SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

In particular, these are some of the core packages:

  1. NumPy
  2. SciPy
  3. Matplotlib
  4. IPython
  5. SymPy
  6. pandas


An introduction to NumPy

What is NumPy?

In Python, efficient data structures for working with arrays are provided by NumPy. These data structures form the core of the NumPy library. NumPy is primarily used for scientific computation. The elements in a NumPy array are all of the same type.

Why use NumPy?

The core of NumPy is implemented in C and is therefore efficient. For numerical work, using the data structures provided by NumPy usually improves performance over plain Python lists. It’s a very important part of the scientific Python ecosystem.

How does a Python List differ from a NumPy Array?

A Python list is a heterogeneous collection of elements, whereas in a NumPy array all elements are of the same data type and the array has a fixed size. In addition, NumPy provides a large set of functions to work with these data structures.

How to use NumPy?

To get started import the NumPy library using
import numpy as np

What is ndarray class in NumPy?

It is the main class to represent a multidimensional array. In addition to the element values, it also stores metadata such as type, shape and size.
One way to create an ndarray is to use the following code
Y = np.array([1, 2, 3, 4, 4, 5])

Typing help(np.ndarray) lists all the attributes.

Following attributes are provided as part of ndarray:

  • shape : Dimension of the array like (2,3)
  • size : Number of elements
  • dtype : The data type of the elements in the array.

For numerical work the most important data types are int (for integers), float (for floating-point numbers), and complex (for complex floating-point numbers).
To create an array of float type elements we can pass the built-in float as the dtype (the old np.float alias was removed in NumPy 1.24)
np.array([1, 2, 3], dtype=float)

Working with complex data type
data = np.array([1, 2, 3], dtype=complex)

We can either print this using
data

Or get the real and imaginary parts using the real and imag attributes
data.real
data.imag

In NumPy the default format to store multi dimensional array is row-major.

How to create ndarray?

Using np.array is a basic way to create an ndarray. But practically there may be requirements, like reading data from a file and creating an ndarray, which need to be handled differently. NumPy library provides a rich set of functions to handle this.

Let’s look at some of the functions

  • np.array : using a Python list for example. Ex. a = np.array([34,44,54]) creates a one dimensional array.
  • np.zeros : Array filled with 0s
  • np.ones : Array filled with 1s
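  • np.fromfile : read data from a binary file, or from a simple text file when a separator is given. np.loadtxt and np.genfromtxt are the usual choices for text files.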
  • np.random.rand : Generates an array with random numbers that are uniformly distributed between 0 and 1. Other types of distributions are also available in the np.random module .
  • np.full : create an array filled with a value. Ex. a = np.full(5, 2)


The NumPy library provides two functions to create a range of evenly spaced values.

For example, if we need a sequence like 2, 4, 6, there are two ways:

  • np.arange(start, end, increment) : end is not included in the result.
  • np.linspace(start, end, no_of_elements) : end is included by default.

np.logspace can be used to distribute elements logarithmically.
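To make the difference between the range functions concrete, here is a short sketch that builds the sequence 2, 4, 6 both ways and a small logarithmic sequence. The variable names are illustrative.

```python
import numpy as np

evens_a = np.arange(2, 8, 2)     # the stop value 8 is excluded
evens_b = np.linspace(2, 6, 3)   # both endpoints included, 3 elements
powers = np.logspace(0, 2, 3)    # 10**0, 10**1, 10**2

print(evens_a)
print(evens_b)
print(powers)
```

np.arange keeps the integer type of its arguments, while np.linspace and np.logspace return floating-point values.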

What is a Meshgrid Array?

For generating multidimensional coordinate grids we can use np.meshgrid.

Ex:
X = np.array([2,3,4])
Y = np.array([5,6,7])
a,b = np.meshgrid(X,Y)

Output
a = ([
   [2,3,4],
   [2,3,4],
   [2,3,4]
])
b = ([
   [5,5,5],
   [6,6,6],
   [7,7,7]
])

np.empty is also handy if we just want to allocate an array without initializing it. This can save some time, but the array then holds arbitrary values, so np.zeros is the safer choice unless every element is overwritten afterwards.

What is Slicing?

Let’s talk about 1-D arrays. Slicing selects a range of elements. We can use negative integers to index from the end of the array, like x = a[-2]

Look at the following slice examples:

  • a[m:n] selects elements in the array starting at m and ending at (n-1).
  • a[m:n:2] selects elements in the array starting at m and ending at (n-1) in increments of 2.
  • a[::-1] selects all elements in reverse order.
  • a[-5:] selects last 5 elements.

Example :
In : a = np.arange(0, 11)
In : a[1:-1:2]
Out : array([1, 3, 5, 7, 9])


Let’s talk about multi-dimensional arrays. In this case we can apply the slicing operation on each axis. Let’s use a lambda function to build a 6*6 array.

In : f = lambda m, n: n + 10 * m
In : a = np.fromfunction(f, (6, 6), dtype=int)
In : a
Out : array([[ 0,  1,  2,  3,  4,  5],
             [10, 11, 12, 13, 14, 15],
             [20, 21, 22, 23, 24, 25],
             [30, 31, 32, 33, 34, 35],
             [40, 41, 42, 43, 44, 45],
             [50, 51, 52, 53, 54, 55]])

Look at the following slice examples:
a[:,1] gives us the second column i.e. [1,11,21,31,41,51]
a[2:,:2] gives ([
[20,21],
[30,31],
[40,41],
[50,51]
])

What is Reshaping and Resizing?

Sometimes rearranging arrays can be helpful, like arranging an N*N matrix as a vector of size N^2.

Some of the functions that NumPy provides to reshape an ndarray are:

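  • np.ndarray.flatten : Creates a new 1-D array (a copy). Collapses all dimensions to just one.
  • np.reshape : Reshapes an n-dimensional array.
  • np.squeeze : Removes axes with length 1.
  • np.ravel : Similar to flatten, but returns a view of the original data whenever possible instead of a copy.
  • np.transpose : Reverses or permutes the axes of an array.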

Example

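In : data = np.array([[1, 2], [3, 4]])
In : np.reshape(data, (1, 4))  # returns a new array object; data is unchanged
Out : array([[1, 2, 3, 4]])
In : data.reshape(4)  # same behaviour as a method; data keeps its (2, 2) shape
Out : array([1, 2, 3, 4])

Neither call modifies data. Generally NumPy offers two kinds of operation: those that return a new array (often a view on the same memory) and those that modify the existing array in place, such as data.resize.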
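Whether a reshaping call returns a copy or a view can be checked directly. The sketch below is an illustration added for that purpose; np.shares_memory is used only to report whether two arrays use the same underlying buffer.

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])

flat_copy = data.flatten()   # always a copy
flat_view = data.ravel()     # a view here, because data is contiguous
reshaped = data.reshape(4)   # also a view; data keeps its (2, 2) shape

flat_copy[0] = 100           # does not touch data
flat_view[1] = 200           # writes through to data

print(data)
print(data.shape)
print(np.shares_memory(data, reshaped), np.shares_memory(data, flat_copy))
```

Writing through a view changes the original values but never the original shape, which is why data is still a 2*2 array at the end.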

Arithmetic Operations on Matrices

We can perform standard arithmetic operations on Matrices.

Ex
In : x = np.array([[1, 2], [3, 4]])
In : y = np.array([[5, 6], [7, 8]])
In : x + y
Out: array([[ 6, 8],
[10, 12]])

If we multiply a matrix with a scalar then it will apply to all the elements of the matrix.

When we apply an arithmetic operation we get a new array. For large arrays this can increase the memory footprint and degrade performance, so an in-place operation is often the better choice in such cases.

Like
x = x + y  # uses __add__ and creates a new array
x += y  # uses __iadd__, which operates in place

Refer to – https://stackoverflow.com/questions/4772987/how-are-python-in-place-operator-functions-different-than-the-standard-operator
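A quick way to see the difference between the two forms is to keep a second name for the same array. The variable alias is introduced only for this illustration.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

alias = x                  # second name for the same array object
x += y                     # in place: the existing array is updated
same_object = alias is x   # still the same object

x = x + y                  # new array: x is rebound to the result
rebound = alias is not x   # alias still refers to the old array

print(same_object, rebound)
print(alias)
print(x)
```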

Similar behaviour is observed for other mathematical operations such as trigonometric and logarithmic functions. They are applied as vectorized operations and return a new array.

Aggregate Functions

We can perform aggregate operations using NumPy, like taking the sum of all array elements or finding the mean, median or standard deviation. We can also specify the axis along which the operation is performed.

Like a.sum(axis=0)
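The effect of the axis argument is easiest to see on a small 2-D array. The sketch below uses made-up sample values.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()              # over all elements
col_sums = a.sum(axis=0)     # collapse the rows, one value per column
row_means = a.mean(axis=1)   # collapse the columns, one value per row
median = np.median(a)        # median of all elements
spread = a.std()             # standard deviation of all elements

print(total, col_sums, row_means, median, spread)
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row.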

Set Operations

NumPy also provides the capability to work with sets. A set stores unique, unordered elements, and NumPy provides functions that treat arrays as sets.
For example, np.unique returns the sorted unique elements of an array.
a = np.unique([1, 2, 3, 3])
We can also perform operations such as union and intersection between two sets.
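A short sketch of the set functions on two small arrays. The second array b is sample data added for illustration.

```python
import numpy as np

a = np.unique([1, 2, 3, 3])      # sorted unique values
b = np.array([3, 4, 5])

union = np.union1d(a, b)         # values in either array
common = np.intersect1d(a, b)    # values in both arrays
only_a = np.setdiff1d(a, b)      # values in a but not in b
is_member = np.isin(2, a)        # membership test

print(a, union, common, only_a, is_member)
```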

Summary

NumPy is a core library for computing with Python that provides a foundation for nearly all computational libraries for Python. Familiarity with the NumPy library and its usage patterns is a fundamental skill for using Python for scientific and technical computing.

Locality Sensitive Hashing using Euclidean Distance

It’s quite similar to Locality Sensitive Hashing (LSH) for Cosine Similarity, which we covered earlier. I will refer to that material here, so it’s better to go through it before proceeding.

The difference lies in the way we compute the hash value. As we have seen, we can divide the space into regions using planes, and each region can contain data-points.

Follow these steps (refer to diagram)

LSH for Euclidean

1. Divide the plane into small parts.
2. Project each data-point on the planes.
3. For each data-point take the distance along each plane and use it to calculate the hash value.

The rest of the procedure, such as finding the nearest neighbours, is similar to the cosine similarity process.
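One common way to realise these steps is the random-projection scheme, in which a point x is projected onto a random direction a, shifted by a random offset b, and cut into segments of width w, giving the hash floor((a·x + b) / w). This is an assumption about what the diagram describes, not a reproduction of it. The function name euclidean_hash and the parameters directions, offsets and bucket_width are hypothetical names chosen for this sketch.

```python
import numpy as np

def euclidean_hash(point, directions, offsets, bucket_width):
    """Return one integer bucket index per random direction."""
    projections = directions @ point   # distance of the point along each direction
    buckets = np.floor((projections + offsets) / bucket_width)
    return tuple(int(b) for b in buckets)

rng = np.random.default_rng(0)
dim, n_hashes, bucket_width = 2, 3, 4.0
directions = rng.normal(size=(n_hashes, dim))            # random projection directions
offsets = rng.uniform(0, bucket_width, size=n_hashes)    # random shift per direction

p = np.array([1.0, 2.0])
key = euclidean_hash(p, directions, offsets, bucket_width)
print(key)
```

Points that are close in Euclidean distance have similar projections, so they tend to fall into the same segment and receive the same key. A larger bucket_width groups more points together.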

Locality Sensitive Hashing using Cosine Similarity

The problem we are trying to solve is to predict the class of a new data point, given a dataset with pre-classified data points.

Two key ideas we will use here are the k-NN algorithm and LSH. If you don’t know these concepts, I suggest you check them out first.

What is Cosine Similarity?
At a high level cosine similarity can tell us how similar two points are. To do this we compute the vector representation for the two points and then find the angle between the two vectors.

The similarity between vectors a and b can be given by cosine of the angle between them.
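Written out, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. A minimal sketch with NumPy follows; the function name cosine_similarity is chosen here for illustration, and the sketch assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); assumes both vectors are non-zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 0], [2, 0])
orthogonal = cosine_similarity([1, 0], [0, 3])
opposite = cosine_similarity([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

The value ranges from 1 for vectors pointing the same way, through 0 for perpendicular vectors, to -1 for vectors pointing in opposite directions. The lengths of the vectors do not affect the result.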

We can use this concept to calculate the hash value for a data point.

Now that we know cosine similarity we can use this to calculate LSH values for data points. To do this we divide the space using hyperplanes.

Refer to the image below to understand each of the points explained next.

Hashing using Cosine Similarity

For simplicity consider a 2-D space with X-Y axis. We can divide this into 4 regions by using 2 planes / lines L1 and L2.

So a data point “A” will reside in one of these regions. For each plane we can find in which direction the point “A” lies, by using the concept of normal vector.

This way we can find the value for each plane. For each plane the value will be either +1 or -1. We can use these values to calculate the hash key.

Once we have the hash table in place, we can use it to determine the key for a new data-point and then find its nearest neighbours.

Say the new point lands in the bucket with key = 1. Then we know it’s near the points A and B. Next, apply k-NN to find its classification.
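The sign test against each plane can be sketched as below. Several things here are assumptions for illustration: the planes pass through the origin, the normals of L1 and L2 are taken to be the coordinate axes, and the +1 / -1 values are packed into an integer key by treating +1 as bit 1 and -1 as bit 0. The function name cosine_hash is hypothetical.

```python
import numpy as np

def cosine_hash(point, normals):
    """Return the +1 / -1 side of each plane and an integer key built from them."""
    signs = np.where(normals @ point >= 0, 1, -1)
    key = 0
    for sign in signs:
        key = key * 2 + (1 if sign > 0 else 0)   # +1 becomes bit 1, -1 becomes bit 0
    return signs, key

normals = np.array([[1.0, 0.0],    # normal vector of L1 (assumed)
                    [0.0, 1.0]])   # normal vector of L2 (assumed)

signs_a, key_a = cosine_hash(np.array([2.0, 3.0]), normals)
signs_b, key_b = cosine_hash(np.array([-1.0, 4.0]), normals)
print(signs_a, key_a)
print(signs_b, key_b)
```

With two planes there are four possible keys, one per region. In practice the normal vectors are usually drawn at random, and more planes give finer regions.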

What is a k-d tree

A k-d tree is a binary-tree based data structure. It can be used for data which is k-dimensional.

What do we mean by k-dimensional?
You may have heard of 2-D and 3-D. Similarly, we can have spaces with more dimensions, where each point has k coordinates.

Why do we need k-d tree?
Using k-d tree we can partition a k-dimensional space into regions. This allows for efficient searching. It’s used in computer graphics and nearest neighbour searches.

How it works?
Let us consider a 2-D dataset. A point in this can be represented as <X,Y>

Constructing a k-d tree

We follow these steps to construct the tree.

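1. Select a dimension and project all points along that. Example: Project all points along X-axis.
2. Take the mean of the projected values along the X-axis. Let’s call it X_M. (Many implementations use the median instead, which keeps the tree balanced.)
3. Split using the mean. If the X value of a point is < X_M then the point goes to the left sub-tree, else to the right.
4. Repeat steps 1-3 on each sub-tree, moving to the next dimension each time and cycling through all dimensions.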

By constructing this tree we have partitioned the space into smaller regions. Given a new query point we can traverse through the Tree to find an appropriate region.
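The construction steps can be sketched as a short recursive function. This is a minimal illustration that follows the mean-based split described above; the names build_kd_tree, find_region and leaf_size are hypothetical, the tree is stored as nested dictionaries for simplicity, and a node becomes a leaf as soon as a split would leave one side empty.

```python
import numpy as np

def build_kd_tree(points, depth=0, leaf_size=1):
    """Recursively split points on the mean of one coordinate per level."""
    points = np.asarray(points, dtype=float)
    axis = depth % points.shape[1]            # cycle through the dimensions
    split = points[:, axis].mean()            # X_M for this level
    left = points[points[:, axis] < split]
    right = points[points[:, axis] >= split]
    # Stop when the node is small enough or the split separates nothing.
    if len(points) <= leaf_size or len(left) == 0 or len(right) == 0:
        return {"points": points}
    return {"axis": axis, "split": split,
            "left": build_kd_tree(left, depth + 1, leaf_size),
            "right": build_kd_tree(right, depth + 1, leaf_size)}

def find_region(tree, query):
    """Walk down the tree and return the points stored in the query's region."""
    while "points" not in tree:
        side = "left" if query[tree["axis"]] < tree["split"] else "right"
        tree = tree[side]
    return tree["points"]

sample_points = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
tree = build_kd_tree(sample_points)
region = find_region(tree, [9, 5])
print(region)
```

The points returned by find_region are only candidates: the true nearest neighbour can lie in an adjacent region, so a complete search also checks neighbouring branches.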

What is Locality Sensitive Hashing

It’s a beautiful technique to find similar points, that is, nearest neighbours. It uses the concept of hashing for this.

You may be familiar with how a hash table is constructed. If not, a quick refresher on hashing concepts is worthwhile. It’s a very efficient data structure which allows us to perform lookups and insertions in O(1) time on average.

How it Works?

We apply a hash function to each available data point. This gives us the bucket or the key where the point can be placed.

We try to maximize collisions, so that similar points go to the same bucket.

Now if we get an unlabelled query point and apply the hash function, it will go to the same bucket where similar points exist.

This makes it easy to find the nearest neighbours for the query point. Subsequently we can apply k-NN to predict its class.
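Put together, the whole flow (hash every labelled point into a bucket, hash the query, then run k-NN inside the matching bucket) might look like the sketch below. It assumes a sign-based hash against two fixed planes through the origin; the function names sign_key, build_table and predict, as well as the sample points and labels, are made up for illustration.

```python
from collections import Counter, defaultdict
import numpy as np

def sign_key(point, normals):
    """Hash a point to a tuple of 0/1 values, one per plane."""
    return tuple(int(bit) for bit in (normals @ point >= 0))

def build_table(points, labels, normals):
    """Place every labelled point in the bucket given by its hash key."""
    table = defaultdict(list)
    for point, label in zip(points, labels):
        table[sign_key(point, normals)].append((point, label))
    return table

def predict(query, table, normals, k=3):
    """Run k-NN only among the points that share the query's bucket."""
    bucket = table.get(sign_key(query, normals), [])
    if not bucket:
        return None   # no similar points were hashed to this bucket
    nearest = sorted(bucket, key=lambda item: np.linalg.norm(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

normals = np.array([[1.0, 0.0], [0.0, 1.0]])
points = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
labels = ["red", "red", "blue", "blue"]

table = build_table(points, labels, normals)
prediction = predict(np.array([1.5, 1.0]), table, normals)
print(prediction)
```

Only the points in one bucket are compared with the query, which is where the speed-up over a full k-NN scan comes from. The price is that a true neighbour sitting just across a plane is missed, which is why practical systems usually combine several hash tables.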