Your First Steps in Computer Vision: Using PyTorch with an Example

Contributed By: Dr C. S. Jyothirmayee

The term Computer Vision (CV) comes up often in artificial intelligence (AI) and deep learning (DL) applications. It essentially means giving a computer a sensory quality, ‘vision’: taking visual data and applying physics, mathematics, statistics and modelling to generate meaningful insights.

Computer Vision is used to gather image data, process it (from high level to low level) and analyse it to support different visual decisions. Visual data comes from many sources, such as traffic cameras, ATM cameras, mobile devices and satellites. What does processing mean here? It could be edge detection, classification, segmentation or telling apart the different objects present in a scene.

Image analysis shows up in techniques and applications such as:

Pattern recognition, image processing, signal processing, object detection, anomaly detection, industrial automation, medical image processing, self-driving vehicles, military applications and operating agricultural equipment.

Computer Vision used to face certain roadblocks, which have now largely been overcome. They were:

  1. Noisy or incomplete data
  2. Real-time processing
  3. High computing power and memory requirement

Today, we will take a deep dive into this field under two main aspects:

  1. Tool (PyTorch)
  2. Process (Neural Network)

Let’s first understand the tool before we move on to the process. We will discuss the following aspects of PyTorch:

  • What is PyTorch?
  • Brief History
  • How it differs from other similar frameworks (Keras and TensorFlow)
  • How to Install PyTorch
  • PyTorch Terminology

On its website, PyTorch describes itself as “FROM RESEARCH TO PRODUCTION: An open-source machine learning framework that accelerates the path from research prototyping to production deployment.” Simply put, PyTorch is a Python-based deep learning framework and scientific computing package that can use the power of graphics processing units (GPUs). It is designed to provide flexibility as a deep learning development platform, and its workflow stays as close as possible to that of NumPy, the Python scientific computing library.

The scientific computing aspect of PyTorch is primarily a result of PyTorch’s tensor library and associated tensor operations. 

What is a Tensor?

A tensor is an n-dimensional array. Tensors matter for deep learning and neural networks because they are the data structures that are ultimately used for building and training neural networks.

PyTorch tensors can be created directly from NumPy n-dimensional arrays, and on the CPU the two can share the same memory. This makes the transition between PyTorch and NumPy very cheap from a performance perspective.
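As a minimal sketch of that round trip (the array values are made up for illustration), the following shows a tensor built from a Python list, a tensor that wraps an existing NumPy array, and the conversion back to NumPy:

```python
import numpy as np
import torch

# Build a tensor directly from nested Python lists.
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# torch.from_numpy wraps an existing NumPy array without copying it.
a = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(a)

# Because the memory is shared, an in-place change to the array
# is visible through the tensor as well.
a[0] = 10.0

# Going the other way, .numpy() exposes a CPU tensor as a NumPy array.
back = t.numpy()

print(t.shape)            # torch.Size([2, 2])
print(shared[0].item())   # 10.0
print(type(back))
```

The shared memory is what keeps the conversion cheap; it also means a change made on one side shows up on the other.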

The major objectives of PyTorch Developers are:

  1. An easy-to-use API: it is as simple as Python can be (its predecessor, Torch, was built on a Lua + C combination).
  2. Python support: PyTorch integrates seamlessly with the Python data science stack and strongly resembles NumPy.
  3. Dynamic computational graphs: PyTorch builds its computational graphs at run time, as the code executes.

A little history: PyTorch was launched in October 2016 as a Python-based successor to Torch, and it was developed at Facebook. Facebook also operated Caffe2 (Caffe stands for Convolutional Architecture for Fast Feature Embedding). Converting a model defined in PyTorch into Caffe2 was a challenge. To this end, Facebook and Microsoft created the Open Neural Network Exchange (ONNX) in September 2017. Simply put, ONNX was developed to convert models between frameworks. Caffe2 was merged into PyTorch in March 2018, which facilitates the construction of extremely complex neural networks.

As machine learning took off as an answer to many problems in computer science, universities and organisations started experimenting with their own frameworks to support their daily research, and Torch was one of the early members of that family. Ronan Collobert, Koray Kavukcuoglu and Clement Farabet released Torch in 2002; later it was picked up by Facebook AI Research and by many people from other universities and research groups. Many start-ups and researchers adopted Torch, and companies started productising their Torch models to serve millions of users. Twitter, Facebook and DeepMind, among others, are part of that list. Torch was designed with three key features in mind, which made it a popular tool:

  • Ease of developing numerical algorithms
  • Ease of extension
  • Speed

PyTorch compared with TensorFlow and Keras

PyTorch is like NumPy in the way it manages computations, but it adds strong GPU support. Like NumPy, it has a backend written in C and C++, so both are much faster than pure-Python libraries.

Keras has a simple interface with a small list of well-defined parameters, which makes common model types easy to implement. Being a high-level API on top of TensorFlow, Keras makes TensorFlow easy to use. PyTorch provides a similar level of flexibility to TensorFlow, with a much cleaner interface.

Keras is not a framework on its own, but a high-level API that sits on top of other deep learning frameworks. At the time of writing, it supports TensorFlow, Theano and CNTK.

Differences between Keras, TensorFlow and PyTorch

  1. Keras: capable of running on top of TensorFlow, CNTK and Theano
  2. PyTorch: focused on direct work with array expressions
  3. TensorFlow and PyTorch: lower-level code that is less readable than the Keras equivalent
  4. PyTorch: based on Torch, an earlier deep learning framework built on Lua
  5. PyTorch: uses a dynamic graph, which means that the graph is generated on the fly as the operations are created

The architecture of PyTorch

PyTorch is made of different modules which help in executing deep learning models for CV and Natural Language Processing (NLP).

The interaction of these sub-packages with the core torch package makes deep learning possible. When you install PyTorch, you are setting up a computing framework for deep learning, for parallel matrix calculations and for other complex operations on your local machine.

How is this made possible? 

Computational graphs are used to record the operations that are applied to tensors inside neural networks. PyTorch uses a dynamic computational graph: the graph is generated on the fly as the operations are executed, which allows highly flexible integration with Python.
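A small sketch may make the idea concrete. It assumes nothing beyond the standard torch package; the input value 3.0 and the branch condition are arbitrary choices for illustration. Ordinary Python control flow decides which operations are recorded, and backward() then walks that recorded graph:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Ordinary Python control flow decides which operations run,
# so the recorded graph can differ from one call to the next.
if x.item() > 0:
    y = x ** 2       # this branch is recorded for x = 3
else:
    y = -x

y.backward()          # walk the recorded graph backwards
print(x.grad)         # dy/dx = 2 * x, so tensor(6.)
```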

Installing PyTorch with Anaconda and Conda

The recommended option is to use the Anaconda Python package manager.

Steps to Follow:

  1. Download and install Anaconda (choose the latest version that matches your system type)
  2. Go to PyTorch’s site and find the “Get Started Locally” section
  3. Specify the appropriate configuration options for your system environment
  4. Open Anaconda Prompt (NOT Anaconda Navigator)
  5. Run the presented command in the terminal to install PyTorch (the exact command depends on the options you selected), for example:

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

  6. Once done, you can import the torch package in a Python notebook to start using PyTorch.
  7. To verify the PyTorch install, type this in a Python notebook:
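The snippet that originally accompanied this step is not shown here. A minimal check, assuming only the standard torch package, could look like this:

```python
import torch

print(torch.__version__)          # the installed PyTorch version
print(torch.cuda.is_available())  # True only if a usable CUDA GPU is found
```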

If the torch.cuda.is_available() call returns False, it may be because you don’t have a supported Nvidia GPU installed on your system. However, don’t worry: a GPU is not required to use PyTorch. A GPU is a processor that is good at handling specialised computations such as parallel computing, while a central processing unit (CPU) is a processor that is good at handling general computations.

There is another option: you can use Google Colab.

Do we need parallel computing for Deep learning models?

Many of the computations that we do with neural networks can be easily broken into smaller computations in such a way that the smaller computations do not depend on one another. They can then be carried out in parallel to reach the result faster.

Any system with an Nvidia graphics chip has a built-in GPU. An Nvidia GPU is the hardware that enables parallel computations, while CUDA is the software layer that provides an API for developers. Luckily, PyTorch comes with CUDA support built in.

Depending on your workload, you can switch from CPU mode to GPU mode easily.
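As a rough sketch of that switch (the tensor and layer sizes are arbitrary), a common pattern is to pick a device once and move both tensors and modules to it:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.ones(2, 3)          # tensors are created on the CPU by default
t = t.to(device)              # move the tensor to the chosen device

layer = torch.nn.Linear(3, 1).to(device)  # modules move the same way
out = layer(t)
print(out.device.type)        # "cuda" or "cpu", depending on the machine
```

The same code runs unchanged on machines with and without a GPU, which is why the device is chosen at run time rather than hard-coded.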

Artificial Neural Network: The Process


We will be covering this area under these two broad headings:

  • Neural Network basics
  • Convolutional Neural Network

What is an Artificial Neural Network (NN)?

Biologically speaking, our brain is composed of neurons that act as a network for internal communication. Information travels from the brain to the organs through this neural network in the form of molecular signals (neurotransmitters), and feedback from the organs is sent back via the same route. This cross-talk happens at every instant, and that is how your body reacts to external stimuli such as sight, noise and touch.

Our brain is a unique computer: it gathers data from the environment through our sense organs and assimilates it to make rational decisions.

Deep learning is a specialised area of artificial intelligence in which many steps of the machine learning process are automated for a faster response. Artificial neural networks imitate, to some extent, how human neurons and the brain signal, learn, make sense of large amounts of data and make the right decisions with low error rates, abilities that underpin human survival.

A neural network is made of basic units called perceptrons. The perceptron concept was developed in the 1950s and 1960s by the scientist Frank Rosenblatt, based on earlier work by Warren McCulloch and Walter Pitts.

In deep learning, interconnected layers of artificial “neurons” (perceptrons) form a neural network. The idea is to replicate an abstract understanding of how we believe the human brain might process similar information and learn from its surroundings and sensory input.
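To make the basic unit concrete, here is a minimal sketch of a single perceptron with a step activation. The weights and bias are hand-picked, hypothetical values chosen so that the unit behaves like a logical AND; a real network would learn them from data.

```python
import torch

def perceptron(x, w, b):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    return int((torch.dot(w, x) + b).item() > 0)

# Hand-picked (hypothetical) weights that make the unit act like logical AND.
w = torch.tensor([1.0, 1.0])
b = -1.5

for a in (0.0, 1.0):
    for c in (0.0, 1.0):
        print(a, c, perceptron(torch.tensor([a, c]), w, b))
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only case in which the unit fires.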

The network ingests vast amounts of input data, processing it through multiple layers of neurons that learn increasingly complex features of the data at each layer.

Deep learning models come with some crucial requirements, pros and cons. Some of them are:

  1. Large datasets
  2. High computational cost (hardware)
  3. Less interpretable (often called a black box)
  4. Better adaptability to different domains (there are different types of NN)
  5. Feature engineering and modelling happen within the same algorithm

There are important concepts you should know about:

FeedForward: Feedforward neural networks are also known as Multi-layered Network of Neurons (MLN). These models are called feedforward because the information only travels forward in the neural network, through the input nodes, then through the hidden layers (single or many layers) and finally through the output nodes.

Backpropagation: Backpropagation works together with gradient descent. It calculates the gradient of the loss function at the output and distributes it back through the layers of a deep neural network. Gradient descent then uses these gradients to adjust the weights of the neurons. Although backpropagation may be used in both supervised and unsupervised networks, it is seen as a supervised learning method.

Gradient descent: Gradient descent is a fundamental technique in machine learning (ML) and DL. Essentially, it decides what adjustments to make to the model’s parameters after each iteration: each parameter is moved a small step in the direction that reduces the loss, so that the model progresses towards the desired accuracy at a given task.
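The two ideas can be sketched together on a toy problem. The loss (w - 3)^2, the starting value and the learning rate below are illustrative assumptions, not part of the case study that follows: backward() performs the backpropagation step, and the manual update is plain gradient descent.

```python
import torch

# Minimise the toy loss (w - 3)^2, whose minimum is at w = 3.
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate: the step size of each update

for step in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()              # backpropagation: compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad         # gradient descent: step against the gradient
    w.grad.zero_()               # reset the gradient before the next step

print(w.item())                  # close to 3.0
```

In practice an optimizer such as optim.Adam performs the update and the gradient reset, but the loop has the same shape.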

Activation functions (AF): AF are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. These functions are selected depending on the requirement.
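For a quick feel for how common activation functions behave, the sketch below applies three of them to the same arbitrary input values:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 2.0])

print(F.relu(z))         # negative inputs become 0, positive ones pass through
print(torch.sigmoid(z))  # squashes every value into the range (0, 1)
print(torch.tanh(z))     # squashes every value into the range (-1, 1)
```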


Convolutional neural network (CNN)

A CNN is a class of deep neural network used in Computer Vision (CV) for analysing visual imagery, and also for text analysis in natural language processing (NLP). Luckily, many of the computations that we do with neural networks can be easily broken into smaller computations in such a way that the smaller computations do not depend on one another. Neural networks are called embarrassingly parallel for this reason. One such example is a convolution.

A CNN is made of the following important components:

Convolution Layer (CL): The convolutional layer is the core building block of a CNN. The layer’s parameters consist of a set of learnable filters (or kernels), which have a small receptive field. These filters scan across the image pixels and gather information from each picture in the batch. When programming a CNN, the input is a rank-4 tensor with one axis each for the number of images, the image depth (colour channels), the image height and the image width. PyTorch orders these axes as (number of images) x (image depth) x (image height) x (image width).

Function: Convolutional layers convolve the input and pass their result to the next layer. This is like the response of a neuron in the visual cortex to a specific stimulus.
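The following sketch shows the shapes involved. The batch of four random grayscale images is made up for illustration; the layer settings match the first convolution used later in the case study.

```python
import torch
import torch.nn as nn

# A batch of 4 grayscale images, 28 x 28 pixels each.
# PyTorch orders the axes as (batch, channels, height, width).
images = torch.rand(4, 1, 28, 28)

# Six learnable 5 x 5 filters; each one produces its own feature map.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

feature_maps = conv(images)
print(conv.weight.shape)   # torch.Size([6, 1, 5, 5])
print(feature_maps.shape)  # torch.Size([4, 6, 24, 24])
```

Each 5 x 5 filter fits into a 28-pixel axis in 24 positions, which is why height and width shrink from 28 to 24.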

Activation layer (AL): The convolution layer generates a matrix (a feature map) that is usually somewhat smaller than the original image. This matrix is run through an activation layer, which introduces the non-linearity that lets the network learn complex patterns when it is trained via backpropagation. The activation function is typically ReLU.

Function: The activation function is usually an abstraction representing the rate of action potential firing in the neuron when a signal passes the required threshold. In its simplest form, this function is binary—that is, either the neuron is firing or not (0 or 1).

Pooling Layers (PL): “Pooling” is the process of further downsampling and reducing the size of the matrix. A filter is passed over the results of the previous layer and selects one number out of each group of values (typically the maximum, which is called max pooling). This allows the network to train much faster, focusing on the most important information in each feature of the image.

Function: It mainly helps in extracting sharp and smooth features. It is also done to reduce variance and computations. Max pooling helps in extracting low-level features like edges and points, while average pooling favours smooth features.
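The difference between the two pooling styles is easy to see on a tiny, hand-written 4 x 4 input (the values are arbitrary):

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4 x 4 values.
x = torch.tensor([[[[1.0, 2.0, 5.0, 6.0],
                    [3.0, 4.0, 7.0, 8.0],
                    [9.0, 1.0, 2.0, 0.0],
                    [1.0, 1.0, 4.0, 2.0]]]])

# Each 2 x 2 block is reduced to a single number.
max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)

print(max_pooled.squeeze())  # [[4., 8.], [9., 4.]]
print(avg_pooled.squeeze())  # [[2.5, 6.5], [3., 2.]]
```

Max pooling keeps the strongest response in each block, while average pooling blends the block into a smoother value.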

Fully connected layer (FC): a traditional multilayer perceptron structure. Its input is a one-dimensional vector representing the output of the previous layers. Its output is a list of probabilities for different possible labels attached to the image (e.g. dog, cat, bird). The label that receives the highest probability is the classification decision. 

Function: The objective of a fully connected layer is to take the results of the convolution/pooling process and use them to classify the image into a label (in a simple classification example).
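A minimal sketch of that last stage follows. The random feature tensor stands in for the output of the pooling layers, and the three labels and the untrained linear layer are hypothetical, so the predicted label itself is meaningless here; the point is the flow from features to probabilities to a decision.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the random weights are repeatable

# Stand-in for the output of the last pooling layer: 1 image, 12 maps of 4 x 4.
features = torch.rand(1, 12, 4, 4)

flat = features.reshape(1, -1)                    # one vector of 192 values per image
fc = nn.Linear(in_features=192, out_features=3)   # 3 hypothetical labels

scores = fc(flat)
probs = F.softmax(scores, dim=1)                  # turn raw scores into probabilities
labels = ["dog", "cat", "bird"]
print(probs)
print(labels[probs.argmax(dim=1).item()])         # label with the highest probability
```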

Feature maps (FM): With the output channels, we no longer have colour channels, but modified channels that we call feature maps. These so-called feature maps are the outputs of the convolutions that take place using the input colour channels and the convolutional filters.

Function: It gives us the final output to interpret and draw meaningful insights.

The pixels of each image will be converted into a single rank-4 tensor that will ultimately flow through our convolutional neural network.

What happens to the Colour Channels? 

Each colour channel will be flattened first. Then, the flattened channels will be lined up side by side on a single axis of the tensor. There are three colour channels, red, green & blue (RGB).
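As an illustration, the sketch below flattens a made-up 2 x 2 RGB image in which every pixel of a channel carries the same value, so the channel order is easy to follow in the result:

```python
import torch

# A tiny 3-channel (RGB) image, 2 x 2 pixels, in (channels, height, width) order.
img = torch.tensor([[[1, 1], [1, 1]],     # red
                    [[2, 2], [2, 2]],     # green
                    [[3, 3], [3, 3]]])    # blue

flat = img.flatten()   # collapse every axis into one
print(flat)            # tensor([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
```

The four red values come first, then the green ones, then the blue ones, all on a single axis.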

Let’s build a CNN model on an image dataset.

Case Study: Convolutional neural network project in PyTorch

The goal is to build a convolutional neural network for classifying images from the Fashion-MNIST dataset.

This dataset contains a training set of images (sixty thousand examples from ten different classes of clothing items). We will use PyTorch to build a convolutional neural network that can accurately predict the correct article of clothing given an input image.

The dataset consists of 70,000 images of Fashion articles with the following split:

  • 60,000 training images
  • 10,000 testing images

Fashion-MNIST is based on the assortment on Zalando’s website. Zalando is a German-based multinational fashion commerce company that was founded in 2008.

This is where the Fashion-MNIST dataset is available for download.

We will be accessing Fashion-MNIST through a PyTorch vision library called torchvision and building our first neural network that can accurately predict an output class given an input fashion image.

Before Starting:

Check your PyTorch version: make sure it is at least version 1.1.0.

Step 1: Initialise all important libraries and modules

import pdb  # the Python debugger

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from sklearn.metrics import confusion_matrix
#from plotcm import plot_confusion_matrix

torch.set_printoptions(linewidth=120)

Step 2: Defining our model Architecture

To define the model, we create a class that extends PyTorch’s nn.Module base class. In its constructor we declare our selection of convolutional and fully connected (linear) layers, and in its forward method we describe how a tensor flows through them, including the activation functions and pooling. Custom datasets are built in the same way, by creating a subclass that extends the functionality of the Dataset class.

Convolutional layers have three parameters and the linear layers have two parameters.

  • Convolutional layers
    • in_channels
    • out_channels
    • kernel_size
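  • Linear layers
    • in_features
    • out_features

In general, hyperparameters are parameters whose values are chosen manually by the network designer, usually through trial and error, rather than learned from the data.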

Operation                  Output Shape
Input image dimensions     torch.Size([1, 1, 28, 28])
Convolution (5 x 5)        torch.Size([1, 6, 24, 24])
Max pooling (2 x 2)        torch.Size([1, 6, 12, 12])
Convolution (5 x 5)        torch.Size([1, 12, 8, 8])
Max pooling (2 x 2)        torch.Size([1, 12, 4, 4])
Flatten (reshape)          torch.Size([1, 192])
Linear transformation      torch.Size([1, 120])
Linear transformation      torch.Size([1, 60])
Output dimensions          torch.Size([1, 10])
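These shapes follow from the standard output-size formula for convolution and pooling, O = (n - f + 2p) / s + 1, where n is the input width, f the filter size, p the padding and s the stride. The sketch below checks the arithmetic and then repeats the walk with real layers; the helper name out_size is made up for this example, and the layer weights are random because only the shapes matter here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def out_size(n, f, p=0, s=1):
    """Output width for input width n, filter size f, padding p and stride s."""
    return (n - f + 2 * p) // s + 1

# 28 -> conv 5x5 -> 24 -> pool 2x2 -> 12 -> conv 5x5 -> 8 -> pool 2x2 -> 4
print(out_size(28, 5), out_size(24, 2, s=2), out_size(12, 5), out_size(8, 2, s=2))

# The same walk with real layers.
t = torch.rand(1, 1, 28, 28)
t = F.max_pool2d(nn.Conv2d(1, 6, kernel_size=5)(t), kernel_size=2, stride=2)
print(t.shape)                 # torch.Size([1, 6, 12, 12])
t = F.max_pool2d(nn.Conv2d(6, 12, kernel_size=5)(t), kernel_size=2, stride=2)
print(t.shape)                 # torch.Size([1, 12, 4, 4])
print(t.reshape(1, -1).shape)  # torch.Size([1, 192])
```

The final 12 x 4 x 4 = 192 values are what the first linear layer has to accept as its in_features.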

# Initialising the model architecture by defining Network class and instructions for forward pass

class Network(nn.Module):
    def __init__(self):
        super().__init__()  # required so that nn.Module can register the layers below
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)
    def forward(self, t):
        # (1) input layer
        t = t
        # (2) hidden conv layer
        t = self.conv1(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        # (3) hidden conv layer
        t = self.conv2(t)
        t = F.relu(t)
        t = F.max_pool2d(t, kernel_size=2, stride=2)
        # (4) hidden linear layer
        t = t.reshape(-1, 12 * 4 * 4)
        t = self.fc1(t)
        t = F.relu(t)
        # (5) hidden linear layer
        t = self.fc2(t)
        t = F.relu(t)
        # (6) output layer
        t = self.out(t)
        #t = F.softmax(t, dim=1)
        return t

Step 3: Preparing data using PyTorch

  1. Extract – Get the Fashion-MNIST image data from the source.
  2. Transform – Put our data into tensor form.
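  3. Load – Put our data into an object to make it easily accessible.

# Downloading data from torchvision
train_set = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True,  # root is an assumed folder; the original arguments are cut off in the source
    transform=transforms.ToTensor())  # Transform: put the image data into tensor form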

We use the PyTorch DataLoader class to load our data in batches, and we create an instance of our Network class for the forward pass.

network = Network()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
batch = next(iter(train_loader)) # Getting a batch
images, labels = batch

We will be loading images batch-wise.

Step 4: Build the model & Calculating the loss

preds = network(images)
loss = F.cross_entropy(preds, labels) # Calculating the loss
def get_num_correct(preds, labels):
    return preds.argmax(dim=1).eq(labels).sum().item()
get_num_correct(preds, labels)
loss.backward() # Calculating the gradients
optimizer = optim.Adam(network.parameters(), lr=0.01)
optimizer.step() # Updating the weights
preds = network(images)
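To see what the loss and the correct-prediction count compute, here is a small standalone sketch with hand-written scores for three samples and four hypothetical classes. Note that F.cross_entropy expects raw scores and applies the softmax internally, which is why the softmax line in the forward method can stay commented out.

```python
import torch
import torch.nn.functional as F

# Raw scores (logits) for 3 samples and 4 hypothetical classes.
preds = torch.tensor([[2.0, 0.1, 0.1, 0.1],
                      [0.1, 0.2, 3.0, 0.1],
                      [0.5, 0.4, 0.3, 0.2]])
labels = torch.tensor([0, 2, 3])

loss = F.cross_entropy(preds, labels)   # applies log-softmax internally
num_correct = preds.argmax(dim=1).eq(labels).sum().item()

print(loss.item())
print(num_correct)   # 2: the third sample is predicted as class 0, not 3
```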

Step 5: Training on Single batch

  • Calculating the loss
  • Calculating the Gradients
  • Updating the Weights
  • Retraining
#Train Using a Single Batch
network = Network()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)
batch = next(iter(train_loader)) # Get Batch
images, labels = batch
preds = network(images) # Pass Batch
loss = F.cross_entropy(preds, labels) # Calculate Loss
loss.backward() # Calculate Gradients
optimizer.step() # Update Weights
print('loss1:', loss.item())
preds = network(images)
loss = F.cross_entropy(preds, labels)
print('loss2:', loss.item())

Step 6: Training with all batches (= One Epoch) 

Instead of handling a single batch, we now run the same steps in a loop over every batch.

We have 60,000 samples in our training set and a batch size of 100, so one epoch takes 60,000 / 100 = 600 iterations.

network = Network()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)
total_loss = 0
total_correct = 0
for batch in train_loader: # Get Batch
    images, labels = batch
    preds = network(images) # Pass Batch
    loss = F.cross_entropy(preds, labels) # Calculate Loss
    optimizer.zero_grad() # Clear the gradients left over from the previous batch
    loss.backward() # Calculate Gradients
    optimizer.step() # Update Weights
    total_loss += loss.item()
    total_correct += get_num_correct(preds, labels)
print(
    "epoch:", 0,
    "total_correct:", total_correct,
    "loss:", total_loss
)

Step 7: (Optional) Training with multiple epochs to reduce errors and get better predictions

Warning: This operation will take time as it must do multiple loops on 60,000 images.

network = Network()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=100)
optimizer = optim.Adam(network.parameters(), lr=0.01)
for epoch in range(10):
    total_loss = 0
    total_correct = 0
    for batch in train_loader: # Get Batch
        images, labels = batch
        preds = network(images) # Pass Batch
        loss = F.cross_entropy(preds, labels) # Calculate Loss
        optimizer.zero_grad() # Clear the gradients left over from the previous batch
        loss.backward() # Calculate Gradients
        optimizer.step() # Update Weights
        total_loss += loss.item()
        total_correct += get_num_correct(preds, labels)
    print(
        "epoch", epoch,
        "total_correct:", total_correct,
        "loss:", total_loss
    )

It is interesting to see in the output how the classification error reduces with each epoch.

Step 8: Creating a Function to get Predictions for ALL Samples

def get_all_preds(model, loader):
    all_preds = torch.tensor([])
    for batch in loader:
        images, labels = batch
        preds = model(images)
        all_preds = torch.cat(
            (all_preds, preds), dim=0)  # append this batch's predictions
    return all_preds

with torch.no_grad():
    prediction_loader = torch.utils.data.DataLoader(train_set, batch_size=10000)
    train_preds = get_all_preds(network, prediction_loader)

preds_correct = get_num_correct(train_preds, train_set.targets)
print('total correct:', preds_correct)
print('accuracy:', preds_correct / len(train_set))

In this run, the model achieved 89.25% accuracy: out of 60,000 images, it predicted 53,550 correctly.

Step 9: Building the Confusion Matrix

  1. Getting all predictions and targets together for comparison
  2. Making an empty data frame for the confusion matrix
  3. Getting all the elements in the confusion matrix

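stacked = torch.stack((train_set.targets, train_preds.argmax(dim=1)), dim=1) # one (true label, predicted label) pair per row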

cmt = torch.zeros(10,10, dtype=torch.int64)

for p in stacked:
    tl, pl = p.tolist()
    cmt[tl, pl] = cmt[tl, pl] + 1
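The counting logic can be tried on a toy example first. The five targets and predictions below are invented, and three classes are used instead of ten to keep the matrix readable:

```python
import torch

targets = torch.tensor([0, 0, 1, 1, 2])      # true labels
predicted = torch.tensor([0, 1, 1, 1, 0])    # predicted labels

# One (true label, predicted label) row per sample.
pairs = torch.stack((targets, predicted), dim=1)

cm = torch.zeros(3, 3, dtype=torch.int64)
for true_label, pred_label in pairs.tolist():
    cm[true_label, pred_label] += 1          # rows: true class, columns: predicted class

print(cm)
```

Entries on the diagonal count correct predictions; everything off the diagonal is a confusion between two classes.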


Step 10: Plotting confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# The cmap default is an assumption; the end of the signature is cut off in the source.
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    cm = np.asarray(cm)  # accept a torch tensor as well as a NumPy array
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# The ten Fashion-MNIST class names, in label order (0 to 9).
names = (
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
)
plot_confusion_matrix(cmt, names)

Some useful definitions:

API: An application program interface (API) is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact.

Tensors: Tensors are a type of data structure used in linear algebra, and like vectors and matrices, you can calculate arithmetic operations with tensors. Tensors are a generalisation of matrices and are represented using n-dimensional arrays.

Computational graphs: They are used to graph the function operations that occur on tensors inside neural networks.

GPU: A programmable logic chip (processor) specialised for display functions. The GPU renders images, animations and video for the computer’s screen. With GPU, the application runs faster because it’s using the massively parallel processing power. A CPU consists of four to eight CPU cores, while the GPU consists of hundreds of smaller cores.

CUDA: CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

Parallel Computing: It is a type of computation whereby a computation is broken into independent smaller computations that can be carried out in tandem/simultaneously.

Flattening: Flattening a tensor means removing all of its dimensions except one.
