PyTorch tutorial – Creating Convolutional Neural Network [2020]

What is a Convolutional Neural Network?

In our previous article, we discussed how a simple neural network works. The problem with fully connected neural networks is that they are computationally expensive. Also, adding lots of layers introduces some problems:

1. We run into the vanishing gradient problem.

2. The number of trainable parameters grows rapidly. Because of this, training slows down or becomes practically impossible.

3. A fully connected neural network is easily prone to overfitting.

A Convolutional Neural Network (or CNN) solves these problems by exploiting correlations between adjacent inputs in the data (e.g. an image or a time series). This means that not every node in the network is connected to every node in the next layer, which cuts down the number of weight parameters that need to be trained in the model.
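To get a feel for how much weight sharing saves, here is a rough, illustrative comparison (the layer sizes are hypothetical examples, not taken from the model we build below):

```python
# Rough parameter-count comparison for a 28x28 grayscale image
# (illustrative numbers only; biases are ignored).

# Fully connected: every input pixel connects to every hidden unit.
fc_params = 28 * 28 * 1000   # 784 inputs -> 1000 hidden units
print(fc_params)             # 784000 weights

# Convolutional: 32 filters of size 5x5, shared across the whole image.
conv_params = 32 * 5 * 5 * 1  # 32 filters x 5x5 window x 1 input channel
print(conv_params)            # 800 weights
```

The convolutional layer needs roughly a thousandth of the weights, because each filter is reused at every position of the image.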

How a Convolutional Network works

According to Wikipedia, a convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In a convolutional neural network, this means that we move a window or filter across the image being studied. We can imagine this as a 2×2 (or 3×3) filter sliding across all the available nodes/pixels in the input image.
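The sliding-window idea can be sketched in a few lines of plain Python (toy numbers, chosen only for illustration):

```python
# A 2x2 filter sliding over a 3x3 input with stride 1.
image = [[1, 2, 0],
         [0, 1, 3],
         [4, 1, 0]]
kernel = [[1, 0],
          [0, 1]]

out = []
for i in range(2):            # output height = 3 - 2 + 1 = 2
    row = []
    for j in range(2):        # output width  = 3 - 2 + 1 = 2
        # Element-wise multiply the window by the kernel and sum.
        s = sum(image[i + a][j + b] * kernel[a][b]
                for a in range(2) for b in range(2))
        row.append(s)
    out.append(row)

print(out)  # [[2, 5], [1, 1]]
```

Each output value summarizes one position of the window; real CNNs learn the kernel values during training instead of fixing them by hand.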

CNN (source: Wikipedia)
Now let’s look at some of the important terms used in CNNs.

Feature mapping (or activation map)

Whenever we slide a filter matrix (or sliding window) across the image to compute the convolution operation, the resulting matrix is called a feature map. Because of this mapping, our network learns to recognize common geometric features such as edges and lines.

Convolution layers typically use multiple filters (for example, an image may consist of 3 channels: red, green, and blue). Each filter produces its own 2D output, and these outputs are stacked to form a single output (the feature map or activation map). Each output channel is trained to detect certain key features. The feature map therefore has one more dimension than a single 2D input channel: it is a stack of 2D maps, one per filter.
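A quick way to see this channel stacking is to pass a dummy tensor through a single nn.Conv2d layer (the filter count of 8 here is an arbitrary example):

```python
import torch
import torch.nn as nn

# One conv layer with 3 input channels (RGB) and 8 filters: each filter
# produces its own 2D map, and the maps are stacked along the channel axis.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 28, 28)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 28, 28]) -- one 28x28 map per filter
```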


Pooling

Pooling is a type of sliding-window technique, but instead of applying weights, we apply a statistical function over the contents of the window. The most common type of pooling is max pooling, whose output is the maximum value in the window.

Max pooling (source: Wikipedia)
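Max pooling can also be sketched in plain Python (a toy 4×4 input with a 2×2 window and stride 2):

```python
# 2x2 max pooling with stride 2: keep only the largest value per window.
x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 3, 2],
     [2, 0, 1, 4]]

pooled = []
for i in range(0, 4, 2):      # step by 2 rows (stride 2)
    row = []
    for j in range(0, 4, 2):  # step by 2 columns (stride 2)
        window = [x[i][j], x[i][j+1], x[i+1][j], x[i+1][j+1]]
        row.append(max(window))
    pooled.append(row)

print(pooled)  # [[4, 5], [2, 4]] -- a 4x4 input becomes 2x2
```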


Strides

The stride is the number of pixels the filter shifts over the input image. For example, with a stride of 1 we move the window 1 pixel at a time; with a stride of 2 we move the window 2 pixels at a time, and so on.

In pooling, if the stride is greater than 1, the output size is reduced. For example, if the input image is of size 5×5 and we apply a stride of 2, our output will be 3×3. This process is called downsampling, and it reduces the number of training parameters in the layers that follow.


Padding

Sometimes our sliding window does not fit perfectly over the input image. In that case we have two options:

  •  Pad the image with zero(also called zero-padding).
  •  Drop or clip the part of the image where our filter window does not fit. This is also known as valid padding, which keeps only the valid part of the image.
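The difference between the two options shows up directly in the output shape. Here is a small sketch using nn.Conv2d with the same 5×5 kernel we use later in the tutorial:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # one grayscale 28x28 image

# Zero-padding: padding=2 with a 5x5 kernel keeps the 28x28 size.
same = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=2)
print(same(x).shape)   # torch.Size([1, 1, 28, 28])

# "Valid" padding: padding=0 drops the border where the window can't fit.
valid = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=0)
print(valid(x).shape)  # torch.Size([1, 1, 24, 24])
```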

Implementation of Convolutional Neural Network

First, let’s import our necessary libraries:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets,transforms
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def get_data(batch_size = 200,train = True):
    # Download MNIST, convert images to tensors, and normalize them.
    dataset = datasets.MNIST('../data', train=train, download=True,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))
                             ]))
    # Wrap the dataset in a DataLoader so we can iterate over batches.
    data = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return data

Now let’s create our model class

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.l1 = nn.Sequential(
                nn.Conv2d(1,32,kernel_size = 5,stride = 1,padding = 2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size = 2, stride = 2))
        self.l2 = nn.Sequential(
                nn.Conv2d(32,64,kernel_size = 5,stride = 1,padding = 2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size = 2, stride = 2))
        self.drop_out = nn.Dropout()
        self.fc1 = nn.Linear(7*7*64,1000)
        self.fc2 = nn.Linear(1000,10)

As we discussed in the last section, to create a neural network all you need to do is inherit from the nn.Module class in your model (and call super().__init__() in the constructor).

Next, we create a sequential object. The nn.Sequential() method allows us to define an ordered sequence of layers in our network. In our case, we have created a convolution -> ReLU -> pooling sequence.

The nn.Conv2d() method is used to create a set of convolution filters. The first parameter we pass is the number of input channels, which is 1 for a grayscale image, and the next is the number of output channels. The kernel_size argument is the size of the sliding window over our image, and the last two arguments are the stride and padding.

To know the dimensions of our output image we have the formula:

Wout = ((Win − F + 2P) / S) + 1 (the same formula applies to the height)

Win = width of the input

Wout = width of the output

F = filter size (or window size)

P = padding

S = stride

If we wish to keep our input and output dimensions the same with kernel_size = 5 and stride = 1, plugging these values into the above formula gives padding = 2.
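As a sanity check, the formula can be wrapped in a small helper (integer division, since output sizes are whole pixels):

```python
def conv_out(w_in, f, p, s):
    """Wout = ((Win - F + 2P) / S) + 1"""
    return (w_in - f + 2 * p) // s + 1

# 28x28 MNIST input, 5x5 kernel, padding 2, stride 1 -> size preserved.
print(conv_out(28, 5, 2, 1))  # 28

# 2x2 max pooling with stride 2 and no padding -> size halved.
print(conv_out(28, 2, 0, 2))  # 14
```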

nn.ReLU() is our activation function. The last element of the sequential object self.l1 is nn.MaxPool2d. The first argument given to this method is the pooling size, which we have set to 2×2. Next, we downsample our data by giving it a stride of 2. So our 28×28-pixel image will now be 14×14 pixels (you can also plug the values into the above formula; you will get Wout = 14 with padding = 0).

Similarly, we have defined our second layer, self.l2. After this operation our output will have dimensions of 7×7 pixels.

Next, we use nn.Dropout to avoid overfitting. Finally, we create two fully connected layers, self.fc1 and self.fc2, using nn.Linear(). The input size of the first layer is 7×7×64, which comes from the image dimensions and the number of channels output by the previous sequential layer; it is connected to 1000 nodes. Finally, these 1000 nodes are connected to 10 output nodes.

Now let’s write our forward() method to tell the network how our data should flow.

    def forward(self,x):
        x = self.l1(x)
        x = self.l2(x)
        x = x.reshape(x.size(0),-1)
        x = self.drop_out(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x

The first two lines of forward are straightforward. After the line self.l2(x) we apply the reshape method, which flattens our data from dimensions 7×7×64 into a single vector of 3136 values per sample. Next, dropout is applied, followed by the two fully connected layers, with the final output being returned.
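A quick shape check of this flattening step, using a dummy tensor with the same shape as the output of self.l2 for a batch of 200 images:

```python
import torch

# Output of the second conv/pool layer: (batch, channels, height, width).
x = torch.randn(200, 64, 7, 7)

# reshape keeps the batch dimension and flattens the rest: 64*7*7 = 3136.
flat = x.reshape(x.size(0), -1)
print(flat.shape)  # torch.Size([200, 3136])
```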

Now we create our main function:

if __name__  == '__main__':
    cnn = Model()
    cnn.to(device)  # move the model to the GPU if one is available

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(cnn.parameters(),lr = 0.01)

First of all, we create an instance of our Model class called cnn and move it to the device. Next, we define our loss function, which in our case is nn.CrossEntropyLoss(). Then we define our optimizer; here I have used the Adam optimizer, which takes the model parameters as its first input, and we have set the learning rate to 0.01.

Now let’s train our model:

    train = get_data()
    for epoch in tqdm(range(10)):
        for i,(images,target) in enumerate(train):
            images = images.to(device)
            target = target.to(device)

            out = cnn(images)
            loss = criterion(out,target)

            # Back-propagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            _,pred = torch.max(out.data,1)
            correct = (pred == target).sum().item()

            if i % 100 == 0:
                print(f"epoch: {epoch}\tloss: {loss.item()}\tAccuracy: {(correct/target.size(0)) * 100}%")

In training, I have used 10 epochs. Next, we iterate over our DataLoader object (i.e. train), then we pass our batches (i.e. images) to our model. Using criterion we calculate the loss. The next step is to perform back-propagation: first we zero the gradients using optimizer.zero_grad(), then we call backward() on our loss variable to perform back-propagation. After the gradients have been calculated, we update our model using the optimizer.step() method.

Now to test the model:

    # Testing
    test = get_data(train = False)

    cnn.eval()  # switch to evaluation mode so dropout is disabled
    with torch.no_grad():
        correct = 0
        total = 0
        for i,(images,target) in tqdm(enumerate(test)):
            images = images.to(device)
            target = target.to(device)

            out = cnn(images)
            _,pred = torch.max(out.data,1)
            total += target.size(0)
            correct += (pred == target).sum().item()
        print(f"Accuracy: {(correct/total) * 100}")

The testing loop is very similar to the training loop, except that we don't want to update the weights of our model. For this, we use torch.no_grad(), which tells PyTorch not to build the computation graph or track gradients; since we also never call the optimizer here, no weights are updated.
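A minimal sketch of what torch.no_grad() does: tensors computed inside the block are detached from the computation graph, so no gradients can flow through them.

```python
import torch

w = torch.ones(3, requires_grad=True)

y = (w * 2).sum()
print(y.requires_grad)      # True -- this result is tracked for backward()

with torch.no_grad():
    z = (w * 2).sum()
print(z.requires_grad)      # False -- no graph is built inside no_grad
```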


epoch: 6	loss: 0.35762882232666016	Accuracy: 86.0%
epoch: 6	loss: 0.345419317483902	Accuracy: 88.5%
epoch: 6	loss: 0.3479294180870056	Accuracy: 90.0%
epoch: 7	loss: 0.5705909729003906	Accuracy: 85.0%
epoch: 7	loss: 0.36838656663894653	Accuracy: 89.5%
epoch: 7	loss: 0.2633790671825409	Accuracy: 91.5%
epoch: 8	loss: 0.31612157821655273	Accuracy: 89.0%
epoch: 8	loss: 0.2783728539943695	Accuracy: 91.0%
epoch: 8	loss: 0.38499078154563904	Accuracy: 90.5%
epoch: 9	loss: 0.2609010934829712	Accuracy: 91.5%
epoch: 9	loss: 0.32276371121406555	Accuracy: 88.5%
epoch: 9	loss: 0.4422188997268677	Accuracy: 87.0%

Accuracy: 89.43333333333334
