Multi-class Image Classification with a CNN using PyTorch, and the Basics of Convolutional Neural Networks

Vatsal Saglani
9 min read · Jun 27, 2019


I know there are many blogs about CNNs and multi-class classification, but maybe this blog won’t be that similar to the others. Yes, it does have some theory, and no, the multi-class classification is not performed on the MNIST dataset. In this blog, multi-class classification is performed on an apparel dataset consisting of 15 different categories of clothes. The classes will be mentioned as we go through the coding part. The contents and links to the various parts of the blog are given below,

1. Bare bones of CNN

Generally, in a CNN, the input image is first multiplied with the convolution kernel in a sliding-window fashion, then pooling is performed on the convolved output, and later the result is flattened and passed to a Linear layer for classification. The following are the steps involved,

Step 1: Convolution

Step 2: Pooling

Given a kernel size of (2, 2), the kernel goes over the whole image as shown in the pictures and performs the selected pooling operation on each window.

In Max Pooling, the maximum-value pixel in each window is chosen, and in Average Pooling, the average of all the pixels in the window is taken.
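To make this concrete, here is a tiny sketch (not from the original post) comparing the two pooling operations on a 4x4 input:

```python
import torch
import torch.nn as nn

# A single-channel 4x4 "image" batch: shape (N, C, H, W) = (1, 1, 4, 4)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)  # picks the max of each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)  # averages each 2x2 window

print(max_pool(x))  # tensor([[[[ 5.,  7.], [13., 15.]]]])
print(avg_pool(x))  # tensor([[[[ 2.5000,  4.5000], [10.5000, 12.5000]]]])
```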

Step 3: Non-Linear Activation

Step 4: Flatten

Step 5: Linear layer and Classification
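Putting the five steps together, here is a minimal sketch of the whole pipeline on a dummy batch; the channel counts and image size are made up for illustration, and only the 15 output classes come from our dataset:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # a dummy RGB image batch

conv = nn.Conv2d(3, 16, kernel_size=3) # Step 1: convolution -> (1, 16, 30, 30)
pool = nn.MaxPool2d(kernel_size=2)     # Step 2: pooling     -> (1, 16, 15, 15)
act  = nn.ReLU()                       # Step 3: non-linear activation
fc   = nn.Linear(16 * 15 * 15, 15)     # Step 5: linear layer -> 15 class scores

out = act(pool(conv(x)))
out = out.view(out.size(0), -1)        # Step 4: flatten -> (1, 3600)
scores = fc(out)                       # Step 5: classification scores, shape (1, 15)
print(scores.shape)
```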

2. Layers involved in CNN

2.1 Linear Layer

The transformation y = Wx + b is applied at the linear layer, where W is the weight, b is the bias, y is the desired output, and x is the input. There are various naming conventions for a Linear layer; it’s also called a Dense layer or a Fully Connected layer (FC layer).
With deep learning, we tend to have many layers stacked on top of each other with different weights and biases, which helps the network learn various nuances of the data. But a Linear layer stacked directly over another Linear layer is quite unfruitful, because two consecutive linear transformations collapse into a single one.
Now, let’s assume we have two different networks: one having two Linear layers with weights 5 and 6 respectively, and the other having a single Linear layer with weight 30; no biases are considered for either network. Let’s look at how an input passes through these layers,
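Here is that thought experiment as a quick PyTorch sketch; the weights 5, 6, and 30 are the ones from the example above:

```python
import torch
import torch.nn as nn

x = torch.tensor([[2.0]])

# Network A: two stacked Linear layers (weights 5 and 6, no bias)
l1 = nn.Linear(1, 1, bias=False); l1.weight.data.fill_(5.0)
l2 = nn.Linear(1, 1, bias=False); l2.weight.data.fill_(6.0)

# Network B: a single Linear layer (weight 30, no bias)
l3 = nn.Linear(1, 1, bias=False); l3.weight.data.fill_(30.0)

with torch.no_grad():
    print(l2(l1(x)))  # tensor([[60.]]) -- 2 * 5 * 6
    print(l3(x))      # tensor([[60.]]) -- 2 * 30: the stack collapsed to one layer
```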

2.2 Non-Linear Activation Functions

2.2.1 Binary Step: f(x) = 0 for x < 0, 1 for x ≥ 0

2.2.2 Logistic (a.k.a. Soft Step): f(x) = 1 / (1 + e^(−x))

2.2.3 TanH: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))

2.2.4 ArcTan: f(x) = tan⁻¹(x)

2.2.5 Rectified Linear Unit (ReLU): f(x) = max(0, x)

2.2.6 Parametric Rectified Linear Unit (PReLU): f(x) = αx for x < 0, x for x ≥ 0

2.2.7 Exponential Linear Unit (ELU): f(x) = α(e^x − 1) for x < 0, x for x ≥ 0

2.2.8 Softplus: f(x) = ln(1 + e^x)

2.2.9 Softmax:

The Softmax function squashes the output of each unit to be between 0 and 1, like the sigmoid function, but it also divides each output by the sum of all the outputs so that they total 1.
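A quick sketch of that behaviour in PyTorch:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=0)

print(probs)        # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())  # tensor(1.) -- the outputs always sum to 1
```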

3. Loss Functions

We check the performance of our model via the loss function, and loss functions differ from problem to problem; for a multi-class classification task like this one, the standard choice is the cross-entropy loss.
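For the 15 apparel classes in this blog, a cross-entropy sketch looks like this (the logits and labels are dummies):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # combines log-softmax and negative log-likelihood

# A batch of 4 predictions over 15 apparel classes, plus the true labels
logits = torch.randn(4, 15)
labels = torch.tensor([0, 3, 14, 7])

loss = criterion(logits, labels)
print(loss)  # a single scalar; lower means better predictions
```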

4. Optimizers

During the training process, we tweak and change the parameters of our model to try and minimize the loss function. The optimizer ties together the loss function and the model parameters by updating the model in response to the output of the loss function. It shapes and molds the model into its most accurate form, while the loss function acts as a guide, telling the optimizer when it is moving in the right direction. To learn more about various optimizers, follow this link.
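As a concrete sketch of how an optimizer ties the loss to the parameters in PyTorch (the model, batch, and learning rate below are stand-ins, not the ones used later):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 15)              # stand-in model for illustration
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # lr is a placeholder

images = torch.randn(4, 10)            # dummy batch
labels = torch.randint(0, 15, (4,))

optimizer.zero_grad()                   # clear gradients from the previous step
loss = criterion(model(images), labels) # forward pass + loss
loss.backward()                         # backpropagate: compute new gradients
optimizer.step()                        # update the parameters to reduce the loss
```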

Coding begins here!!

5. DataLoader and tricks of the trade

Imports
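The original import cell isn’t reproduced here; a sketch of the imports a notebook like this typically needs:

```python
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
```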

Check for CUDA
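The standard check, picking the GPU when one is available:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # "cuda" when a GPU is available, otherwise "cpu"
```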

Partitioning data into Train and Valid
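The exact partitioning code isn’t shown here; a minimal sketch using torchvision’s ImageFolder and a random split (the folder path, image size, and 80/20 ratio are assumptions):

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),   # assumed image size
    transforms.ToTensor(),
])

# "apparel_images/" is a hypothetical path: one sub-folder per clothing class
dataset = datasets.ImageFolder("apparel_images/", transform=transform)

n_train = int(0.8 * len(dataset))            # assumed 80/20 split
train_set, valid_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False)
```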

*If an “ipynb_checkpoints” folder ends up in the same directory as your image folders, remove it before running the train and valid cell blocks, or else it will also be counted as a class.

6. Defining our model

Extra: Selecting the number of “In Features” for the first Linear layer after all the convolution blocks.

I have always struggled with counting the number of “In Features” at the first Linear layer, and had always thought that it must be Output Channels * Width * Height. Yes, we do calculate the number of “In Features” with exactly this formula, but there is a method to obtaining the height and width, so let’s check it out.

W = Width

F = Filter_size/Kernel_size

P = Padding

S = Stride

The output width after a convolution step is then,

W_out = (W − F + 2P) / S + 1

and the same formula applies to the height.

Let’s see this with an example from our own model.

Below, we will go through the stages by which we arrived at the number “15488” as the “In Features” for our first Linear layer.

Using the formula at every convolution step, we get the height and width of the image. At the pooling stage, we divide the height and the width by the kernel_size we provided in pooling; for example, if we provide kernel_size = 2 in the pooling stage, we divide both the height and the width by 2. At the end, we obtain the number “15488” as the total number of “In Features” for the first Linear layer after all the convolution blocks.
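A small helper makes this mechanical; the conv/pool configuration below is hypothetical, not the exact architecture of this post:

```python
def conv_out(w, f, p, s):
    """Output width/height after a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

w = 64                              # hypothetical input width (and height)
for _ in range(3):                  # three conv + pool blocks, also hypothetical
    w = conv_out(w, f=3, p=1, s=1)  # 3x3 conv, padding 1, stride 1 -> size unchanged
    w = w // 2                      # 2x2 pooling halves each dimension

print(w)             # 8
print(128 * w * w)   # Output Channels * W * H -> "In Features" for the Linear layer
```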

Let’s also look at it in a Layer by Layer fashion,

Part — 0
Part — 1
Part — 2
Part — 3
Part — 4

The above images show how an image batch passes through our architecture and how the output size is calculated at each stage.

7. Training and Validation

Create the model instance
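The model-creation cell isn’t shown here; a sketch, where `CNNModel` is a hypothetical name standing in for the class defined in section 6:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = CNNModel()        # hypothetical name for the class defined in section 6
model = model.to(device)  # train on the GPU when one is available
```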

Training the network
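The original training cell isn’t reproduced here; below is a hedged sketch of a typical epoch loop, assuming `model`, `criterion`, `optimizer`, `train_loader`, and `device` are set up as in the earlier sketches:

```python
n_epochs = 200  # the post trains in runs of a few hundred epochs

for epoch in range(n_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: loss {running_loss / len(train_loader):.4f}")
```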

Validation accuracy
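Likewise, a sketch of the validation-accuracy computation, assuming the `valid_loader` from the partitioning step:

```python
import torch

model.eval()
correct, total = 0, 0
with torch.no_grad():                        # no gradients needed for evaluation
    for images, labels in valid_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)  # index of the highest class score
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"validation accuracy: {100 * correct / total:.2f}%")
```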

Save the whole model
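Since we save the whole model object (not just the state dict), this is a one-liner; the filename is hypothetical:

```python
import torch

# Saves the entire model (architecture + weights), not just model.state_dict()
torch.save(model, "apparel_model.pth")  # "apparel_model.pth" is an assumed name
```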

Loading the saved model and training again
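And to resume training, load the pickled model back and switch it to training mode (again with the hypothetical filename):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.load("apparel_model.pth")  # restores the full model object
model = model.to(device)
model.train()                            # back to training mode to continue training
```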

Validation Accuracy after certain number of epochs

First 200 epochs

After 500 epochs

After 850 epochs

Let’s consider the odds of selecting the right apparel out of all the images: if we randomly choose any garment out of the 15 categories, the odds of choosing the one we want are 1/15, i.e., 6.66%, approximately 7%. But with our model architecture (no pre-trained weights) trained on the images for 850 epochs, we get an accuracy of 47%, i.e., the chance of getting an apparel item right is now 47%, and we could still increase the accuracy of the model by adding more convolution blocks or training for more epochs. (*It’s just that my free compute quota on GCP got over, so I couldn’t train for more epochs 😬.) This is a simple architecture; we could also add batch normalization, change the activation functions, and try different optimizers with different learning rates.

But before designing the model architecture and training it, I first trained a ResNet50 (pre-trained weights) on the images using FastAI.

8. Training Using FastAI and ResNet50

To set up FastAI on your machine or any cloud platform instance, follow this link.

Imports

Images path

Create DataBunch

Select model, create a learner, and start training
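The FastAI cells aren’t reproduced here; under the fastai v1 API that was current in mid-2019, the steps above look roughly like this (the path and hyperparameters are assumptions):

```python
from fastai.vision import *

path = "apparel_images/"  # hypothetical path: one sub-folder per clothing class

# Create a DataBunch straight from the folder structure, with a 20% validation split
data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=224, bs=32)

# ResNet50 with pre-trained weights, tracking accuracy on the validation set
learn = cnn_learner(data, models.resnet50, metrics=accuracy)
learn.fit_one_cycle(14)   # the post trains for 14 epochs
```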

Using ResNet50 with FastAI (just 5 lines of code 😯) and training for 14 epochs, we get an accuracy of 84%, which is much better than our own architecture.


If there are any mistakes, feel free to point them out in the comments section below. I have been working on deep learning projects, but this is my first blog about deep learning. Rachel Thomas’ article on why you should blog motivated me enough to publish this; it’s a good read, give it a try.

If you liked the article, please give it a clap or two, or any amount you can afford, and share it with other geeks and nerds like me and you 😁.

To know more about me, please click here, and if you find something interesting, just shoot me a mail and, if possible, we could have a chat over a cup of ☕️. For updated content of this blog, you can visit https://blogs.vatsal.ml

P.S.: The code base is still quite messy; I will gradually update it on GitHub.

Support this content 😃 😇

https://www.buymeacoffee.com/vatsalsaglani

I have always believed that knowledge must be shared without thinking about any rewards; the more you share, the more you learn. Writing a blog tutorial takes a lot of time in background research, organizing the content, and showing the proper steps to follow. Deep learning blog tutorials also require a GPU server to train the models on, which costs quite a bomb because all the models are trained overnight. I will keep posting all the content for free like always, but if you like the content and the hands-on coding approach of every blog, you can support me at https://www.buymeacoffee.com/vatsalsaglani. Thanks 🙏
