PERFORMANCE OF DIFFERENT NEURAL NETWORKS ON CIFAR-10 DATASET

How different architectures perform on the CIFAR-10 dataset.

Ashis Kumar Panda
14 min read · Nov 26, 2018

Welcome to the seventh episode of Fastdotai, where we will look at the performance of various neural networks on CIFAR-10 images. Before we start, I would like to thank Jeremy Howard and Rachel Thomas for their efforts to democratize AI.

To make the best out of this blog post series, feel free to explore the first part of this series in the following order:

  1. Dog Vs Cat Image Classification
  2. Dog Breed Image Classification
  3. Multi-label Image Classification
  4. Time Series Analysis using Neural Network
  5. NLP- Sentiment Analysis on IMDB Movie Dataset
  6. Basic of Movie Recommendation System
  7. Collaborative Filtering from Scratch
  8. Collaborative Filtering using Neural Network
  9. Writing Philosophy like Nietzsche
  10. Performance of Different Neural Network on Cifar-10 dataset
  11. ML Model to detect the biggest object in an image Part-1
  12. ML Model to detect the biggest object in an image Part-2

Before we start, I would like to mention that lessons 2.1 and 2.2 are prerequisites for this lesson.

Let's get the party started.

  1. IMPORT THE PACKAGES AND DOWNLOAD THE DATA:

Let's import the required packages and download the data. The dataset we will model on is CIFAR-10. Check more details about the data here.

DOWNLOAD THE DATA


%matplotlib inline
%reload_ext autoreload
%autoreload 2
# Download data.
!wget http://pjreddie.com/media/files/cifar.tgz
!tar -xzf cifar.tgz

IMPORT THE PACKAGES AND SET THE PATH

# import libraries
from fastai.conv_learner import *
# Define dataset path
PATH = "/kaggle/working/cifar/"
os.makedirs(PATH,exist_ok=True)
!ls {PATH}/train

There are images of 10 classes in the CIFAR-10 dataset, so the aim of our model is to classify images into these 10 classes.

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
stats = (np.array([ 0.4914 , 0.48216, 0.44653]), np.array([ 0.24703, 0.24349, 0.26159]))

The stats represent the per-channel mean and per-channel standard deviation of all the images: three numbers each, corresponding to the RGB channels. These stats are precomputed and will be used during the transformation steps.
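These values are usually computed once over the training set. A minimal sketch of how they could be derived, assuming a hypothetical NumPy array imgs of shape (N, H, W, 3) holding the training images scaled to [0, 1] (this helper is not part of the original notebook):

import numpy as np

def channel_stats(imgs):
    # imgs: hypothetical (N, H, W, 3) float array of training images scaled to [0, 1]
    mean = imgs.mean(axis=(0, 1, 2))  # one mean per RGB channel
    std = imgs.std(axis=(0, 1, 2))    # one standard deviation per RGB channel
    return mean, std

# For CIFAR-10 this comes out close to the hard-coded stats above.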

def get_data(sz, bs):
    tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz//8)
    return ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)

data = get_data(32, 4)
  • tfms in the code above represents the transformations that will be applied to the images.
  • The stats help in normalizing the images.
  • sz=32 is passed as a parameter via data = get_data(32, 4). It means a random 32 by 32 crop is taken within the image, with pad=sz//8 = 4 pixels of black padding added around it.
  • aug_tfms=[RandomFlip()] specifies the data augmentation to apply to the images.
  • Finally, the model data object is prepared in the fastai format using ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs).
  • Check out the data:
x, y = next(iter(data.trn_dl))
x.size(),y.size()
# Input Data , Labels
  • x.size() is (minibatch, channels, height, width).
  • y.size() gives the labels, one per image. Check out the first image in the minibatch (a quick shape check follows the plot below):
plt.imshow(data.trn_ds.denorm(x)[0]);
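As a quick sanity check on those shapes, here is a small sketch, assuming bs=4 and sz=32 as set by get_data(32, 4) above:

x, y = next(iter(data.trn_dl))
assert x.size() == torch.Size([4, 3, 32, 32])  # minibatch of 4 RGB images, 32x32 each
assert y.size() == torch.Size([4])             # one integer label per image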

2. FULLY CONNECTED MODEL PERFORMANCE ON CIFAR10:

# full model training
bs = 256
data = get_data(32, bs)
lr = 1e-2

class SimpleNet(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(layers[i], layers[i + 1]) for i in range(len(layers) - 1)])

    def forward(self, x):
        x = x.view(x.size(0), -1)
        for l in self.layers:
            l_x = l(x)
            x = F.relu(l_x)
        return F.log_softmax(l_x, dim=-1)

learn = ConvLearner.from_model_data(SimpleNet([32*32*3, 40, 10]), data)
  • super().__init__() :- inherits all the functionality of PyTorch's nn.Module.
  • self.layers = nn.ModuleList(....) :- creates the list of layers so that PyTorch registers them as part of the model.
  • When SimpleNet([32*32*3, 40, 10]) is instantiated, the constructor def __init__(self, layers): is executed automatically.
  • The forward method is executed when data is passed through the model during training.
  • Once the data is passed in, it is flattened using x.view(x.size(0), -1), which preserves the first dimension x.size(0) (the batch size) and collapses all the other dimensions into one. The -1 means "however long the remaining dimensions turn out to be" (see the small flattening sketch after the code below).
  • The data is passed through the linear layers and ReLU, and finally through log_softmax.
  • Instead of calling fit directly, create a learn object from a custom model. The parameters are a model and the data, as shown below.
learn = ConvLearner.from_model_data(SimpleNet([32*32*3, 40,10]), data)
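For intuition on the flattening step, here is a small sketch with a dummy minibatch shaped like the CIFAR-10 input (not part of the original notebook):

import torch

x = torch.randn(4, 3, 32, 32)    # dummy minibatch: 4 images, 3 channels, 32x32
flat = x.view(x.size(0), -1)     # keep the batch dimension, collapse the rest
print(flat.size())               # torch.Size([4, 3072]) because 3 * 32 * 32 = 3072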
  • Let's check out the parameters involved in the neural network.
  • Each layer has its own weights and biases; the total number of parameters is 123,330.
  • Repeat the same old steps of finding the learning rate and training:
  • LR finder:
learn.lr_find()
learn.sched.plot()
%time learn.fit(lr, 2)
%time learn.fit(lr, 2, cycle_len=1)
  • As seen in the snapshot above, this is the accuracy and the time taken to train the model when only a fully connected model is used. The upcoming sections deal with more optimized models which give better accuracy in less time.
  • In SimpleNet there are lots of parameters, but most of them aren't used efficiently: there is a separate weight for every single pixel position. (A quick arithmetic check of the parameter count is sketched after this list.)
  • Hence the accuracy of only 47.24%.
  • For more efficient use of parameters, replace the fully connected model with a convolutional model.
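As a quick sanity check on the 123,330 figure, the count can be reproduced by hand. A sketch (the helper name is made up purely for illustration):

def linear_params(n_in, n_out):
    return n_in * n_out + n_out                  # weights plus biases

total = linear_params(32 * 32 * 3, 40) + linear_params(40, 10)
print(total)                                     # 123330: 122,920 for the first layer, 410 for the second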

3. CONVOLUTIONAL LAYER MODEL PERFORMANCE ON CIFAR10

A fully convolutional model is one where all the layers are convolutional layers except the last one, which is fully connected.

class ConvNet(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)
            for i in range(len(layers) - 1)])
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.out = nn.Linear(layers[-1], c)

    def forward(self, x):
        for l in self.layers: x = F.relu(l(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

learn = ConvLearner.from_model_data(ConvNet([3, 20, 40, 80], 10), data)
learn.summary()
  • self.layers = nn.ModuleList([nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2) for i in range(len(layers) - 1)]) is how the layers in the network are created. layers[i] and layers[i + 1] indicate the number of channels coming in and going out; layers[i + 1] is also the number of filters used.
  • kernel_size=3 is the size of the filters.
  • stride=2 reduces the size of the feature maps (roughly halving each spatial dimension).
  • ConvNet([3, 20, 40, 80], 10) means the first convolutional layer takes an input with 3 channels and produces feature maps with 20 output channels, the second takes 20 channels and produces 40, and the third takes 40 and produces 80. Finally, the last layer, which is fully connected, produces 10 activations.
  • self.pool = nn.AdaptiveMaxPool2d(1) :- in this modern CNN, a (1, 1) adaptive max pool is applied in the penultimate layer. With adaptive pooling you specify how big the output resolution should be, rather than the size of the pooling window.
  • On a (28, 28) input, a (14, 14) adaptive max pool is the same as an ordinary (2, 2) max pool (see the sketch after this list).
  • self.out = nn.Linear(layers[-1], c) :- the final layer is a fully connected layer. It takes the last layer's channel count, layers[-1], as input and produces activations for the c classes.
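The adaptive-pooling claim above can be checked directly with a small sketch (not part of the original notebook):

import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 28, 28)                 # dummy feature map: batch of 1, 8 channels, 28x28
a = F.adaptive_max_pool2d(x, 14)              # ask for a 14x14 output
b = F.max_pool2d(x, kernel_size=2, stride=2)  # ordinary 2x2 max pool with stride 2
print(torch.equal(a, b))                      # True: on a 28x28 input the two are identical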

During training, i.e. in the forward method, we go through:

  • every convolutional layer,
  • performing the convolution followed by the ReLU activation,
  • an adaptive max pool,
  • flattening the output of the adaptive max pool,
  • and a log_softmax on top of that.
  • Next, find the best learning rate:
# end_lr=100 overrides the default end of the learning-rate range. This is done in
# cases where the loss keeps decreasing even beyond the default end_lr.
learn.lr_find(end_lr=100)
learn.sched.plot()

Train the model for a couple of epochs.

%time learn.fit(1e-1, 2)
%time learn.fit(1e-1, 4, cycle_len=1)
  • As seen from the snapshot above, the accuracy has increased to 58.17%.
  • The total number of parameters used in ConvNet is 37,490 with an accuracy of 58.17%, versus 123,330 parameters and 47.24% accuracy for SimpleNet. This shows that ConvNet makes much better use of its parameters than SimpleNet (see the arithmetic check below).
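The 37,490 figure can be reproduced the same way (a sketch; the helper names are made up for illustration):

def conv_params(n_in, n_out, k=3):
    return n_in * n_out * k * k + n_out          # 3x3 kernel weights plus biases

def linear_params(n_in, n_out):
    return n_in * n_out + n_out

total = (conv_params(3, 20) + conv_params(20, 40) + conv_params(40, 80)
         + linear_params(80, 10))
print(total)                                     # 37490, matching the summary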

4. REFACTORED CONVOLUTIONAL LAYER MODEL PERFORMANCE ON CIFAR10.

  • In the refactored code there are two different classes: one for a neural network layer and the other for the neural network model.
# Convolutional neural network layer definition
class ConvLayer(nn.Module):
    def __init__(self, ni, nf):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

    def forward(self, x): return F.relu(self.conv(x))

# Convolutional neural network model definition
class ConvNet2(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.layers = nn.ModuleList([ConvLayer(layers[i], layers[i + 1])
                                     for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)

    def forward(self, x):
        for l in self.layers: x = l(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

learn = ConvLearner.from_model_data(ConvNet2([3, 20, 40, 80], 10), data)

Note:- The layer definition and the model definition follow the same structure: a constructor and a forward function.

%time learn.fit(1e-1, 2)
%time learn.fit(1e-1, 2, cycle_len=1)

After training it for a while, the accuracy doesn't improve much. Hence it is better to go for other approaches, such as batch norm.

When more layers are added, training becomes a problem. If a high learning rate is used, the activation values blow up to NaN, and if a smaller learning rate is used, it takes a long time to converge. Hence the training is not resilient. To make it resilient, use batch norm: it helps in training deeper networks.

Batch norm can be implemented simply with PyTorch's built-in nn.BatchNorm2d(...), but let's implement it from scratch.
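For comparison, a version of the refactored ConvLayer that leans on the built-in nn.BatchNorm2d might look like this. This is only a sketch following the conv/ReLU/batch-norm ordering used in this lesson; the class name is made up and this is not the code used below:

class ConvBnLayerBuiltin(nn.Module):
    # Conv -> ReLU -> BatchNorm, using PyTorch's built-in batch norm
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(nf)

    def forward(self, x):
        return self.bn(F.relu(self.conv(x)))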

5. BATCHNORM LAYERED MODEL PERFORMANCE ON CIFAR10

Before starting with batch norm, let's see why it is needed at all.

Consider Case 1, where the activations are multiplied by the updated weights. Suppose those updated weights form an identity matrix; then the matrix multiplication barely changes the activation values, and everything is fine.

Now consider Case 2, where the activations are again multiplied by the updated weights, but this time those weights form a diagonal matrix of value 2. The output activations double at every layer, which quickly leads to exponential growth. Hence it is necessary to normalize every layer, not just the input; this is known as batch normalization.
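A tiny numeric sketch of Case 2, with hypothetical numbers purely to illustrate the growth:

import torch

w = torch.eye(4) * 2        # a weight matrix that doubles every activation
x = torch.ones(4)           # starting activations
for layer in range(10):
    x = w @ x
print(x)                    # every value is now 2**10 = 1024: exponential growth after only 10 layers

With that in mind, let's check the code of the batch norm layer.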

class BnLayer(nn.Module):
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride,
                              bias=False, padding=1)
        self.a = nn.Parameter(torch.zeros(nf, 1, 1))
        self.m = nn.Parameter(torch.ones(nf, 1, 1))

    def forward(self, x):
        x = F.relu(self.conv(x))
        x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)
        if self.training:
            self.means = x_chan.mean(1)[:, None, None]
            self.stds = x_chan.std(1)[:, None, None]
        return (x - self.means) / self.stds * self.m + self.a
  • In the constructor, all the variables that hold weights are declared: the convolutional filter self.conv, plus self.a and self.m.
  • The forward method calculates the mean and the standard deviation of the activations for each channel.
  • Then the activations are normalized: subtract the per-channel mean and divide by the per-channel standard deviation.
  • One problem is that stochastic gradient descent is pretty bloody-minded. If it decides the weight matrix should have values that increase the activations, it will push it there regardless of the normalization, and the next mini-batch simply undoes the normalization effect.
  • To counter that, two learnable parameters are introduced:
  • self.m = nn.Parameter(torch.ones(nf, 1, 1)) :- a multiplier, initialized to one per channel.
  • self.a = nn.Parameter(torch.zeros(nf, 1, 1)) :- an additive shift, initialized to zero per channel.
  • self.a and self.m give SGD a cheap alternative: these parameters are updated during training, so if SGD wants to scale up every single value in a channel, it only has to scale the single per-channel number in self.m (see the broadcasting sketch after this list).
  • Similarly, to shift everything up or down, it doesn't have to shift the entire weight matrix; it only needs to adjust self.a.
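The (nf, 1, 1) shapes are what make this cheap: self.m and self.a broadcast across height and width, so there is exactly one scale and one shift per channel. A small sketch of that broadcasting (not part of the original notebook):

import torch

x = torch.randn(64, 20, 16, 16)    # activations: batch 64, 20 channels, 16x16
m = torch.ones(20, 1, 1)           # one multiplier per channel
a = torch.zeros(20, 1, 1)          # one additive shift per channel
y = x * m + a                      # broadcasts over batch, height and width
print(y.shape)                     # torch.Size([64, 20, 16, 16])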

BASIC INTUITION FOR BATCH NORMALIZATION:-

  • Here the data is normalized, after which it can be shifted and scaled using very few parameters, instead of having to shift and scale the entire set of convolutional filters.
  • This allows us to increase the learning rate, and it increases the resilience of training, allowing us to add more layers.
  • Batch norm also has a regularization effect, so weight decay and dropout can be removed or reduced when it is used. Each mini-batch has a different mean and standard deviation from the previous one, so these values keep changing, which subtly changes the meaning of the filters. That is a form of noise, and adding noise of any kind regularizes the model.
  • if self.training: means the batch statistics are only recomputed inside the training loop. When the model is applied to the validation set, this flag is False; recomputing the statistics on validation data would change the meaning of the model.
  • In most libraries these means and standard deviations are updated during the training loop regardless of whether the layer is set to be trainable.
  • That turns out to be a terrible idea with a pre-trained network: if the specific values of those batch-norm means and standard deviations are changed, the meaning of the pre-trained layers changes too.
  • In fastai, the means and standard deviations of a layer are not changed if the layer is frozen (pre-trained).
  • If the layer is unfrozen, fastai starts updating these values.
  • If learn.bn_freeze=True is set, these means and standard deviations are never touched.

Note 1:- In this implementation, batch normalization is placed after the ReLU.

Note 2:- In the real version of batch norm, we don't just use the current batch's mean and standard deviation; instead, an exponentially weighted moving average of the mean and standard deviation is maintained.
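A minimal sketch of how such a running estimate could be maintained (the momentum value and helper name are assumptions for illustration, not the lesson's code):

import torch

def update_running(running, batch_value, momentum=0.1):
    # exponentially weighted moving average update
    return (1 - momentum) * running + momentum * batch_value

running_mean = torch.zeros(20)                                 # one running mean per channel
running_std = torch.ones(20)                                   # one running std per channel
batch_mean, batch_std = torch.randn(20), torch.rand(20) + 0.5  # stand-ins for one batch's statistics

running_mean = update_running(running_mean, batch_mean)
running_std = update_running(running_std, batch_std)
# At validation time these running estimates would be used instead of the per-batch values.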

Note 3:- For better results, find a way to apply batch norm, even in an RNN.

What should our approach be when it comes to modelling on images?

Test ideas on a small dataset that has the same kind of properties as the larger dataset, so that the conclusions are likely to carry over, and then verify them. Most people don't have 1 million images anyway, but somewhere between 2,000 and 20,000.

Let's do a deep dive into the complete batch-norm model.

class BnLayer(nn.Module):
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride,
                              bias=False, padding=1)
        self.a = nn.Parameter(torch.zeros(nf, 1, 1))
        self.m = nn.Parameter(torch.ones(nf, 1, 1))

    def forward(self, x):
        x = F.relu(self.conv(x))
        x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)
        if self.training:
            self.means = x_chan.mean(1)[:, None, None]
            self.stds = x_chan.std(1)[:, None, None]
        return (x - self.means) / self.stds * self.m + self.a


class ConvBnNet(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i + 1])
                                     for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)

    def forward(self, x):
        x = self.conv1(x)
        for l in self.layers: x = l(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)

learn = ConvLearner.from_model_data(ConvBnNet([10, 20, 40, 80, 160], 10), data)
  • The BnLayer has been discussed before. ConvBnNet is the neural network architecture that uses BnLayer.
  • self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2) adds a single convolutional layer at the start with a bigger kernel size and stride=1. This is done so that the first layer gives a richer input to the rest of the network, as shown in the snapshot below.
  • In that snapshot, the input image is (32, 32, 3) and the filter size is (5, 5, 3), with stride=1 and padding=2. This results in an output feature map of dimension (32, 32, 10). Convolving with a bigger (5, 5) filter covers a larger portion of the image and extracts more interesting features from it. (A quick shape check is sketched at the end of this section.)
  • Pretty much every state-of-the-art convolutional architecture starts with a single convolutional layer with a large filter such as (5, 5), (7, 7) or (11, 11). This is a good way to create a richer starting point for the sequence of convolutional layers.
  • After this, everything is the same as the ConvNet architecture discussed before.
  • Do the training and check how the accuracy rises after using batch norm.
%time learn.fit(3e-2, 2)
%time learn.fit(1e-1, 4, cycle_len=1)

The accuracy has increased to 70.16%, compared to 58.17% for ConvNet.
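The shape claim for that first 5x5 layer can be verified with a quick check (a sketch, not part of the notebook):

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
x = torch.randn(1, 3, 32, 32)   # one dummy CIFAR-10 sized image
print(conv1(x).shape)           # torch.Size([1, 10, 32, 32]): (32 + 2*2 - 5)/1 + 1 = 32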

6. DEEP BATCHNORM LAYERED MODEL PERFORMANCE ON CIFAR10.

To increase the model's accuracy further, increase the number of layers. More layers means more parameters, and with more parameters the model should be able to learn more and, hopefully, reach better accuracy. Let's check out the code for deep batch norm.

class BnLayer(nn.Module):
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size=kernel_size, stride=stride,
                              bias=False, padding=1)
        self.a = nn.Parameter(torch.zeros(nf, 1, 1))
        self.m = nn.Parameter(torch.ones(nf, 1, 1))

    def forward(self, x):
        x = F.relu(self.conv(x))
        x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)
        if self.training:
            self.means = x_chan.mean(1)[:, None, None]
            self.stds = x_chan.std(1)[:, None, None]
        return (x - self.means) / self.stds * self.m + self.a


class ConvBnNet2(nn.Module):
    def __init__(self, layers, c):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5, stride=1, padding=2)
        self.layers = nn.ModuleList([BnLayer(layers[i], layers[i + 1])
                                     for i in range(len(layers) - 1)])
        self.layers2 = nn.ModuleList([BnLayer(layers[i + 1], layers[i + 1], 1)
                                      for i in range(len(layers) - 1)])
        self.out = nn.Linear(layers[-1], c)

    def forward(self, x):
        x = self.conv1(x)
        for l, l2 in zip(self.layers, self.layers2):
            x = l(x)
            x = l2(x)
        x = F.adaptive_max_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.out(x), dim=-1)
  • The self.layers2 definition and the corresponding loop in forward are the additions that distinguish this deep batch-norm network from the previous batch-norm network.
  • self.layers = nn.ModuleList([BnLayer(layers[i], layers[i + 1]) for i in range(len(layers) - 1)]) :- introduces a stride-2 layer for each stage.
  • self.layers2 = nn.ModuleList([BnLayer(layers[i + 1], layers[i + 1], 1) for i in range(len(layers) - 1)]) :- introduces an additional stride-1 layer after each of those, which keeps the feature-map size unchanged.
  • As they say, "a picture is worth a thousand words", so the whole deep batch-norm architecture is shown in the image below.
  • Train this network for a couple of epochs.
learn = ConvLearner.from_model_data(ConvBnNet2([10, 20, 40, 80, 160], 10), data)
%time learn.fit(1e-2, 2)
%time learn.fit(1e-2, 2, cycle_len=1)

As seen, the accuracy is around 66.4%, which is no better than the plain batch-norm model. Batch norm on its own cannot cope once the network gets this much deeper. To resolve this, ResNet comes into the picture, which will be discussed in the next blog post.

!!! Congratulations on completing another lesson of fast.ai. Well done !!!

If you like it , then ABC (Always be clapping . 👏 👏👏👏👏😃😃😃😃😃😃😃😃😃👏 👏👏👏👏👏)

If you have any questions, feel free to reach out on the fast.ai forums or on Twitter: @ashiskumarpanda. Check out the code here. Also, special thanks to Sharwon Pius for putting up this Kaggle kernel and helping us through this lesson.

P.S. This blog post will be updated and improved as I continue with other lessons. For more interesting stuff, feel free to check out my GitHub account.

To make the best out of this blog post series, feel free to explore the first part of this series in the following order:

  1. Dog Vs Cat Image Classification
  2. Dog Breed Image Classification
  3. Multi-label Image Classification
  4. Time Series Analysis using Neural Network
  5. NLP- Sentiment Analysis on IMDB Movie Dataset
  6. Basic of Movie Recommendation System
  7. Collaborative Filtering from Scratch
  8. Collaborative Filtering using Neural Network
  9. Writing Philosophy like Nietzsche
  10. Performance of Different Neural Network on Cifar-10 dataset
  11. ML Model to detect the biggest object in an image Part-1
  12. ML Model to detect the biggest object in an image Part-2

Edit 1:- TFW Jeremy Howard approves of your post . 💖💖 🙌🙌🙌 💖💖 .


Ashis Kumar Panda
https://www.buymeacoffee.com/AshisPanda | Simplifying tough concepts in the Machine Learning domain, one at a time | Lifelong learner