Computer Vision: A Case Study - Transfer Learning

The conclusion to the series on computer vision discusses the benefits of transfer learning and how anyone can train networks to a reasonable accuracy. Articles and tutorials on the web usually leave out the methods and hacks that improve accuracy. The aim of this article is to collect as much of that information as possible in one place. Stick around till the end to build your own classifier.

Also Read: Computer Vision Series Part I - Computer Vision: Low-level Vision 
And: Computer Vision Series Part II - Computer Vision: Deep Learning Approach 

The ImageNet moment was remarkable in computer vision and deep learning, as it created opportunities for people to reuse the knowledge procured through several hours or days of training on high-end GPUs. The full ImageNet database covers over 20,000 categories of objects, and architectures trained on its 1,000-class challenge subset have reported error rates lower than the estimated human error on that benchmark. How do we use this knowledge that scientists across the globe have gathered? The solution is transfer learning. Just as a teacher builds class 8 mathematics upon concepts learnt in classes 1-7, we can adapt existing knowledge to suit our own needs. In this article, we will discuss transfer learning and some common hacks that help increase the accuracy of the outputs.

We will take an experimental approach with data, hyper-parameters and loss functions. Through the process of experimentation, we will discover the various techniques, concepts and hacks that are helpful during transfer learning. We will work with the Food-101 dataset, which comprises 101 classes of food with 1,000 images per class.

We performed a series of experiments at every step of training to identify the loss and hyper-parameters that give the best results. The role of experimentation is to find out what works best for the dataset at hand, because not all datasets share the same features and type of data. A common approach is to split the dataset into training, validation, and test sets. The model is trained on the training set and then evaluated on the validation set to check that it is neither overfitting nor underfitting. Only once we have a good score on both the training and validation sets do we expose the model to the test set. Thus, the validation set can be thought of as the part of the dataset that is used to find the conditions that give the best performance.
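The split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `split_dataset`, the 80/10/10 proportions and the file names are assumptions made for the example, not the split used in our experiments.

```python
import random

def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle items and split them into train, validation and test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Hypothetical file names standing in for the images of one class.
files = [f"img_{i:04d}.jpg" for i in range(1000)]
train, val, test = split_dataset(files)
print(len(train), len(val), len(test))
```

Splitting each class separately in this way keeps the class balance the same in all three sets.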

Before we understand the parameters that need to be adjusted, let’s dive deep into transfer learning. Revise your concepts with Introduction to Transfer Learning.

What are the types of transfer learning?

  1. Freeze Convolutional Base Model
  2. Train selected top layers in the base model
  3. Combination of types 1 and 2.

The convolutional base model refers to the original model architecture that we will use. The choice is between fine-tuning the entire model starting from its pre-trained weights, or freezing part (or all) of it. In the first case, the initial weights are the model’s pre-trained weights, and we fine-tune all the layers on our dataset. In the latter case, the initial weights are again the pre-trained weights, but the initial layers of the model are frozen. Freezing a layer means that its weights are not updated during training, which reduces the number of trainable parameters. We freeze the initial layers because they identify low-level features such as edges and corners, which are largely independent of the dataset.
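To make the effect of freezing concrete, here is a toy sketch of the bookkeeping involved. It does not use Keras: `ToyLayer`, the layer names and the parameter counts are all made up for illustration, and only the idea (frozen layers drop out of the trainable parameter count) carries over to a real model.

```python
from dataclasses import dataclass

@dataclass
class ToyLayer:
    name: str
    num_params: int
    trainable: bool = True

def freeze_first(layers, n_frozen):
    """Freeze the first n_frozen layers and leave the remaining ones trainable."""
    for i, layer in enumerate(layers):
        layer.trainable = i >= n_frozen

def count_trainable(layers):
    """Total number of parameters that would be updated during training."""
    return sum(layer.num_params for layer in layers if layer.trainable)

# Made-up parameter counts for a tiny stand-in network.
layers = [
    ToyLayer("conv1", 1000),
    ToyLayer("conv2", 5000),
    ToyLayer("conv3", 20000),
    ToyLayer("dense_head", 13000),
]

total = count_trainable(layers)      # every layer trainable
freeze_first(layers, 3)              # freeze the convolutional base
head_only = count_trainable(layers)  # only the new head is trained
print(total, head_only)
```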

What are some of the parameters that need to be adjusted to ensure optimal performance?

  1. Learning Rate
  2. Model Architecture
  3. Type of transfer learning
  4. Optimisation technique
  5. Regularisation 

We will consider a variety of experiments regarding the choice of optimiser, learning rate values, etc. We encourage readers to think of more ways to understand and implement these. The experiments that have been performed are as follows:

1. Choice of optimiser

  1. SGD with momentum update
  2. SGD with Nesterov Momentum update
  3. Adam

2. Learning Rate Scheduling

  1. Same learning rate
  2. Step Decay
  3. Polynomial Decay (works well initially)
  4. Cyclical Learning Rate (used this finally)

3. Model Selection

  1. ResNet50 – Tried, but each epoch took a very long time, so we did not proceed further
  2. InceptionV3 – Settled on this model and decreased the image size to 96*96*3

4. Transfer Learning Type

  1. Freeze Convolutional Base Model
  2. Train selected top layers in the base model
  3. Combination of types 1 and 2 (this worked well in increasing validation accuracy)

5. Number of neurons and Dropout values

  1. 128 neurons, dropout probability 0.5
  2. 128 neurons, dropout probability 0.25 (used this combination, as the others increased the number of parameters massively)
  3. 256 neurons, dropout probability 0.25
  4. 256 neurons, dropout probability 0.5
  5. 512 neurons, dropout probability 0.5
  6. 512 neurons, dropout probability 0.25

6. GlobalAveragePooling2D vs GlobalMaxPooling2D

In our experiments, GlobalMaxPooling2D worked better as a regulariser and also improved training accuracy when compared to GlobalAveragePooling2D. We compared the two pooling techniques to study the role of pooling as a regulariser.
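The difference between the two pooling layers is easiest to see on a tiny feature map. The sketch below is plain Python rather than Keras, and the helper names and the 2x2x2 feature map are assumptions made for the example. Both functions reduce a height x width x channels feature map to one value per channel: the maximum keeps only the strongest activation, whereas the average mixes in every position.

```python
def global_max_pool(feature_map):
    """feature_map is a list of rows; each row is a list of pixels; each pixel is a list of channel values."""
    channels = len(feature_map[0][0])
    return [max(px[c] for row in feature_map for px in row) for c in range(channels)]

def global_avg_pool(feature_map):
    """Average every channel over all spatial positions."""
    channels = len(feature_map[0][0])
    n_positions = sum(len(row) for row in feature_map)
    return [sum(px[c] for row in feature_map for px in row) / n_positions
            for c in range(channels)]

# A 2x2 feature map with 2 channels.
fmap = [
    [[1.0, 0.0], [2.0, 4.0]],
    [[3.0, 0.0], [6.0, 0.0]],
]

print(global_max_pool(fmap))
print(global_avg_pool(fmap))
```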

Before starting a project, we should come up with an outline of the project deliverables and outcomes expected. Based on the conclusions made, list out the possible logical steps needed to be taken to complete the task. 

What are the steps to be followed for training?

  1. Define a model
  2. Find ideal initial learning rate
  3. Create a module for scheduling the learning rate
  4. Augment the Images
  5. Apply the transformation(mean subtraction) for better fine-tuning
  6. Test on a smaller set
  7. Fit the model
  8. Test the model on random images
  9. Visualise the kernels to validate if the training has been successful.


We will begin coding right away. We suggest you open your text editor or IDE and start coding as you read the blog. You can download the dataset from the official website, which can be found via a simple Google search: Food-101 dataset.

1. import keras.backend as K
2. from keras import regularizers
3. from keras.applications.inception_v3 import InceptionV3
4. from keras.models import Model
5. from keras.layers import Dense, Dropout, Flatten
6. from keras.layers import GlobalMaxPooling2D
7. from keras.preprocessing.image import ImageDataGenerator, img_to_array
8. from keras.callbacks import ModelCheckpoint, CSVLogger
9. from keras.regularizers import l2
10. import keras 
11. import numpy as np
12. from imutils import paths
13. import cv2
14. import os
15. import keract

16. from keras.preprocessing import image
17. from keras.callbacks import LearningRateScheduler, EarlyStopping

18. from keras.optimizers import SGD, Adam
19. from sklearn.preprocessing import LabelBinarizer
20. from sklearn.metrics import classification_report

21. from lrs import config
22. from lrs.learningratefinder import LearningRateFinder
23. from lrs.clr_callback import CyclicLR

24. import matplotlib.pyplot as plt

25. import tensorflow as tf

26. import matplotlib.pyplot as plt
27. import matplotlib.image as img
28. import numpy as np
29. from collections import defaultdict
30. import collections
31. import os
32. from IPython.display import Image

In lines 1-32, we import all the libraries that will be required.

33. #Define all parameters used in the notebook here
34. NUM_EPOCHS = 100
35. INIT_LR = 5e-2
36. img_width, img_height = 96, 96
37. batch_size = 64

In lines 33-37, we define the parameters that will be used frequently within the article.

38. inception = InceptionV3(weights= 'imagenet' , include_top=False)
39. out = inception.output
40. out = GlobalMaxPooling2D()(out)
41. out = Dense(128,activation='relu')(out)
42. out = Dropout(rate = 0.3)(out)
43. predictions = Dense(101,kernel_regularizer=regularizers.l2(0), activation='softmax')(out)

44. model = Model(inputs=inception.input, outputs=predictions)
45. model.load_weights('best_model6.hdf5')
46. #experiment resulted in SGD being the better optimizer for this dataset
47. opt1=SGD(lr=INIT_LR, momentum=0.9,nesterov=False)
48. opt=Adam(lr=INIT_LR,beta_1=0.9,beta_2=0.999,epsilon=1e-07)
49. for layer in inception.layers:
50.     layer.trainable= False
51. model.compile(optimizer=opt1, loss='categorical_crossentropy',metrics=['accuracy'])

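Line 38 loads the InceptionV3 model with ImageNet weights. Setting include_top=False excludes the final classification layers, since the original model predicts 1000 classes and we only have 101. Lines 39-43 add a new head (global max pooling, a 128-unit dense layer, dropout and a 101-way softmax), lines 49-50 freeze every layer of the convolutional base, and line 51 compiles the model with SGD.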

Augment the Images

52. train_datagen = ImageDataGenerator(rescale = 1.0/255.0, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True, fill_mode="nearest", width_shift_range=0.3,
height_shift_range=0.3, rotation_range=30, samplewise_center=True)

53. test_datagen = ImageDataGenerator( rescale = 1.0/255.0)

54. val_datagen = ImageDataGenerator( rescale = 1.0/255.0)

#for mean subtraction, in RGB order, let's set the means values

55. mean = np.array([123.68, 116.779, 103.939], dtype="float32")
56. train_datagen.mean = mean
57. val_datagen.mean = mean

58. training_set = train_datagen.flow_from_directory('./train', target_size=(img_height, img_width), batch_size=batch_size, class_mode='categorical') # set as training data

59. validation_set = val_datagen.flow_from_directory('./val', target_size=(img_height, img_width), batch_size=batch_size, class_mode='categorical')
60. # set as validation data

61. test_set = test_datagen.flow_from_directory('./test', target_size = (img_height, img_width), batch_size = batch_size, class_mode = 'categorical')

Line 52 creates an ImageDataGenerator object, which loads images directly from a directory and applies a set of operations to each of them. The first is normalisation, specified by the argument rescale = 1.0/255.0. The remaining arguments (shear, zoom, shifts, rotation and horizontal flip) perform augmentation. Augmentation is needed because CNNs are not inherently invariant to transformations such as rotation or scaling. If we rotate an image and send it to the network for prediction, the chances of misclassification are high, as the network has not seen such variations during the training phase. Hence, augmentation leads to better generalisation.
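To get a feel for what two of these augmentations do to the pixels, here is a minimal plain-Python sketch of a horizontal flip and of a width shift that fills the gap by repeating the nearest edge pixel (the idea behind fill_mode="nearest"). The helper names and the tiny 2x3 image are assumptions for illustration; Keras applies such transformations randomly and on real image arrays.

```python
def horizontal_flip(image):
    """Mirror each row of a 2-D image given as a list of rows."""
    return [list(reversed(row)) for row in image]

def shift_right(image, k):
    """Shift every row k pixels to the right (left if k is negative).

    Positions that fall outside the original image are filled with the
    nearest edge pixel.
    """
    width = len(image[0])
    return [
        [row[max(0, min(width - 1, x - k))] for x in range(width)]
        for row in image
    ]

image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = horizontal_flip(image)
shifted = shift_right(image, 1)
print(flipped)
print(shifted)
```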

Lines 53 and 54 similarly create ImageDataGenerator objects for loading images from the test and validation directories, respectively.

In lines 55-57, we specify the per-channel mean that is used for pre-processing the images. Mean subtraction centres the inputs, which generally helps the model learn better. In lines 58-61, we load the data into the respective variables. The next step is to find the ideal learning rate.
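Mean subtraction itself is a one-line operation per pixel, sketched below in plain Python. The sketch assumes pixel values on the 0-255 scale, since the means above are expressed on that scale; the helper name `subtract_mean` and the two-pixel image are made up for the example.

```python
# Per-channel means in RGB order, as used in the article.
IMAGENET_MEAN_RGB = (123.68, 116.779, 103.939)

def subtract_mean(image, mean=IMAGENET_MEAN_RGB):
    """Subtract the per-channel mean from every (R, G, B) pixel.

    image is a list of rows, each row a list of (R, G, B) tuples on the
    0-255 scale (an assumption of this sketch).
    """
    return [
        [tuple(value - m for value, m in zip(pixel, mean)) for pixel in row]
        for row in image
    ]

image = [[(123.68, 116.779, 103.939), (255.0, 255.0, 255.0)]]
centred = subtract_mean(image)
print(centred)
```

Two points are worth verifying in your own pipeline. The means are on the 0-255 scale, so they do not match images that have already been rescaled by 1/255. Also, as far as we are aware, Keras' ImageDataGenerator only applies its mean attribute when featurewise_center=True, so it is worth checking that the subtraction actually takes place.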

Transfer Learning: Type 1

Let’s find the initial learning rate 

62. print("Finding learning rate...")
63. lrf = LearningRateFinder(model)
64. lrf.find(validation_set,1e-10, 1e+1,stepsPerEpoch=validation_set.samples // float(config.BATCH_SIZE), batchSize=config.BATCH_SIZE)

# plot the loss for the various learning rates and save the resulting plot to disk

# gracefully exit the script so we can adjust our learning rates
# in the config and then train the network for our full set of
# epochs

65. print("Learning rate finder complete")
66. print("Examine plot and adjust learning rates before training")

67. INIT_LR = 5e-2
68. NUM_EPOCHS = 50
69. stepSize = config.STEP_SIZE * (validation_set.samples // batch_size)

70. # callbacks
71. def poly_decay(epoch):
72.    maxEpochs=400
73.    baseLR = 0.05
74.    power= 1.0
75.    alpha= baseLR *(1- (epoch / float(maxEpochs)))**power
76.    print(alpha)
77.    return alpha

78. def poly_decay1(epoch):
79.    if(epoch

Model checkpoint refers to saving the model after each epoch of training.

90. cp_sanity = ModelCheckpoint(filepath='saved_model_sanity.hdf5', verbose=1, save_best_only=False)
91. lr = LearningRateScheduler(poly_decay)
92. es = EarlyStopping(monitor='val_loss', min_delta=0, patience=2, verbose=0, mode='auto', baseline=None, restore_best_weights=False)

Early stopping is a technique to stop training when the monitored loss (here val_loss) stops improving. We wait for a certain patience period (2 epochs here), and if the loss has not decreased in that time, we stop the training process.
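The patience logic can be sketched in plain Python as follows. This is an illustration of the idea, not the Keras implementation; the function name and the list of losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # The loss improved: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# The loss improves until epoch 2 and then fails to improve for two epochs.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.6]
stop_at = early_stopping_epoch(losses, patience=2)
print(stop_at)
```

Note that with such a short patience the run stops before the later improvement to 0.6 is ever seen, which is the usual trade-off when choosing the patience value.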

93. log = CSVLogger("logfile.log")


The above snippet of code deals with the learning rate scheduling. Let’s talk about Learning Rate Scheduling:

Learning Rate Scheduling

Learning rate scheduling refers to changing the learning rate over the course of training. Usually, the loss decreases until a certain epoch and then stagnates. A common reason is that the learning rate at that point is comparatively large, so the optimiser keeps overshooting and cannot settle into a minimum. Hence, the learning rate needs to be decreased. This tuning of the learning rate is necessary to get the lowest error percentage.

We have experimented with three types of learning rate scheduling techniques:

  1. Polynomial decay
  2. Step decay
  3. Cyclical learning rate scheduler

Polynomial decay, as the name suggests, decays the learning rate (or step size) polynomially with the epoch number, whereas step decay drops it by a fixed factor at regular intervals. A cyclical learning rate scheduler varies the learning rate between a minimum and a maximum value during training, which helps the optimiser escape poor local minima. Cost functions are usually non-convex, and it is desirable to get as close as possible to the global minimum.
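The first two schedules can be written as small functions of the epoch number. The polynomial decay below follows the same formula as the poly_decay function in the listing; the step decay is a sketch in which the factor (0.5) and the interval (every 10 epochs) are assumed values chosen for illustration.

```python
import math

def poly_decay(epoch, base_lr=0.05, max_epochs=400, power=1.0):
    """Polynomial decay; with power=1.0 the learning rate falls linearly to zero."""
    return base_lr * (1 - epoch / float(max_epochs)) ** power

def step_decay(epoch, base_lr=0.05, factor=0.5, drop_every=10):
    """Multiply the learning rate by factor once every drop_every epochs."""
    return base_lr * factor ** math.floor(epoch / drop_every)

for epoch in (0, 10, 25, 200):
    print(epoch, poly_decay(epoch), step_decay(epoch))
```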

We perform the same in Lines 62-88. To find the initial learning rate, we have used Adrian Rosebrock’s module from his tutorial on learning rate scheduling. For further insights into the topic, we suggest going through his blog on the same. 
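The general idea behind a learning rate finder is to sweep the learning rate exponentially from a very small to a very large value, record the loss at each step, and pick a rate from the region where the loss falls fastest. The sketch below illustrates that idea only; it is not the code of the module used above. The function names, the selection heuristic and the loss values are assumptions made for the example.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def steepest_drop_lr(lrs, losses):
    """Return the learning rate at the start of the largest loss decrease (a simple heuristic)."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return lrs[best]

# Sweep the same range as in the listing: 1e-10 to 1e+1.
lrs = lr_sweep(1e-10, 1e1, 12)

# Made-up losses: flat at first, then falling, then diverging.
demo_lrs = lr_sweep(1e-6, 1e1, 8)
demo_losses = [4.6, 4.6, 4.5, 4.0, 3.0, 2.8, 5.0, 9.0]
chosen = steepest_drop_lr(demo_lrs, demo_losses)
print(chosen)
```

In practice, the loss curve is usually smoothed before any rate is read off it, and the final choice is made by inspecting the plot.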

Sanity Checks:

Overfit a tiny subset of the data to make sure the model is able to fit it, and check as a safety metric that the loss at the start of training is around -ln(1/n), the cross-entropy of a uniform guess over n classes. In this case n = 101, hence the initial loss should be about 4.62.
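The expected initial loss is easy to compute. The short sketch below (the helper names are ours) evaluates -ln(1/n) for n = 101 and confirms that it equals the categorical cross-entropy of a prediction that assigns the same probability to every class.

```python
import math

def uniform_guess_loss(num_classes):
    """Cross-entropy of a uniform guess over num_classes classes: -ln(1/n) = ln(n)."""
    return -math.log(1.0 / num_classes)

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for a single example with a one-hot target."""
    return -math.log(probs[true_index])

n = 101
expected = uniform_guess_loss(n)
uniform = [1.0 / n] * n

print(round(expected, 4))
print(round(cross_entropy(uniform, true_index=7), 4))
```

An initial loss far above this value usually points to a problem such as badly scaled inputs or a wrong label encoding.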

history = model.fit_generator(validation_set, steps_per_epoch = validation_set.samples // batch_size,                             epochs = 400,verbose=1, callbacks =[LearningRateScheduler(poly_decay),cp_sanity])

94. print("Evaluation of network...")
95. #predictions = model.predict(testX, batch_size=32)
96. #print(classification_report(testY.argmax(axis=1),predictions.argmax(axis=1), target_names=lb.classes_))

# plot the training loss and accuracy
97. N = 400
99. plt.figure()
100. plt.plot(np.arange(0, N), history.history["loss"], label="train_loss")
101. plt.plot(np.arange(0, N), history.history["acc"], label="train_acc")
102. plt.title("Training Loss and Accuracy on Dataset")
103. plt.xlabel("Epoch #")
104. plt.ylabel("Loss/Accuracy")
105. plt.legend(loc="lower left")

Since the loss on this small subset (here, the validation set) drops to nearly zero without any regularisation method, the model has enough capacity and is suitable to be fitted to the larger dataset. Overfitting is likely to occur in the latter case, which can be mitigated by the use of dropout and regularisers in the ultimate and penultimate layers.

Transfer Learning: Type 2

106. for layer in model.layers[:249]:
107.   layer.trainable = False
108. for layer in model.layers[249:]:
109.   layer.trainable = True

As mentioned earlier, we freeze the earlier layers (here the first 249) so that there are fewer trainable parameters.

# we need to re-compile the model for these modifications to take effect
# we use SGD with a low learning rate
#from keras.optimizers import SGD

110. opt1=SGD(lr=INIT_LR, momentum=0.9,nesterov=False)
111. opt=Adam(lr=INIT_LR,beta_1=0.9,beta_2=0.999,epsilon=1e-07)
112. model.load_weights("saved_model.hdf5")
113. model.compile(optimizer=opt1, loss='categorical_crossentropy',metrics=['accuracy'])

114. INIT_LR = 0.018
115. NUM_EPOCHS=100

116. def poly_decay1(epoch):
117.    maxEpochs=NUM_EPOCHS
118.    baseLR = INIT_LR
119.    power= 1.0

120.    alpha= baseLR *(1- (epoch / float(maxEpochs)))**power
121.    print(alpha)
122.    return alpha
123. #LearningRateScheduler(poly_decay) works
124. history = model.fit_generator(training_set, steps_per_epoch = training_set.samples // batch_size,  validation_data=validation_set,                   validation_steps=validation_set.samples // batch_size, epochs=100,                    
verbose=1, callbacks=[LearningRateScheduler(poly_decay1),log,cp])

fit_generator trains the model on the batches produced by the training generator and evaluates it on the validation generator at the end of each epoch.

125. N = 100
127. plt.figure()
128. plt.plot(np.arange(0, N), history.history["loss"], label="train_loss")
129. plt.plot(np.arange(0, N), history.history["val_loss"], label="val_loss")
130. plt.plot(np.arange(0, N), history.history["acc"], label="train_acc")
131. plt.plot(np.arange(0, N), history.history["val_acc"], label="val_acc")

132. plt.title("Training Loss and Accuracy on Dataset")
133. plt.xlabel("Epoch #")
134. plt.ylabel("Loss/Accuracy")
135. plt.legend(loc="lower left")

In lines 106-124, we freeze the first 249 layers, re-compile the model and train it again. Lines 125-135 plot the training and validation curves so that we can check whether the model is overfitting.

The figure shows that the training accuracy is high, whereas the validation accuracy is low. Thus, applying regularisation techniques is necessary to avoid overfitting. We apply dropout to manage the same. 

Transfer Learning: Type 3

Type 3 refers to the combination of both types of transfer learning: initially fine-tuning the network for a few epochs, and then freezing most of its layers (the first 290 in the code below) and training only the remaining top layers for the next N epochs.

136. inception = InceptionV3(weights= None , include_top=False)
137. out = inception.output
138. out = GlobalMaxPooling2D()(out)
139. out = Dropout(rate = 0.25)(out)
140. out = Dense(128,activation='relu')(out)
141. out = Dropout(rate = 0.3)(out)

142. predictions = Dense(101, activation='softmax')(out)

143. model = Model(inputs=inception.input, outputs=predictions)
144. model.load_weights('saved_model.hdf5')
145. #experiment resulted in SGD being the better optimizer for this dataset
146. opt1=SGD(lr=INIT_LR, momentum=0.8,nesterov=False)
147. opt = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.01, amsgrad=False)

148. for layer in model.layers[:290]:
149.   layer.trainable = False
150. for layer in model.layers[290:]:
151.   layer.trainable = True

152. model.compile(optimizer=opt1, loss='categorical_crossentropy',metrics=['accuracy'])

Cyclical Learning Rate

During training, the validation loss did not decrease irrespective of the initial learning rate. A plausible explanation is that the optimiser was stuck in a local minimum of the cost function. To get it out of there, we used a cyclical learning rate, which performed much better than the earlier schedules.
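A cyclical schedule can be sketched as a function of the training iteration. The version below implements the basic triangular policy, in which the rate rises linearly from base_lr to max_lr over step_size iterations and falls back over the next step_size. We assume the triangular policy here for illustration; the policy actually used is whatever config.CLR_METHOD selects, and the step size of 100 iterations is a made-up value.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate at a given training iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x runs from 1 down to 0 and back up to 1 within each cycle.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Same bounds as in the listing; the step size is hypothetical.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_clr(it, base_lr=9e-4, max_lr=5e-3, step_size=100))
```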

153. INIT_LR = 6.75e-4
154. NUM_EPOCHS=30
155. stepSize = config.STEP_SIZE * (training_set.samples // batch_size)
156. clr = CyclicLR(mode=config.CLR_METHOD,base_lr=9e-4, max_lr=5e-3,step_size=stepSize)

157. def poly_decay1(epoch):
158.    maxEpochs=NUM_EPOCHS
159.    baseLR = INIT_LR
160.    power= 1.0

161.    alpha= baseLR *(1- (epoch / float(maxEpochs)))**power
162.    print(alpha)
163.    return alpha
164. #LearningRateScheduler(poly_decay) works

165. history = model.fit_generator(training_set,
166.                    steps_per_epoch = training_set.samples // batch_size,
167.                    validation_data=validation_set,
168.                    validation_steps=validation_set.samples // batch_size,
169.                    epochs= NUM_EPOCHS,
170.                    verbose=1,
171.                    callbacks=[clr,cp])
172. #LearningRateScheduler(poly_decay1)

173. # plot the learning rate history
174. N = np.arange(0, len(clr.history["lr"]))
175. plt.figure()
176. plt.plot(N, clr.history["lr"])
177. plt.title("Cyclical Learning Rate (CLR)")
178. plt.xlabel("Training Iterations")
179. plt.ylabel("Learning Rate")


Type of Transfer Learning Used

  1. Type 1: 180 epochs, accuracy 58.07
  2. Type 2: 100 epochs, accuracy 58.62
  3. Type 3: 150 epochs, accuracy 58.05

Thus, Type 2 reached the highest accuracy in the fewest epochs, although the margin over the other two types is small. It is the most suitable type of transfer learning for this problem.

Optimiser Used

SGD with momentum update

Learning Rate Scheduling

  1. Polynomial Decay (works well initially)
  2. Cyclical Learning Rate (used this finally)

Model Selection

InceptionV3 – Used this model and decreased image size to 96*96*3

Transfer Learning Type

Combining Type 1 and Type 2 transfer learning increased the validation accuracy. One way to experiment with this is to train the model with Type 1 for 50 epochs and then re-train with Type 2 transfer learning.

Number of neurons and Dropout values

128 neurons with a dropout probability of 0.25. We used this combination, as the others increased the number of parameters massively.

An additional experiment that readers can try is adding noise to the images during the data augmentation phase to make the model robust to noise.

We suggest that readers go through the entire article at least twice to get a thorough understanding of deep learning and computer vision, and of the way they are implemented and used. We can go a step further and visualise the kernels to understand what is happening at a basic level. How are networks learning? A common rule of thumb is that kernels look smooth when the network has learned the classification well, and noisy or blurry when training has gone wrong. We suggest readers figure out ways to visualise the kernels, as it adds credibility to the results.

Please go through the entire series once, and then come back to this article, as it surely will get you a head start in computer vision, and we hope you gain the ability to understand and comprehend research papers in computer vision. 

