AlexNet: The First CNN to Win ImageNet

This article is an AlexNet tutorial. It explores the AlexNet architecture, which became one of the most popular CNN architectures.

History of AlexNet

AlexNet was primarily designed by Alex Krizhevsky. It is a Convolutional Neural Network (CNN) and was published with Ilya Sutskever and Krizhevsky’s doctoral advisor, Geoffrey Hinton.

AlexNet shot to fame after competing in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up.

The primary result of the original paper was that the depth of the model was essential for its high performance. This depth made training computationally expensive, but training on GPUs (Graphics Processing Units) made it feasible.

CNN Architectures

Before exploring AlexNet, it is essential to understand what a convolutional neural network is. Convolutional neural networks are a variant of neural networks whose hidden layers consist of convolutional layers, pooling layers, fully connected layers, and normalization layers.

Convolution is the process of applying a filter over an image or signal to modify it. Pooling is a sample-based discretization process. Its main purpose is to reduce the dimensionality of the input, which allows assumptions to be made about the features contained in the binned sub-regions.

A CNN architecture is a stack of distinct layers that transform an input volume into an output volume (e.g. holding the class scores) with the help of differentiable functions.
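To make the two operations concrete, here is a minimal pure-Python sketch. The 5x5 image, the 2x2 edge filter, and the function names are all hypothetical choices made for illustration. As in most CNN libraries, the "convolution" is computed as a sliding dot product without flipping the filter.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size, stride):
    """Keep only the largest value inside each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds to left-to-right intensity increases.
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)       # 4x4 response map
pooled = max_pool2d(feature_map, size=2, stride=2)   # 2x2 summary

print(feature_map)
print(pooled)
```

The filter responds only where the edge is, and pooling keeps that response while shrinking the map from 4x4 to 2x2.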

In other words, one can understand a CNN architecture to be a specific arrangement of the above-mentioned layers. Numerous variations of such arrangements have developed over the years resulting in several CNN architectures. The most common amongst them are:

1. LeNet-5 (1998)

2. AlexNet (2012)

3. ZFNet (2013)

4. GoogLeNet / Inception (2014)

5. VGGNet (2014)

6. ResNet (2015) 

AlexNet Architecture

AlexNet was among the first convolutional networks to use GPUs to boost training performance.

1. The AlexNet architecture consists of 5 convolutional layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers, and 1 softmax output layer (itself a fully connected layer with 1000 outputs).

2. Each convolutional layer consists of convolutional filters followed by the nonlinear activation function ReLU.

3. The pooling layers perform max pooling.

4. The input size is fixed due to the presence of fully connected layers.

5. The input size is usually quoted as 224x224x3, but the arithmetic of the first convolutional layer only works out cleanly for 227x227x3, so implementations either pad the 224x224 image or use 227x227x3 inputs directly.

6. Overall, AlexNet has about 60 million parameters.
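The input-size and parameter-count claims above can be checked with a little arithmetic. The sketch below assumes the layer hyperparameters (filter sizes, strides, padding, channel counts) commonly reported for the original paper; they are not listed in this article, so treat them as assumptions. The 48 and 192 input channels reflect layers that see only the feature maps on their own GPU.

```python
def out_size(in_size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (in_size + 2 * pad - kernel) // stride + 1


# (name, kind, kernel, stride, pad, out_channels, in_channels_per_filter)
# Assumed values, taken from common descriptions of the original paper.
LAYERS = [
    ("conv1", "conv", 11, 4, 0, 96, 3),
    ("pool1", "pool", 3, 2, 0, None, None),
    ("conv2", "conv", 5, 1, 2, 256, 48),
    ("pool2", "pool", 3, 2, 0, None, None),
    ("conv3", "conv", 3, 1, 1, 384, 256),
    ("conv4", "conv", 3, 1, 1, 384, 192),
    ("conv5", "conv", 3, 1, 1, 256, 192),
    ("pool5", "pool", 3, 2, 0, None, None),
]
FC_SIZES = [4096, 4096, 1000]  # two hidden layers plus the softmax output


def trace(input_size=227):
    """Return the spatial size after each layer and the total parameter count."""
    size, channels, params, sizes = input_size, 3, 0, {}
    for name, kind, k, s, p, out_ch, in_ch in LAYERS:
        size = out_size(size, k, s, p)
        if kind == "conv":
            params += k * k * in_ch * out_ch + out_ch  # weights + biases
            channels = out_ch
        sizes[name] = size
    fan_in = size * size * channels
    for width in FC_SIZES:
        params += fan_in * width + width
        fan_in = width
    return sizes, params


sizes, total_params = trace()
print(sizes)
print(total_params)
# An 11x11 filter with stride 4 tiles 227 exactly, but not 224.
print((227 - 11) % 4, (224 - 11) % 4)
```

Under these assumptions the total comes to roughly 61 million parameters, most of them in the first fully connected layer.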

Model Details 

The model that won the competition was tuned with the following specific details:

1. ReLU as the activation function

2. Normalization layers, which are not common anymore

3. Batch size of 128

4. SGD with momentum as the learning algorithm

5. Heavy data augmentation with techniques like flipping, jittering, cropping, color normalization, etc.

6. Ensembling of models to get the best results. 
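As a rough illustration of item 4, the sketch below implements one SGD-with-momentum update on a single scalar weight. The momentum of 0.9 and weight decay of 0.0005 are the values reported in the original paper, not in this article; the learning rate, the toy loss, and the function name are assumptions made for the example.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: the velocity accumulates past gradients, then moves the weight."""
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new


# Toy problem (hypothetical): minimise f(w) = w^2 / 2, whose gradient is w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)

print(w)  # close to the minimum at 0
```

In real training the gradient would be averaged over a mini-batch (128 images here) instead of computed from a toy function.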

AlexNet was trained on GTX 580 GPUs, each with only 3 GB of memory, which was not enough to fit the entire network. So the network was split across 2 GPUs, with half of the neurons (feature maps) on each GPU.

This is the reason one can see a split in the architecture diagram. 

Key Features

Overlapping Max Pooling

Max pooling is used to down-sample an image or a representation. It reduces dimensionality, allowing assumptions to be made about the features contained in the binned sub-regions.

Overlapping max pool layers are similar to ordinary max pool layers, except that the adjacent windows over which the max is calculated overlap each other. The authors of AlexNet used 3×3 pooling windows with a stride of 2 between adjacent windows. Compared with non-overlapping 2×2 windows with a stride of 2, which give the same output dimensions, this overlapping scheme reduced the top-1 error rate by 0.4% and the top-5 error rate by 0.3%.
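The difference between the two pooling schemes is easiest to see in one dimension. The following is a small sketch with a hypothetical row of activations; the 55-wide input used for the size check is an assumption based on the commonly cited size of the first convolutional output.

```python
def pool_output_size(in_size, window, stride):
    """Number of windows that fit along one dimension."""
    return (in_size - window) // stride + 1


def max_pool_1d(values, window, stride):
    """Max over each window; windows overlap whenever stride < window."""
    n = pool_output_size(len(values), window, stride)
    return [max(values[i * stride:i * stride + window]) for i in range(n)]


row = [1, 5, 2, 8, 3, 7, 9]  # hypothetical activations

overlapping = max_pool_1d(row, window=3, stride=2)      # windows share one element
non_overlapping = max_pool_1d(row, window=2, stride=2)  # windows are disjoint

print(overlapping, non_overlapping)
# Both schemes map a 55-wide feature map to the same output width.
print(pool_output_size(55, 3, 2), pool_output_size(55, 2, 2))
```

Both outputs have the same length, but the overlapping windows see every input value at least once, while the 2-wide windows here never look at the final element.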

ReLU Nonlinearity 

By using the ReLU non-linearity, AlexNet showed that deep CNNs can be trained much faster than with saturating activation functions such as tanh or sigmoid. A figure in the original paper shows that with ReLUs (solid curve), a four-layer CNN reaches a 25% training error rate six times faster than an equivalent network that uses tanh (dashed curve). This was tested on the CIFAR-10 dataset.
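One common explanation for this speed-up, offered here as background and not as a claim from the article, is that the gradient of a saturating function shrinks toward zero for large inputs while the ReLU gradient stays at 1 for any positive input. A minimal sketch, with hypothetical sample inputs:

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):  # hypothetical pre-activation values
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

With smaller gradients flowing back through saturated tanh units, each weight update is smaller, which is one plausible reason training takes more iterations.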

Data Augmentation 

Showing a neural net different variations of the same image helps prevent overfitting. It effectively generates additional training data and forces the neural net to learn the key features instead of memorizing individual images.

Data Augmentation by Mirroring

Let’s say we have an image of a cat in our training set. The mirror image is also a valid image of a cat. This means that we can double the size of the training dataset by simply flipping each image about the vertical axis.
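Mirroring is simple enough to sketch in a few lines. Here an image is represented as a hypothetical nested list of pixel values; real pipelines would use an array or image library.

```python
def flip_horizontal(image):
    """Mirror an image about its vertical axis by reversing every row."""
    return [list(reversed(row)) for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]  # hypothetical 2x3 image

augmented = [image, flip_horizontal(image)]  # original plus mirror image
print(augmented[1])
```

Flipping twice returns the original image, so each training image yields exactly one new example this way.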

Data Augmentation by Random Crops

Also, cropping the original image randomly will lead to additional data that is just a shifted version of the original data.

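The authors of AlexNet extracted random crops from inside the 256×256 image boundary, along with their horizontal reflections, and used these as the network’s inputs. The original paper gives the crop size as 224×224 (227×227 under the input-size convention used above). Using this method, they increased the size of the training data by a factor of 2048.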
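The factor of 2048 can be reproduced with simple counting, assuming the 224×224 crop size given in the original paper (an assumption relative to this article, which quotes 227×227) and counting mirrored copies. The sketch below also shows a random crop on a hypothetical 8x8 toy image; the function names are illustrative.

```python
import random


def random_crop(image, crop_size, rng=random):
    """Cut a crop_size x crop_size patch at a random position inside the image."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - crop_size + 1)
    left = rng.randrange(width - crop_size + 1)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def augmentation_factor(image_size, crop_size):
    """Crop offsets per axis, squared, times 2 for horizontal mirroring.

    Counts (image_size - crop_size) offsets per axis, which reproduces the
    quoted figure; counting both end positions would give one more per axis.
    """
    offsets = image_size - crop_size
    return offsets * offsets * 2


toy = [[r * 8 + c for c in range(8)] for r in range(8)]  # hypothetical 8x8 image
crop = random_crop(toy, 6, random.Random(0))

print(len(crop), len(crop[0]))
print(augmentation_factor(256, 224))
print(augmentation_factor(256, 227))
```

With 224×224 crops the count is 32 × 32 × 2 = 2048, whereas 227×227 crops would give 29 × 29 × 2 = 1682, which is why the 224 figure is the one consistent with the quoted factor.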


Dropout

During dropout, each neuron is dropped from the neural network with a probability of 0.5. When a neuron is dropped, it does not contribute to forward propagation or backward propagation. As a result, every input effectively goes through a different neural network architecture. The learned weight parameters are therefore more robust and do not overfit as easily.
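A minimal sketch of this idea on a plain list of activations is shown below. Scaling the outputs by 0.5 at test time follows the scheme described in the original paper; the function names and sample values are hypothetical.

```python
import random


def dropout_train(activations, drop_prob=0.5, rng=random):
    """Training: zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob=0.5):
    """Test: keep every neuron but scale its output by the keep probability."""
    return [a * (1.0 - drop_prob) for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
train_out = dropout_train([1.0] * 1000, rng=rng)
test_out = dropout_test([2.0, 4.0])

print(sum(train_out))  # roughly half of the 1000 units survive
print(test_out)
```

Because a dropped neuron outputs zero, it also receives no gradient for that input, so each training step updates a different thinned network.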


Results

On the 2010 version of the ImageNet challenge data, AlexNet vastly outpaced the best competition entry, with 37.5% top-1 error vs 47.1% and 17.0% top-5 error vs 28.2%. AlexNet was able to recognize off-center objects, and most of its top 5 classes for each image were reasonable. AlexNet won the 2012 competition with a top-5 error rate of 15.3%, compared to the second-place top-5 error rate of 26.2%.
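For readers unfamiliar with the metric, top-k error is the fraction of images whose true label is not among the model's k highest-scoring classes. The sketch below uses hypothetical scores for three images and four classes.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is missing from the top-k predictions."""
    errors = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)


# Hypothetical class scores (one row per image) and true labels.
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.10, 0.30, 0.10],
    [0.25, 0.15, 0.10, 0.50],
]
labels = [1, 2, 0]

print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

Here the top-1 guess is wrong for two of the three images, but the true label is always among the top two, which is why top-5 error is always at most the top-1 error.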

The success of AlexNet is mostly attributed to its ability to leverage GPUs for training, which made it feasible to train such a large number of parameters.

In the following years, there were multiple improvements over AlexNet, resulting in models like VGG, GoogLeNet, and later ResNet.

Fun Facts 

Shortly after winning the ImageNet competition, Alex Krizhevsky and Ilya Sutskever sold their startup, DNNresearch Inc., to Google. Alex worked at Google until 2017, when he left (citing loss of interest) to work at Dessa, advising on and researching new deep learning techniques. Ilya Sutskever left Google in 2015 to become research director of OpenAI.


Contributed by: Vibhor Nigam

Vibhor is currently working as a Sr. Data Scientist at Comcast, developing automated solutions using Machine Learning. He has worked on various projects utilizing AI and ML to build solutions across domains such as network health, customer experience, traffic forecasting and budgeting. Prior to joining Comcast, he pursued graduate studies at the University of Pennsylvania and completed his MSE in Robotics. His areas of interest include Machine Learning, Artificial Intelligence and Programming.
