ProjectBlog: 9 Deep Learning Papers That You Must Know — Part 1


AlexNet — The paper that changed how we perform deep learning

LSVRC a.k.a. Large Scale Visual Recognition Challenge is a competition where research teams evaluate their algorithms on a huge dataset of labelled images (ImageNet), and compete to achieve higher accuracy on several visual recognition tasks.

AlexNet’s results for nearest neighbours. Leftmost column is the test image. Remaining columns are AlexNet’s guesses for nearest neighbours.

This competition has been held every year since 2010. AlexNet is the name of the convolutional neural network that won the competition in 2012. It was designed by Alex Krizhevsky, Ilya Sutskever and Krizhevsky's PhD advisor Geoffrey Hinton. Hinton, a co-recipient of this year's $1M Turing Award, was reportedly skeptical of his student's idea at first.

The popularity of this paper can be seen from its citation count alone:

38K Citations

Top 7 cool things about this paper

  1. Depth — Layers : 8 (5 Convolutional + 3 Fully Connected), Parameters : 60 Million, Neurons : 650,000
  2. Activation Function — Non-Linearity Used : ReLU instead of TanH
  3. Speed — GPUs : 2, Training Time : 6 days
  4. Contrast — Response Normalization
  5. Overfitting prevention — Data augmentation + Dropout instead of regularisation
  6. DATASET (ImageNet) — 1.2 million training images, 50,000 validation images, and 150,000 testing images.
  7. Huge winning margin — Test Error Rate of 15.3% vs 26.2% (second place)

Let’s understand these cool things in detail. The paper employed a number of techniques that were unusual relative to the state of the art at the time. Let’s take a look at the differentiating features of this paper.

Layers — Depth is ‘uber’ important

The architecture contains eight layers with weights. The first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Softmax exponentiates each of the 1000 scores and divides by their sum, so the scores become probabilities that add up to 1, with the largest score receiving the highest probability.

Source Sik-Ho Tsang’s article
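The softmax step described above is easy to sketch in a few lines of NumPy (the logit values here are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # made-up logits for a 3-class example
probs = softmax(scores)
# probs sums to 1; the largest logit gets the largest probability,
# but the other classes keep non-zero probability mass.
```

Note that this is a proper probability distribution, not a hard one-hot choice: every class keeps some probability, which is what allows the network to be trained with a cross-entropy loss.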

Connections between layers — Notice how right after the 1st layer we have two parallel paths that are exactly the same. Each path processes a part of the data in parallel. Yet some computations are shared: the second, fourth, and fifth convolutional layers connect only to the preceding layer on the same path, while the third convolutional layer is connected to both paths of the second layer.

This cross-connection scheme was a neat trick which reduced their top-1 and top-5 error rates by 1.7% and 1.2%, respectively. This was huge given that they were already ahead of the state of the art.

The depth of this network (number of layers) is so critical that removing any of the middle layers noticeably degrades the accuracy.

Discussion Section of the paper.


Activation Function — ReLU

AlexNet is a Convolutional Neural Network, and a neural network is made up of neurons. Biologically inspired neural networks possess something called an activation function. In simple terms, the activation function decides whether the input stimulus is enough for a neuron to fire, i.e. get activated.

The AlexNet team chose a non-linear activation function, the Rectified Linear Unit (ReLU). They claimed that it trained much faster than tanh, the more popular choice of non-linearity at the time.

Why do we need a non-linear activation function in an artificial neural network?

Neural networks are used to implement complex functions, and non-linear activation functions enable them to approximate arbitrarily complex functions. Without the non-linearity introduced by the activation function, multiple layers of a neural network are equivalent to a single layer neural network.
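This collapse is easy to see numerically. A minimal sketch with made-up weight matrices: without an activation, two layers reduce to one matrix product, while a ReLU in between breaks that equivalence:

```python
import numpy as np

x = np.array([1.0, -2.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
W2 = np.array([[1.0, 1.0],
               [0.0, 1.0]])

# Two linear layers with no activation in between collapse into a single
# linear layer whose weight matrix is W2 @ W1.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x  # identical result

# A ReLU between the layers breaks the equivalence: the network can no
# longer be written as a single matrix multiplication.
relu = lambda v: np.maximum(v, 0.0)
with_relu = W2 @ relu(W1 @ x)
```

However many purely linear layers you stack, the result is always equivalent to one linear layer; the non-linearity is what buys the extra expressive power.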

The authors showed that ReLU reached a low error rate much faster than tanh. The popular plot is shown below, with the thick line representing ReLU and the dashed line the tanh function.

Plot from the AlexNet Paper

The keyword is faster. Above all, AlexNet needed faster training, and ReLU helped. But they needed something more: something that could transform the speed at which CNNs were computed. This is where GPUs came in.

GPUs and Training Time

GPUs are devices that can perform parallel computations. Remember how an average laptop is either a quad-core (4 cores) or an octa-core (8 cores)? This refers to the number of computations that can happen in parallel on the processor. A GPU can have thousands of cores, allowing massive parallelization. AlexNet made use of a GPU (NVIDIA's GTX 580) that NVIDIA had launched shortly before AlexNet came out.

The noticeable thing was that AlexNet made use of 2 GPUs in parallel, which made their design extremely fast.

Even with this setup, AlexNet took 6 days to train. But training time was not the only concern. Accuracy suffers if the responses inside the network are not normalized, so AlexNet needed an efficient way of normalizing them. They chose LRN.

Local Response Normalization

In neurobiology, there is a concept called “lateral inhibition”. This refers to the capacity of an excited neuron to subdue its neighbors. The neuron does that to increase the contrast in its surroundings, thereby increasing the sensory perception for that particular area. Local response normalization (LRN) is the computer science way of achieving the same thing.

AlexNet employed LRN to aid generalization. Response normalization reduced their top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
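The paper's LRN scheme divides each activation by a term built from the squared activations of nearby channels at the same pixel. A minimal NumPy sketch of that formula, using the hyperparameter values reported in the paper (k = 2, n = 5, α = 1e-4, β = 0.75); the activation values are made up:

```python
import numpy as np

def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across channels at one spatial position.
    a: 1-D array of activations, one entry per channel."""
    N = len(a)
    b = np.empty_like(a, dtype=float)
    for i in range(N):
        # Sum squared activations over up to n neighbouring channels.
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
        b[i] = a[i] / denom
    return b

acts = np.array([1.0, 10.0, 1.0, 0.5])  # made-up channel activations
normed = lrn(acts)
# A large activation inflates the denominator for its neighbours,
# suppressing them relative to itself — the "lateral inhibition" effect.
```

In the real network this runs at every spatial position of a feature map; the sketch shows a single pixel's channel vector for clarity.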

Lateral Inhibition in action: The two blocks are of the same color. Put a finger across the separating line and see for yourself.

Every CNN has pooling as an essential step. Up until 2012 most pooling schemes involved non-overlapping pools of pixels. AlexNet was ready to experiment with this part of the process.

Overlapping Pooling

Pooling is the process of picking a neighborhood of s x s pixels and summarizing it.

Summarizing can be

  • A maximum over all pixel values (max pooling, the scheme AlexNet uses),
  • A simple average of all pixel values, or
  • A median across the patch of s x s pixels.

Traditionally, these patches were non-overlapping, i.e. once an s x s patch is summarized you don’t touch those pixels again and move on to the next s x s patch. The AlexNet authors instead made the pooling stride smaller than the patch size, so that neighbouring patches overlap, and found that this reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme.

Overlapped Pooling
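The difference between the two schemes comes down to the stride. A small 1-D sketch of max pooling (the values are made up): with stride equal to the window size the patches don't overlap, while a stride smaller than the window makes neighbouring patches share pixels.

```python
import numpy as np

def max_pool_1d(x, size, stride):
    # Slide a window of `size` over x with step `stride`, taking the max of each.
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

x = np.array([1, 3, 2, 5, 4, 6, 0, 2])
non_overlap = max_pool_1d(x, size=2, stride=2)  # stride == size: no overlap
overlap = max_pool_1d(x, size=3, stride=2)      # size > stride: patches overlap
```

AlexNet used size 3 with stride 2 in two dimensions; the 1-D version above just makes the windowing easier to follow.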

Having tackled normalization and pooling AlexNet was faced with a huge overfitting challenge. Their 60-million parameter model was bound to overfit. They needed to come up with an overfitting prevention strategy that could work at this scale.

Overfitting Prevention

Whenever a system has a huge number of parameters, it becomes prone to overfitting. Overfitting is when a model adapts itself so closely to the training data that it fails horribly on test data. That is the equivalent of memorizing all the answers in your maths book while failing to understand the formulae behind those answers.

Given a question that you’ve already seen, you can answer perfectly, but you’ll perform poorly on unseen questions.

The more the parameters, the higher the chance of overfitting. Source

With an architecture containing 60 million parameters AlexNet faced a considerable amount of overfitting.

They employed two methods to battle overfitting

  • Data Augmentation
  • Dropout

Data Augmentation

Data augmentation increases the size of your dataset by creating transformed copies of each image. These transforms can be as simple as rescaling, reflection, or rotation.

Data augmentation of an image of number 6. Source

These schemes led to an error reduction of over 1% in the top-1 error metric. By augmenting the data you not only enlarge the dataset; the model also tends to become invariant to transformations such as rotation or color shifts, which helps prevent overfitting.
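A minimal sketch of the kind of augmentation the paper describes, random 224x224 crops from a 256x256 image plus horizontal reflections (the image here is random noise, just a stand-in for a real photo):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img, crop=224):
    """One random crop plus an optional horizontal flip, roughly the
    train-time augmentation scheme described in the paper."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
samples = [augment(image) for _ in range(4)]  # 4 different views of one image
```

Every call yields a slightly different view of the same labelled image, so the dataset grows by orders of magnitude without collecting a single new photo. (The paper also perturbed RGB channel intensities via PCA, which is omitted here for brevity.)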


Dropout

The second technique that AlexNet used to avoid overfitting was dropout. It consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in back-propagation. So every time an input is presented, the neural network samples a different architecture.

This new-architecture-every-time is akin to using multiple architectures without expending additional resources. The model is therefore forced to learn more robust features.

Dropout in action. Source
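The mechanism can be sketched in a few lines. This uses the modern "inverted" formulation, which rescales the surviving units at training time; the paper instead halved all outputs at test time, but the two are equivalent in effect:

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(activations, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p during training
    and rescale the survivors by 1/(1-p), so test time needs no change."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p  # True = neuron survives
    return activations * mask / (1.0 - p)

h = np.ones(10)            # toy hidden-layer activations
dropped = dropout(h, p=0.5)  # roughly half the units are zeroed
```

Each forward pass draws a fresh mask, which is exactly the "different architecture every time" behaviour described above.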


The Dataset — ImageNet

Finally, how can we show the magnificence of AlexNet without describing the challenge it faced? ImageNet contains a total of 15 million labeled high-resolution images in over 22,000 categories. ILSVRC, the competition, uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

The hierarchy of classification of images in ImageNet

The 2012 Challenge

AlexNet won the ILSVRC 2012. This was a major breakthrough. Let’s look at what the ask was and what was delivered.

Left — Challenge statement | Right — Data Example

The AlexNet paper also showed how the extracted features look after each layer. These visualizations are available in the supplementary material.

Prof. Hinton, who won the Turing Award this year, was apparently not convinced by Alex’s proposed solution at first. The success of AlexNet goes to show that with enough grit and determination, innovation finds its way to success.

For a deeper dive into some of the topics mentioned above I have listed various resources I found extremely helpful.

Deep Dive

  • Hao Gao has put the details of this architecture in a table. The size of the network can be estimated from the fact that it has 62.3 million parameters, and needs 1.1 billion computation units in a forward pass.
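The 62.3 million figure can be reproduced by counting weights and biases layer by layer. The sketch below uses the single-path view of the network (i.e. it ignores the two-GPU channel grouping, which is why it comes out slightly above the 60 million quoted in the paper itself):

```python
# Per-layer parameter counts for AlexNet, ignoring the two-GPU grouping
# (each layer is assumed to see all channels of the previous layer).
conv_layers = [
    # (out_channels, kernel_h, kernel_w, in_channels)
    (96, 11, 11, 3),    # conv1
    (256, 5, 5, 96),    # conv2
    (384, 3, 3, 256),   # conv3
    (384, 3, 3, 384),   # conv4
    (256, 3, 3, 384),   # conv5
]
fc_layers = [
    # (out_features, in_features)
    (4096, 6 * 6 * 256),  # fc6: flattened 6x6x256 feature map
    (4096, 4096),         # fc7
    (1000, 4096),         # fc8: 1000-way classifier
]

# weights (out * kh * kw * in) plus one bias per output channel/feature
conv_params = sum(o * kh * kw * i + o for o, kh, kw, i in conv_layers)
fc_params = sum(o * i + o for o, i in fc_layers)
total = conv_params + fc_params
print(total)  # 62378344, about 62.3 million
```

Notice that the three fully-connected layers account for the overwhelming majority of the parameters, which is also why dropout was applied there.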

Source: Artificial Intelligence on Medium
