Demystifying DCNNs — the AlexNet

with hands-on implementation of AlexNet architecture

Sukanya Bag
7 min read · Oct 15, 2022

Introduction🧧

Remember the ImageNet visual recognition challenge of 2012?

Of course, you do! After a ton of trial, error, and experimentation, researcher Alex Krizhevsky and his co-authors Ilya Sutskever and Geoffrey E. Hinton (who literally made sense of the “deep” in deep learning) introduced the AlexNet architecture, named after the first author, and won the ImageNet challenge by a huge margin, ushering in a new era of Deep Convolutional Neural Networks for image recognition and, later, transfer learning for tasks like object detection!

In this blog, we will demystify the AlexNet paper and discuss in depth its architecture, followed by a hands-on implementation of what we learnt!

Without further ado, let’s begin.🤜🏼🤛🏼

Prerequisites⏮️

This article assumes that you have a very good knowledge of -

  1. Convolution Operation
  2. Max Pooling, Strides, Kernels, Padding
  3. Neural Networks & Activation Functions (Softmax, ReLU)
  4. CNNs (Convolutional Neural Networks)

Revise Zone!🤔

Please skip this part if you are already well aware of the prerequisites

To help you out, here are some resources from which I studied the basics of Deep Learning.

  1. Convolution Operation — Watch this
  2. Convolutional Neural Networks — Watch this
  3. For learning Activation Functions, head over to my blog

Cracking open the 8-Layered Beast 🤖

To make the complex architecture easier to understand, I’ve drawn a simplified, easier-to-digest diagram of the model. Follow it carefully throughout the blog!

AlexNet Architecture Simplified (Source-Author)

Let’s break it down and understand the architecture step by step-

  1. From the architecture drawn above, observe our input: an RGB image of size 224 x 224 with 3 channels (Red (R), Green (G), and Blue (B)). It represents any color image.
  2. Layer 1 — In the 1st layer, we apply a convolution with 96 kernels of size 11 x 11 and a stride of 4. This results in 96 feature maps of size 55 x 55 (one per kernel), which is simply a 3D tensor of shape 55 x 55 x 96. (A well-known quirk: with a true 224 x 224 input these numbers don’t quite work out; you need a 227 x 227 input, or a little padding, to get exactly 55 x 55.)
  3. Then we apply a max-pooling of kernel size 3x3 and stride size of 2. This results in an image of size 27x27 with 96 channels.
  4. Layer 2 — In the 2nd layer of AlexNet, we again apply a convolution, this time with padding, using 256 kernels of size 5x5. Notice I have mentioned “same” beside convolution in the diagram. SAME padding means the input is padded with zeros around its border so that (for a stride of 1) the output has the same spatial dimensions as the input; in frameworks like Keras you simply set padding="same". This results again in a 3D tensor of size 27 x 27 with 256 channels.
  5. Again, we apply a max-pooling layer of kernel size 3x3 and stride of size 2, as we did after layer 1. This results in an image of size 13x13 with 256 channels.
  6. Layer 3 — In the 3rd layer of AlexNet, once again we apply a convolution with the SAME padding (as discussed in Layer 2) with 384 kernels of size 3x3. This results in a 3D tensor of size 13x13 with 384 channels.
  7. Layer 4— In the 4th layer, we apply a convolution with the SAME padding with 384 kernels of size 3x3. This again results in a 3D tensor of size 13x13 with 384 channels.
  8. Layer 5— In the 5th layer, we apply a convolution with the SAME padding with 256 kernels of size 3x3. This again results in a 3D tensor of size 13x13 with 256 channels.
  9. Now we again apply a max-pool of stride size 2 and 3x3 max pool filter (kernel). This results in an image of size 6x6 with 256 channels.
  10. Layers 6, 7, 8 — On applying the flattening operation (a simple product of the dimensions) to the 6x6x256 (= 9216) 3D tensor, we pass the result through 2 fully connected layers of 4096 units each, followed by our final softmax layer to output the classes. (Note that the diagram has 1000 output classes, as the network was built for the ImageNet problem, which had 1000 classes to be recognized.)

So, basically, AlexNet has 5 convolution layers followed by 2 fully connected layers and a final softmax (fully connected) layer for producing the output, i.e. 8 learned layers in total. You can verify every shape we derived above with the short sanity check below.
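Here is a tiny, framework-free Python sketch that applies the standard output-size formula, out = floor((in - kernel + 2*padding) / stride) + 1, to every layer. I assume a 227 x 227 input and paddings of 2 and 1 for the “same” 5x5 and 3x3 convolutions, so the numbers match the walkthrough:

```python
# Sanity check of the spatial dimensions derived above.
def out_size(in_size, kernel, stride, padding=0):
    return (in_size - kernel + 2 * padding) // stride + 1

size = 227                          # 227 x 227 input (see the note on Layer 1)
size = out_size(size, 11, 4)        # conv1, 96 kernels      -> 55
size = out_size(size, 3, 2)         # maxpool1               -> 27
size = out_size(size, 5, 1, 2)      # conv2, "same" padding  -> 27
size = out_size(size, 3, 2)         # maxpool2               -> 13
size = out_size(size, 3, 1, 1)      # conv3, "same" padding  -> 13
size = out_size(size, 3, 1, 1)      # conv4, "same" padding  -> 13
size = out_size(size, 3, 1, 1)      # conv5, "same" padding  -> 13
size = out_size(size, 3, 2)         # maxpool3               -> 6

print(size, "x", size, "x 256 =", size * size * 256)  # 6 x 6 x 256 = 9216
```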

Why is AlexNet so COOL?!

crunching the paper…

Let’s discuss some of the most important concepts extensively used in deep learning and object detection even today, that were introduced in the AlexNet paper —

  1. The AlexNet paper was one of the first to use the massively powerful non-linear activation function ReLU at scale, showing that it trains deep networks far faster than saturating functions like tanh.
  2. Dropout was used to prevent overfitting and to make the learned features more robust and general.
  3. Data Augmentation was used to transform the training data into various forms (random crops, horizontal flips, color jittering, etc.) to enhance its diversity (a small Keras sketch of these ideas follows this list).
  4. Multiple GPUs (two, in fact) were used to train the model, mainly because a single GPU of that era did not have enough memory for the full network.
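To make these concrete, here is a minimal Keras sketch (my own illustration, not code from the paper) showing where ReLU, dropout and augmentation typically slot into a model; the full AlexNet architecture comes later in the hands-on section:

```python
# Where ReLU, Dropout and data augmentation usually appear in a Keras model.
# This is an illustrative fragment, not the full AlexNet.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Augmentation: AlexNet relied on random crops and horizontal reflections.
    layers.RandomCrop(227, 227),
    layers.RandomFlip("horizontal"),
    # ReLU non-linearity right after the convolution.
    layers.Conv2D(96, kernel_size=11, strides=4, activation="relu"),
    layers.MaxPooling2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    # Dropout: each unit is zeroed with probability 0.5 during training only.
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),
])
```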

Apart from these, the AlexNet paper used a concept called Local Response Normalization (LRN), which, though noteworthy as a concept, was later superseded by more advanced techniques like Batch Normalization.

Personally, I do prefer batch normalization while solving deep learning problems, but the concept of LRN is so beautiful that I could not afford to skip it while studying the AlexNet paper.

LRN connects Neuroscience with AI 🔮🪄

We know that before high-dimensional data hits the activation functions in a neural network's hidden layers, it helps to normalize it to roughly zero mean and unit variance so that the model can learn effectively. Normalization rescales the values of the features in the dataset without distorting the differences between their ranges or losing information.
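As a quick illustration of what “zero mean and unit variance” means in code (a generic example, not something specific to AlexNet):

```python
# Standardize each column of a data matrix to zero mean and unit variance.
import numpy as np

X = np.random.rand(1000, 4096)                  # some high-dimensional data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # per-column standardization

print(X_norm.mean(axis=0).round(6)[:3], X_norm.std(axis=0).round(6)[:3])
```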

After using ReLU, which is f(x) = max(0, x), you will find that the values it produces have no upper bound, unlike the tanh and sigmoid functions. Hence, some normalization after ReLU is helpful. The researchers proposed a method called Local Response Normalization in the AlexNet paper, which I found quite fascinating, as it is motivated by an important concept in neuroscience called “lateral inhibition”: the way an active neuron influences its neighboring neurons in response to a particular stimulus.

Since ReLU clips everything below 0 to 0 but has no maximum value, its activations are unbounded above. LRN is used to normalize those unbounded activations given out by ReLU, so that no single neuron's response can dominate its neighbors.

You will be surprised to know that the concept of LRN in artificial neural networks was adapted from a neuroscience concept called Lateral inhibition.

Lateral inhibition is the process by which an excited neuron suppresses the activity of its neighboring neurons.

In our Central Nervous System, excited or stimulated neurons tend to suppress the activity of neighboring neurons, which helps sharpen our sense perception (recall how you can still hazily recognize the surroundings while your focus is on a target object: you can still sense the surrounding trees while looking at a bird). This sort of suppression enhances perception by increasing the contrast in visual images, and it sharpens not only our sight but also hearing, smell, and our other senses!

But how is LRN associated with lateral inhibition?


Local Response Normalization is likewise a process of contrast enhancement, applied to the input feature maps of convolutional neural nets. LRN operates on local neighborhoods of the feature maps (images), taking every pixel into account.

The overall idea is to enhance “peak” responses and dampen “flat” ones in the input images or feature maps, since a peak is more strongly correlated with the presence of a target object or stimulus than a flat but high-magnitude response, which doesn't tell us much about whether the object or stimulus is actually present. Hence, LRN increases the network's sensitivity to the stimuli we actually care about.

Summing up: if one response in a local patch (or neighborhood) of the image correlates strongly with the target object/stimulus (a peak), the competition among neurons in that patch means the strong response suppresses the weaker ones, thereby enhancing the peak. But if the responses are uniformly strong (flat), they suppress each other almost equally, so the whole neighborhood gets damped.

The LRN process thus enhances object detection and recognition.
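If you prefer seeing the math as code, here is a tiny NumPy sketch of the inter-channel LRN formula from the paper: each activation is divided by (k + alpha * sum of squared activations over the n neighboring channels) raised to the power beta, with k = 2, n = 5, alpha = 1e-4 and beta = 0.75, the values reported in the paper.

```python
# NumPy sketch of AlexNet-style Local Response Normalization (across channels):
#   b[i] = a[i] / (k + alpha * sum_{j in neighborhood of i} a[j]^2) ** beta
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: feature maps of shape (channels, height, width)."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Example: normalize a random stack of 96 feature maps of size 55 x 55.
activations = np.random.rand(96, 55, 55)
normalized = local_response_norm(activations)
print(normalized.shape)  # (96, 55, 55)
```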

The LRN concept was eventually overshadowed (its impact turned out to be fairly small) by the later ideas of layer and batch normalization, which normalize an entire layer of activations, or mini-batches of activations, respectively, and which work well on almost every neural network in use today.

With that, we pretty much crunched the research paper. Now let us get to the fun part.

Hands-On Implementation of AlexNet 👩🏼‍💻

Just by looking at the architecture, we can easily code AlexNet ourselves. Below is the code for the same -
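What follows is a minimal Keras sketch of the network we walked through, my own re-implementation rather than the authors' original code. The input is 227 x 227 x 3 so the shapes match our derivation, padding="same" is used wherever the diagram says SAME, and the LRN layers from the paper are omitted (BatchNormalization would be the modern substitute).

```python
# A Keras sketch of the 8-layer AlexNet described above (my own re-implementation,
# not the authors' original code).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_alexnet(num_classes=1000):
    return models.Sequential([
        # Layer 1: 96 kernels of 11 x 11, stride 4            -> 55 x 55 x 96
        layers.Conv2D(96, kernel_size=11, strides=4, activation="relu",
                      input_shape=(227, 227, 3)),
        layers.MaxPooling2D(pool_size=3, strides=2),          # -> 27 x 27 x 96

        # Layer 2: 256 kernels of 5 x 5, "same" padding       -> 27 x 27 x 256
        layers.Conv2D(256, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),          # -> 13 x 13 x 256

        # Layers 3-5: 3 x 3 convolutions with "same" padding  -> 13 x 13 x 256
        layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
        layers.Conv2D(384, kernel_size=3, padding="same", activation="relu"),
        layers.Conv2D(256, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),          # -> 6 x 6 x 256

        # Layers 6-8: flatten (9216 values) and fully connected layers
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_alexnet()
model.compile(optimizer="sgd",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```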

and there you go! The legendary AlexNet is at your fingertips!

References🔃

I have used the following resources for reference -

  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks (the AlexNet paper)
  2. Lateral Inhibition
  3. LRN

Conclusion🔚

Hope you had fun learning about AlexNet! If you would like me to cover any other neural network architecture or research paper, please let me know in the comments!

If you are a beginner in Data Science and Machine Learning and have some specific queries with regard to Data Science/ML-AI, guidance for Career Transition to Data Science, Interview/Resume Preparation or even want to get a Mock Interview before your D-Day, feel free to book a 1:1 call here. I will be happy to help!


Happy Learning!😎

Written by Sukanya Bag

I love to teach Machine Learning in simple words! All links at bio.link/sukannya
