An Intuitive Explanation of Convolutional Neural Networks

Posted on August 11, 2016 (updated May 29, 2017) by ujjwalkarn

What are Convolutional Neural Networks and why are they important?

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, in addition to powering vision in robots and self-driving cars.

Figure 1: Source [1]

In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.

Figure 2: Source [2]

ConvNets, therefore, are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience. The primary purpose of this blog post is to develop an understanding of how Convolutional Neural Networks work on images.

If you are new to neural networks in general, I would recommend reading this short tutorial on Multi Layer Perceptrons to get an idea about how they work, before proceeding. Multi Layer Perceptrons are referred to as “Fully Connected Layers” in this post.

The LeNet Architecture (1990s)

LeNet was one of the very first convolutional neural networks which helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988 [3]. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.

Below, we will develop an intuition of how the LeNet architecture learns to recognize images. Several new architectures have been proposed in recent years that improve on LeNet, but they all use the main concepts from LeNet and are relatively easy to understand once you have a clear understanding of it.

Figure 3: A simple ConvNet. Source [5]

The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).

There are four main operations in the ConvNet shown in Figure 3 above:

Convolution
Non Linearity (ReLU)
Pooling or Sub Sampling
Classification (Fully Connected Layer)

These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets. We will try to understand the intuition behind each of these operations below.

An Image is a matrix of pixel values

Essentially, every image can be represented as a matrix of pixel values.

Figure 4: Every image is a matrix of pixel values. Source [6]

Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels – red, green and blue – you can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.

A grayscale image, on the other hand, has just one channel. For the purpose of this post, we will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 – zero indicating black and 255 indicating white.
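To make the matrix view concrete, here is a minimal Python sketch. The tiny 3 x 3 images and their pixel values are made up for illustration (they are not taken from Figure 4), and plain nested lists stand in for whatever array type a real library would use.

```python
# A grayscale image: one 2d matrix, each value in the range 0 (black) to 255 (white).
# The pixel values here are invented purely for illustration.
gray_image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]

# A colour image: three 2d matrices (channels) of the same size, one per colour.
rgb_image = {
    "red":   [[255, 0, 0], [0, 0, 0], [10, 20, 30]],
    "green": [[0, 255, 0], [0, 0, 0], [10, 20, 30]],
    "blue":  [[0, 0, 255], [0, 0, 0], [10, 20, 30]],
}


def shape(matrix):
    """Return (height, width) of a 2d matrix stored as a list of rows."""
    return len(matrix), len(matrix[0])


def in_pixel_range(matrix):
    """Check that every pixel value lies in the range 0 to 255."""
    return all(0 <= value <= 255 for row in matrix for value in row)


print(shape(gray_image))           # (3, 3)
print(len(rgb_image))              # 3
print(in_pixel_range(gray_image))  # True
```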

The Convolution Step

ConvNets derive their name from the “convolution” operator. The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where pixel values are only 0 and 1):

[Image: the 5 x 5 input image matrix (green)]

Also, consider another 3 x 3 matrix as shown below:

[Image: the 3 x 3 matrix (orange)]

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:

Figure 5: The Convolution operation. The output matrix is called Convolved Feature or Feature Map. Source [7]

Take a moment to understand how the computation above is being done. We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image in each stride.

In CNN terminology, the 3×3 matrix is called a ‘filter’ or ‘kernel’ or ‘feature detector’ and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map’. It is important to note that filters act as feature detectors on the original input image.
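The sliding computation described above can be sketched in a few lines of plain Python. This is only an illustration: the function name convolve2d is our own, and the 5 x 5 image and 3 x 3 filter values are assumptions chosen for the sketch, since the actual matrices appear only as images in this post. Note also that, like most CNN libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map.

    At each position the overlapping values are multiplied element-wise
    and the products are summed to give one element of the output.
    """
    k = len(kernel)  # assumes a square k x k kernel
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Assumed example values: a 5 x 5 image of 0s and 1s and a 3 x 3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With a stride of 1, the 3 x 3 filter fits into the 5 x 5 image at 3 x 3 positions, which is why the feature map is 3 x 3.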

It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image. As an example, consider the following input image:

In the table below, we can see the effects of convolution of the above image with different filters. As shown, we can perform operations such as Edge Detection, Sharpen and Blur just by changing the numeric values of our filter matrix before the convolution operation [8] – this means that different filters can detect different features from an image, for example edges, curves etc. More such examples are available in Section 8.2.4 here.

[Image: table showing the results of convolving the image with Edge Detection, Sharpen and Blur filters]
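As a rough illustration of how the filter values alone decide which feature is picked up, the sketch below applies three commonly used 3 x 3 kernels to a tiny image containing a vertical edge. The kernel values are standard textbook choices and may differ from the exact ones in the table; the image values and the helper convolve2d are our own assumptions.

```python
def convolve2d(image, kernel):
    """Stride-1 convolution without padding (no kernel flip)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Commonly used kernels (assumed values, for illustration only).
EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
BLUR = [[1 / 9] * 3 for _ in range(3)]

# A 5 x 5 image: dark (0) on the left, bright (100) on the right.
image = [[0, 0, 100, 100, 100] for _ in range(5)]

edges = convolve2d(image, EDGE)
sharpened = convolve2d(image, SHARPEN)
blurred = convolve2d(image, BLUR)

print(edges[0])                            # [-300, 300, 0]
print(sharpened[0])                        # [-100, 200, 100]
print([round(v, 1) for v in blurred[0]])   # [33.3, 66.7, 100.0]
```

The edge filter responds strongly around the dark-to-bright boundary and gives 0 in the flat bright region, the sharpen filter exaggerates the jump, and the blur filter smooths it into a gradual ramp.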

Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:

Figure 6: The Convolution Operation. Source [9]

A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. The convolution of another filter (with the green outline), over the same image gives a different feature map as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices as we have discussed above.

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size and the architecture of the network before the training process). The more filters we have, the more image features get extracted, and the better our network tends to become at recognizing patterns in unseen images.

The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed:

Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown in Figure 7, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. You can think of these three feature maps as stacked 2d matrices, so, the ‘depth’ of the feature map would be three.

Figure 7

Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.

Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to bordering elements of our input image matrix. A nice feature of zero padding is that it allows us to control the size of the feature maps. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. This has been explained clearly in [14].
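How stride and zero-padding interact can be sketched with the usual output-size formula, (W - F + 2P) / S + 1, for an input of width W, filter width F, padding P and stride S (see [4]). The function name below is hypothetical and the sketch assumes square inputs and filters; the depth of the output is simply the number of filters used.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Width (and height) of the feature map for a square input and filter."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        # The filter does not fit neatly across the padded input.
        raise ValueError("filter size, stride and padding do not fit the input")
    return span // stride + 1


print(feature_map_size(5, 3))                        # 3  (stride 1, no padding)
print(feature_map_size(5, 3, stride=2))              # 2  (larger stride, smaller map)
print(feature_map_size(5, 3, stride=1, padding=1))   # 5  (padding keeps the size)
```

The last call shows the point made about zero-padding: with one ring of zeros around a 5 x 5 input, a 3 x 3 filter produces a feature map that is still 5 x 5.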

Introducing Non Linearity (ReLU)

An additional operation called ReLU has been used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:

Output = max(0, Input)

Figure 8: the ReLU operation

ReLU is an element wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. The purpose of ReLU is to introduce non-linearity in our ConvNet, since most of the real-world data we would want our ConvNet to learn would be non-linear (Convolution is a linear operation – element wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU).
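As a small sketch of the element-wise behaviour, the snippet below applies ReLU to a made-up feature map (the values are assumptions, not taken from any figure in this post).

```python
def relu(feature_map):
    """Replace every negative value in the feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [-300, 300, 0],
    [15, -20, 7],
]

print(relu(feature_map))   # [[0, 300, 0], [15, 0, 7]]
```

Positive values pass through unchanged, so the rectified feature map has the same size as the input feature map.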

The ReLU operation can be understood clearly from Figure 9 below. It shows the ReLU operation applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the ‘Rectified’ feature map.

Figure 9: ReLU operation. Source [10]

Other non linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.

The Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.

Figure 10 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.

Figure 10: Max Pooling. Source [4]

We slide our 2 x 2 window by 2 cells (also called ‘stride’) and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
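A minimal sketch of this operation in plain Python is shown below. The function name max_pool and the 4 x 4 input values are assumptions made for the example.

```python
def max_pool(feature_map, window=2, stride=2):
    """Take the maximum of each window x window region of the feature map."""
    rows = (len(feature_map) - window) // stride + 1
    cols = (len(feature_map[0]) - window) // stride + 1
    return [
        [
            max(feature_map[i * stride + a][j * stride + b]
                for a in range(window) for b in range(window))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Assumed 4 x 4 rectified feature map, for illustration only.
rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(max_pool(rectified))   # [[6, 8], [3, 4]]
```

The 4 x 4 map shrinks to 2 x 2, and each output value is the largest activation in its 2 x 2 region.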

In the network shown in Figure 11, the pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).

Figure 11: Pooling applied to Rectified Feature Maps

Figure 12 shows the effect of Pooling on the Rectified Feature Map we received after the ReLU operation in Figure 9 above.

Figure 12: Pooling. Source [10]

The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling

makes the input representations (feature dimension) smaller and more manageable
reduces the number of parameters and computations in the network, thereby helping to control overfitting [4]
makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in input will not change the output of Pooling – since we take the maximum / average value in a local neighborhood).
helps us arrive at an almost scale invariant representation of our image (the exact term is “equivariant”). This is very powerful since we can detect objects in an image no matter where they are located (read [18] and [19] for details).
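The claim about small translations can be checked with a toy example. In the sketch below (our own construction, with assumed values), a single strong activation is moved by one pixel. As long as it stays inside the same 2 x 2 pooling window the pooled output does not change, but once it crosses into a neighbouring window the output does change, which is why the invariance only holds for small shifts.

```python
def max_pool(feature_map, window=2, stride=2):
    """Take the maximum of each window x window region of the feature map."""
    rows = (len(feature_map) - window) // stride + 1
    cols = (len(feature_map[0]) - window) // stride + 1
    return [
        [
            max(feature_map[i * stride + a][j * stride + b]
                for a in range(window) for b in range(window))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def single_spike(row, col, size=4):
    """Return a size x size map of zeros with one strong activation (9)."""
    grid = [[0] * size for _ in range(size)]
    grid[row][col] = 9
    return grid


original = single_spike(0, 0)
shifted_inside = single_spike(1, 1)   # moved, but still in the same 2 x 2 window
shifted_across = single_spike(1, 2)   # moved into the neighbouring window

print(max_pool(original))         # [[9, 0], [0, 0]]
print(max_pool(shifted_inside))   # [[9, 0], [0, 0]]
print(max_pool(shifted_across))   # [[0, 9], [0, 0]]
```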

Story so far

Figure 13

So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in Figure 13, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.

Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].

The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.
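A common point of confusion is how six filters applied to several pooled feature maps give exactly six feature maps rather than more. One way to picture it is sketched below, under the assumption that each filter is a small 3-dimensional block that spans all of the input maps and sums over them, so that one filter always yields one feature map. (In the original LeNet each filter is connected to only a subset of the input maps, so this is a simplification.) The function names, the count of three input maps and all values are hypothetical.

```python
def convolve_volume(volume, filter3d):
    """Convolve a stack of feature maps with one filter of matching depth.

    volume:   list of D feature maps, each H x W
    filter3d: list of D kernels, each k x k
    Returns a single (H - k + 1) x (W - k + 1) feature map.
    """
    k = len(filter3d[0])
    rows = len(volume[0]) - k + 1
    cols = len(volume[0][0]) - k + 1
    out = [[0] * cols for _ in range(rows)]
    for d in range(len(volume)):          # sum contributions from every input map
        for i in range(rows):
            for j in range(cols):
                for a in range(k):
                    for b in range(k):
                        out[i][j] += volume[d][i + a][j + b] * filter3d[d][a][b]
    return out


def conv_layer(volume, filters):
    """Apply each filter to the whole volume: one output map per filter."""
    return [convolve_volume(volume, f) for f in filters]


# Three pooled 4 x 4 maps filled with ones (assumed values).
pooled = [[[1] * 4 for _ in range(4)] for _ in range(3)]
# Six 3 x 3 x 3 filters; filter n is filled with the value n + 1.
filters = [[[[n + 1] * 3 for _ in range(3)] for _ in range(3)] for n in range(6)]

output = conv_layer(pooled, filters)
print(len(output))    # 6
print(output[0])      # [[27, 27], [27, 27]]
print(output[5])      # [[162, 162], [162, 162]]
```

The output depth equals the number of filters (six), regardless of how many maps went in.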

Fully Connected Layer

The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.

The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer).

Figure 14: Fully Connected Layer - each node is connected to every node in the adjacent layer

Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].

The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
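Here is a minimal sketch of the Softmax function in Python. The four input scores are hypothetical values for the classes dog, cat, boat and bird; subtracting the largest score before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to one."""
    shift = max(scores)                              # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for [dog, cat, boat, bird].
scores = [1.0, 2.0, 5.0, 0.5]
probabilities = softmax(scores)

print([round(p, 3) for p in probabilities])
print(round(sum(probabilities), 6))                  # 1.0
print(probabilities.index(max(probabilities)))       # 2 (the boat class)
```

The largest score still gets the largest probability; Softmax only rescales the scores so that they can be read as probabilities.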

Putting it all together – Training using Backpropagation

As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while the Fully Connected layer acts as a classifier.

Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for the Boat class and 0 for the other three classes, i.e.

Input Image = Boat
Target Vector = [0, 0, 1, 0]

Figure 15: Training the ConvNet

The overall training process of the Convolution Network may be summarized as below:

Step1: We initialize all filters and parameters / weights with random values

Step2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
- Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
- Since weights are randomly assigned for the first training example, output probabilities are also random.

Step3: Calculate the total error at the output layer (summation over all 4 classes)
- Total Error = ∑ ½ (target probability – output probability) ²

Step4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
- The weights are adjusted in proportion to their contribution to the total error.
- When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
- This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
- Parameters like the number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during the training process – only the values of the filter matrix and connection weights get updated.

Step5: Repeat steps 2-4 with all images in the training set.

The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
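To see the error from Step 3 in numbers, the short sketch below evaluates the Total Error formula for the two output vectors mentioned in the steps above. Only the function name is our own; the target and output values come from the example.

```python
def total_error(target, output):
    """Total Error = sum over classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))


target = [0, 0, 1, 0]                      # the input image is a boat
before_update = [0.2, 0.4, 0.1, 0.3]       # output with random initial weights
after_update = [0.1, 0.1, 0.7, 0.1]        # output after the weights are adjusted

print(round(total_error(target, before_update), 2))   # 0.55
print(round(total_error(target, after_update), 2))    # 0.06
```

The error drops from 0.55 to 0.06, which is what "closer to the target vector" means in Step 4.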

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.

Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [4] and [12] for a mathematical formulation and thorough understanding.

Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. Please note, however, that these operations can be repeated any number of times in a single ConvNet. In fact, some of the best performing ConvNets today have tens of Convolution and Pooling layers! Also, it is not necessary to have a Pooling layer after every Convolutional Layer. As can be seen in Figure 16 below, we can have multiple Convolution + ReLU operations in succession before having a Pooling operation. Also notice how each layer of the ConvNet is visualized in Figure 16.

Figure 16: Source [4]

Visualizing Convolutional Neural Networks

In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14]. This is demonstrated in Figure 17 below – these features were learnt using a Convolutional Deep Belief Network and the figure is included here just for demonstrating the idea (this is only an example: real life convolution filters may detect objects that have no meaning to humans).

Figure 17: Learned features from a Convolutional Deep Belief Network. Source [21]

Adam Harley created amazing visualizations of a Convolutional Neural Network trained on the MNIST Database of handwritten digits [13]. I highly recommend playing around with it to understand details of how a CNN works.

We will see below how the network works for an input ‘8’. Note that the visualization in Figure 18 does not show the ReLU operation separately.

Figure 18: Visualizing a ConvNet trained on handwritten digits. Source [13]

The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image. As seen, using six different filters produces a feature map of depth six.

Convolutional Layer 1 is followed by Pooling Layer 1 that does 2 × 2 max pooling (with stride 2) separately over the six feature maps in Convolution Layer 1. You can move your mouse pointer over any pixel in the Pooling Layer and observe the 2 x 2 grid it forms in the previous Convolution Layer (demonstrated in Figure 19). You’ll notice that the pixel having the maximum value (the brightest one) in the 2 x 2 grid makes it to the Pooling layer.

Figure 19: Visualizing the Pooling Operation. Source [13]

Pooling Layer 1 is followed by sixteen 5 × 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2 that does 2 × 2 max pooling (with stride 2). These two layers use the same concepts as described above.
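As a rough check of how the spatial size shrinks through these four layers, the sketch below traces the width of the feature maps from the 32 x 32 input onwards. It assumes no zero-padding in the convolution layers, which is our assumption rather than something stated above, and the helper names are hypothetical.

```python
def conv_size(size, filter_size, stride=1):
    """Output width of a convolution without zero-padding."""
    return (size - filter_size) // stride + 1


def pool_size(size, window=2, stride=2):
    """Output width of a pooling layer."""
    return (size - window) // stride + 1


size = 32                                  # 32 x 32 input image
conv1 = conv_size(size, 5)                 # six 5 x 5 filters, stride 1
pool1 = pool_size(conv1)                   # 2 x 2 max pooling, stride 2
conv2 = conv_size(pool1, 5)                # sixteen 5 x 5 filters, stride 1
pool2 = pool_size(conv2)                   # 2 x 2 max pooling, stride 2

print(conv1, pool1, conv2, pool2)          # 28 14 10 5
print(pool2 * pool2 * 16)                  # 400 values passed on after Pooling Layer 2
```

Under this assumption, Pooling Layer 2 produces sixteen 5 x 5 maps, i.e. 400 values in total.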

We then have three fully-connected (FC) layers. There are:

120 neurons in the first FC layer
100 neurons in the second FC layer
10 neurons in the third FC layer corresponding to the 10 digits – also called the Output layer

Notice how in Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the 2nd Fully Connected layer (hence the name Fully Connected).

Also, note how the only bright node in the Output Layer corresponds to ‘8’ – this means that the network correctly classifies our handwritten digit (a brighter node denotes that the output from it is higher, i.e. 8 has the highest probability among all digits).

Figure 20: Visualizing the Fully Connected Layers. Source [13]

The 3d version of the same visualization is available here.

Other ConvNet Architectures

Convolutional Neural Networks have been around since the early 1990s. We discussed the LeNet above, which was one of the very first convolutional neural networks. Some other influential architectures are listed below [3] [4].

LeNet (1990s): Already covered in this article.

1990s to 2012: In the years from the late 1990s to the early 2010s, convolutional neural networks were in incubation. As more and more data and computing power became available, the tasks that convolutional neural networks could tackle became more and more interesting.

AlexNet (2012) – In 2012, Alex Krizhevsky (and others) released AlexNet, which was a deeper and much wider version of the LeNet and won the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a large margin. It was a significant breakthrough with respect to the previous approaches and the current widespread application of CNNs can be attributed to this work.

ZF Net (2013) – The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters.

GoogLeNet (2014) – The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).

VGGNet (2014) – The runner-up in ILSVRC 2014 was the network that became known as the VGGNet. Its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance.

ResNets (2015) – The Residual Network developed by Kaiming He (and others) was the winner of ILSVRC 2015. ResNets are currently by far the state-of-the-art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 2016).

DenseNet (August 2016) – Recently published by Gao Huang (and others), the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Check out the Torch implementation here.

Conclusion

In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms. There are several details I have oversimplified / skipped, but hopefully this post gave you some intuition around how they work.

This post was originally inspired by Understanding Convolutional Neural Networks for NLP by Denny Britz (which I would recommend reading) and a number of explanations here are based on that post. For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.

All images and animations used in this post belong to their respective authors as listed in the References section below.

References

1. karpathy/neuraltalk2: Efficient Image Captioning code in Torch, Examples
2. Shaoqing Ren, et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015, arXiv:1506.01497
3. Neural Network Architectures, Eugenio Culurciello’s blog
4. CS231n Convolutional Neural Networks for Visual Recognition, Stanford
5. Clarifai / Technology
6. Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
7. Feature extraction using convolution, Stanford
8. Wikipedia article on Kernel (image processing)
9. Deep Learning Methods for Vision, CVPR 2012 Tutorial
10. Neural Networks by Rob Fergus, Machine Learning Summer School 2015
11. What do the fully connected layers do in CNNs?
12. Convolutional Neural Networks, Andrew Gibiansky
13. A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015
14. Understanding Convolutional Neural Networks for NLP
15. Backpropagation in Convolutional Neural Networks
16. A Beginner’s Guide To Understanding Convolutional Neural Networks
17. Vincent Dumoulin, et al, “A guide to convolution arithmetic for deep learning”, 2016, arXiv:1603.07285
18. What is the difference between deep learning and usual machine learning?
19. How is a convolutional neural network able to learn invariant features?
20. A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
21. Honglak Lee, et al, “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations”

208 thoughts on “An Intuitive Explanation of Convolutional Neural Networks”

싸커모하매드이브라힘ibrhaim says:

August 12, 2016 at 1:13 am

just want to say wow…

Raman S says:

August 12, 2016 at 6:39 am

Great explanation.Thanks!

月夜 says:

August 12, 2016 at 1:57 pm

One of the best posts of CNN I have read.

Roy says:

August 12, 2016 at 2:46 pm

Great explanation!!
I have some thoughts about it but would love to hear your intuition: which parts of these principles apply and in what manner, to non-spatial input feature vector?

1. ujjwalkarn says:
  
  August 22, 2016 at 12:53 pm
  
  Thanks Roy! Could you clarify what do you mean by non-spatial input feature vector?
  
  1. Roy Abitbol says:
    
    August 25, 2016 at 3:08 am
    
    Well.. For example: A vector of x,y,z coordinates representing a recording of pen movements, over time.
    
  2. ujjwalkarn says:
    
    August 28, 2016 at 6:12 pm
    
    I have not come across such vectors as an input to CNNs before, but will look out for similar examples and let you know. Thanks for asking!
    
yoni says:

August 12, 2016 at 6:22 pm

amazing

Raman S says:

August 13, 2016 at 4:36 am

What would be an example of a filter or kernel in the 2nd convolution layer? Presumably, this filter would need to extract higher level features since the 1st layer has already performed edge extraction. What would such a matrix look like? Once again, thanks for your great work!

1. ujjwalkarn says:
  
  August 22, 2016 at 12:52 pm
  
  Hey Raman, you’re right. A filter in the second convolution layer can be considered a 3-dimensional matrix whose width, height and depth can be chosen (for example, a 5x5x3 matrix). Such a filter may combine information from multiple feature maps in the first pooling layer to produce a single feature map in the second convolution layer (depending upon its depth). Refer to Table 1 on Page 8 of the LeNet paper (http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) for an explanation of how each filter acts on a subset of six feature maps in pooling layer S2 to produce a feature map in convolution layer C3.
  
Jiaxin Gu says:

August 17, 2016 at 3:33 pm

Found a spelling mistake in headline ”Other ConvNet Archiectures” which should be ”Architectures” BTW It’s one of the best posts of CNN I have read!

Thiago Vieira (@tcvieira) says:

August 19, 2016 at 3:01 am

Best intro posts about CNN I have read. Thank you!

stephenlee (@stephenlee0) says:

August 25, 2016 at 12:01 am

Very helpful, thank you! Any plans to write an article about RNN like this?

1. ujjwalkarn says:
  
  August 28, 2016 at 5:56 pm
  
  Not immediately, but will try to write in near future, thanks!
  
Krzysiek says:

August 25, 2016 at 2:34 am

Finally I almost fully understand the CNN idea. Thanks for the best CNN intro I’ve ever read.
But I say “almost” because one thing is still not clear to me after reading:
In fig 13 and fig 18, when it comes to the second convolution layer: fig 13 mentions 6 filters and fig 18 mentions 16 filters. Why does the 2nd conv layer have 6 (fig 13) and 16 (fig 18) matrices when it was applied to a layer after pooling that contained 3 (fig 13) and 8 (fig13) matrices? Shouldn’t the second layer have 6*3 (fig 13) and 8*16 (fig 18)? Or is only one matrix from the previous layer shown for readability?

1. ujjwalkarn says:
  
  August 28, 2016 at 5:45 pm
  
  In Figure 18, Convolution Layer 2 is formed by using 16 filters. Each such filter combines information from a subset of 6 feature maps in Pooling Layer 1 to produce a single feature map in Convolution Layer 2.
  
  See Table 1 on Page 8 of the LeNet paper (http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) and the explanation below it and let me know if that clarifies!
  
  2nd Convolution Layer in Figure 13 can be explained similarly.
  
Tanay Chowdhury (@tanay006) says:

August 25, 2016 at 8:57 pm

Nicely done! My feedback would be to provide a short visualized intro for the fully connected network as well, like the ones for pooling and ReLU, in this article itself. Also, hyperlinking the numbered indexes like [1], [2] would help the reader.

1. ujjwalkarn says:
  
  August 28, 2016 at 5:48 pm
  
  Thanks!
  
Okiriza says:

August 26, 2016 at 6:13 am

excellent explanation, thanks!

L. says:

August 26, 2016 at 10:05 am

Awesome post. Quick question:

In the final example, how are the 16 filters applied to Pooling Layer 1 (with depth 6) to generate a feature map with depth 16? Why is the depth of the generated feature map not 96 (= 16 filters * depth 6)?

1. ujjwalkarn says:
  
  August 28, 2016 at 5:47 pm
  
  Each of the 16 filters has an associated depth and is applied on a subset of feature maps in Pooling Layer 1 to produce a single feature map in Convolution Layer 2.
  
  See Table 1 on Page 8 of the LeNet paper (http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) and the explanation below it. Let me know if that clarifies!
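
As a rough illustration of why the depth is 16 and not 96, the sketch below counts connections and trainable parameters for a sparse scheme of the kind described in the paper. The `subsets` list is my reconstruction of that pattern (6 filters reading 3 contiguous maps, 6 reading 4 contiguous maps, 3 reading 4 non-contiguous maps, 1 reading all 6) and should be treated as an assumption; check it against Table 1 before relying on the exact subsets. The 5×5 filter size is also an assumption taken from LeNet.

```python
# Hypothetical reconstruction of the sparse connection pattern: each entry
# lists which of the 6 pooled feature maps one filter reads from.
subsets = (
    [[(s + d) % 6 for d in range(3)] for s in range(6)]    # 6 filters, 3 contiguous maps
    + [[(s + d) % 6 for d in range(4)] for s in range(6)]  # 6 filters, 4 contiguous maps
    + [[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]]           # 3 filters, 4 non-contiguous maps
    + [list(range(6))]                                     # 1 filter, all 6 maps
)

k = 5                                            # assumed filter width and height
num_output_maps = len(subsets)                   # one output map per filter
connections = sum(len(s) for s in subsets)       # (filter, input map) pairs
params = connections * k * k + num_output_maps   # weights plus one bias per filter

print(num_output_maps, connections, params)  # 16 60 1516
```

However many input maps a filter reads, it still contributes a single output map, which is why the count stays at 16.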
  
  1. L. says:
    
    August 30, 2016 at 9:26 am
    
    This helps a lot. Thanks so much!
    
Zoze says:

September 5, 2016 at 9:36 am

Hi – I like your nice, thorough explanation, but, I don’t understand shared weights. I mean, do they sum during activation, and then absorb the average correction during backpropagation? I would find a similarly slow-motion explanation of those to be helpful 🙂

Damon says:

September 19, 2016 at 8:37 pm

Amazing! Thank you SO MUCH!

Walid Ahmed says:

September 20, 2016 at 1:17 am

Thanks a lot for the simple yet efficient explanation.
I just have one question: since we are using 3 convolution filters, wouldn’t the number of feature maps from the second layer be 3*3=9?

JuiLang Chu says:

September 24, 2016 at 7:23 am

Very cool and intuitive explanation for CNN. Thank you for your suggestion about the reference resource.

dovahkiin211 says:

October 14, 2016 at 6:42 pm

Like the others said, this is one of the most lucid and intelligible articles on CNNs I have ever read. It took my interest in Deep Learning to a whole other level. Thank you so much for writing this.
I have a question I am having difficulty answering: what exactly is the purpose of zero-padding? I got the high-level explanation that it leads to wide and narrow convolution, but what precisely do “wide convolution” and “narrow convolution” mean? Some concrete math or a short intuitive explanation would be very helpful.

Thanks again for the article.

1. pranavkgPranav says:
  
  March 10, 2017 at 2:00 pm
  
  Padding is used to avoid a rapid decrease in the size of the resulting activation maps across successive layers due to repeated application of convolution filters. This in turn makes it possible to have more hidden layers in your network (a deeper network).
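
To put numbers on this, here is a small sketch using the standard output-size formula for a square input, (n + 2p - f) / s + 1. The helper name `conv_output_size` and the sizes used (a 32×32 input, 5×5 filters, three layers) are illustrative assumptions, not values taken from the post.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Side length of the output for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

# A 5x5 input with a 3x3 filter: 3x3 without padding, 5x5 with one ring of zeros.
print(conv_output_size(5, 3))             # 3
print(conv_output_size(5, 3, padding=1))  # 5

# Stack three 5x5 convolutions on a 32x32 input, with and without padding.
size_unpadded = 32
size_padded = 32
for _ in range(3):
    size_unpadded = conv_output_size(size_unpadded, 5)          # shrinks by 4 each layer
    size_padded = conv_output_size(size_padded, 5, padding=2)   # size is preserved
print(size_unpadded, size_padded)  # 20 32
```

Without padding the maps shrink at every layer, which limits how many convolution layers can be stacked before nothing is left to convolve.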
  
Rudra Ranajee Saha says:

November 3, 2016 at 11:17 pm

Amazing article on ConvNets. One thing I don’t understand: although I know how to update weight values in a multi-layer neural network, how do I update the filter matrix values in a ConvNet during backpropagation?

Ganes Kesari says:

November 12, 2016 at 7:25 pm

An amazing post with relevant examples and complete references. Very helpful as an introduction to CNNs. Thank you!

1. Sushma Bhat says:
  
  March 17, 2017 at 8:46 pm
  
  Thank you for explaining CNNs in simple terms. Great article.
  
Håkon Tjeldnes says:

November 18, 2016 at 9:33 pm

Best CNN explanation so far!

Ashraf says:

November 24, 2016 at 12:00 pm

That was simply the best explanation I have been through. Thanks a lot!!

Dipti says:

November 28, 2016 at 6:12 am

Best post about CNN I have ever read. Thanks a lot!

Alek says:

November 29, 2016 at 3:03 am

Very useful, thank you!

Mohamed Amine BERGACH says:

November 29, 2016 at 8:20 am

Very nice article. Figure 5 doesn’t seem right to me: a convolution will produce an image of 5×5, not 3×3. This is something else.

rodingo says:

November 29, 2016 at 8:24 am

Very nice and clear article. Figure 5 doesn’t seem right to me: a convolution will produce a 5×5 image, not 3×3.
Do you agree?

capcase says:

December 8, 2016 at 3:57 am

Awesome post. Well explained for easy understanding: clear and crisp

I am looking forward to a similar tutorial about NLP and sentiment analysis.

praveen says:

December 8, 2016 at 11:53 am

Thank you for simplifying the concept of CNNs. I am new to machine learning. I have one question.
In simple neural networks there is the vanishing gradient problem. So, if we build a deep network architecture for recognizing many types of objects (for example, 10000 objects), we cannot do it with a plain deep neural network because of the vanishing gradient problem. So people started using autoencoders, which try to learn features in pre-training as unsupervised learning. Then, with supervised fine-tuning, a deep stacked autoencoder performs well. From here, why did people move to CNNs? I could not understand the logic behind it. Could you kindly clarify?

Dan says:

December 9, 2016 at 9:07 pm

Hi,
Very good introduction for someone like me who just finished learning how a classical neural network works and knows nothing about CNNs.

There is something I did not understand: after Pooling Layer 1 we get 6 images of size 14×14. Then we convolve those with 16 filters. I initially thought that the number of images would now be 6*16 = 96, but I see in the visualization (in the link you gave) that the output is 16 images. How come?
1. How does this convolution work? (In layer one we only had 1 input image, so that was easy, but in layer 2 I got lost.)
2. Can you please explain how you arrived at the result that there are:
120 neurons in the first FC layer
100 neurons in the second FC layer
10 neurons in the third FC layer corresponding to the 10 digits – also called the Output layer

Off Chanchana Sornsoontorn says:

December 10, 2016 at 3:56 pm

In your post, you have a picture with 3 color channels, but you explain it as if you had a grayscale image.
Do you convolve all 3 channels with the same kernel? If yes, one feature map should have 3 channels, shouldn’t it?
I’ve seen this explanation left out and I’m really curious.

One more question: with the edge detector kernel, [0 1 0; 1 -4 1; 0 1 0], what if the output is outside of the color range? If that -4 is multiplied by a white pixel, the result could be a large negative number.

Emin Onur Karakaşlar says:

December 11, 2016 at 5:18 pm

Hello,

Great summary of CNN. Thank you.

If I understood it correctly, what a CNN does is to learn the parameters of filters via backpropagation and arrange the filters so that the error is minimized. Is this correct?

tpapastylianou says:

December 12, 2016 at 12:06 am

What an amazing, clear, enjoyable explanation. It was such a pleasure to read. Thank you very much.

Allen Akhaumere says:

December 22, 2016 at 11:19 pm

Well written and intuitively explained for any beginner in ConvNets.

Sinha P K says:

December 30, 2016 at 10:09 pm

Great and very simplified explanation of CNNs. Please continue to write on such recent topics to enlighten beginners.

@DrCawley says:

January 7, 2017 at 5:59 am

A really great primer. Thank you for taking the time to put this up.

Ke Xu says:

January 8, 2017 at 9:10 am

Thanks very much! After reading several posts and watching a couple of videos, I eventually found the most gentle introduction to CNNs!

agmotif says:

January 19, 2017 at 1:00 am

Great Explanation! Thank you!
Have a question regarding the object location, how do you detect the location and size of the bounding box around the detected object? (As shown in figure 2)

Kang says:

January 22, 2017 at 9:54 am

amazing

bibhas2 says:

January 29, 2017 at 11:40 pm

CNN finally made sense. I knew regular NN, relu etc. well. But every explanation of CNN went right into convolution and pooling without explaining what they were. You have done a great job here. If you write a book on NN using this approach I will buy many copies and give one to everyone I know.

Gautam says:

February 7, 2017 at 3:58 am

This is just the best post/explanation I have come across on CNNs. Thanks! Great work!

Srinidhi says:

February 10, 2017 at 1:05 pm

Awestruck by your simplification of details. The best article I have read so far on CNNs. Please do write one on RNNs.

Melvyn Ian Drag says:

February 14, 2017 at 8:16 am

After the final pooling step you have a set of 2d data and input that set into the neural net. Would you please sketch that process?

Dai says:

February 17, 2017 at 1:41 am

This is by far the best explanation / visualization on CNN I’ve seen. Great job.

Someone Somewhere says:

February 18, 2017 at 8:32 pm

I am currently studying for an MS degree in AI at the Polytechnic University of Valencia, and the teachers who teach all this about ANNs, ConvNets, etc. are so incompetent compared to this post. This is exceptional! If only my teachers were half as determined and didactic as you and this post are, students would be infinitely more interested in this lovely research field. It’s absolutely amazing. Keep up the good work.

vivek says:

February 20, 2017 at 6:17 pm

Awesome work, a great introduction for the layman. I am curious to know how people generally decide on the number of hidden layers required, the number of neurons needed for each layer, etc. Coming to filters, are there common standard filters used for convolution in all kinds of object recognition tasks, or do they change based on what we need to classify? It would be helpful if you could give more intuition or heuristics on these things. Thanks for the great post.

wjzholmes says:

February 23, 2017 at 4:38 pm

thanks！

anonymousnomad says:

March 3, 2017 at 1:25 am

Excellent article, ideal for a beginner! Thanks a lot.

Shike Feng says:

March 6, 2017 at 4:35 am

Excellent explanation, thanks a lot

pranavkg says:

March 10, 2017 at 3:11 pm

Hi,

Excellent explanation!!

I would like to know in what sense you are referring to them as cheap in the following statement?
“..adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features.”

As far as I understand they actually increase the total number of parameters for the network.

Navreet says:

March 14, 2017 at 2:29 pm

Hi, firstly, a very good explanation. My doubt is about what happens after finding the error for the first training image and backpropagating to update the weights. What happens next?
1. Is a new image fed in and trained with the weights updated from the previous image, or
2. is the same image fed in again and the weights updated further?

Are the same weights used for all training images, or different ones?
