Convolutional Neural Networks - Deep Dive

Convolutional Neural Networks (CNNs) use layers specifically designed for image data. These layers capture local relationships between nearby features in an image.

Previously, in our feed-forward model, we multiplied our normalized pixels by a large weight matrix (of shape (65536, 100)) to generate our next set of features. However, when we use a convolutional layer, we learn a set of smaller weight tensors, called filters (also known as kernels). We move each of these filters (i.e. convolve them) across the height and width of our input, to generate a new "image" of features. Each new "pixel" results from applying the filter to that location in the original image.
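To make the contrast concrete, here is a minimal sketch (using PyTorch; the framework choice is our own assumption, not something specified above) comparing the parameter count of the earlier fully connected layer with that of a convolutional layer built from 100 small 3x3 filters:

```python
import torch.nn as nn

# A fully connected layer over a flattened 256x256 image:
fc = nn.Linear(65536, 100)   # 65536 * 100 weights + 100 biases = 6,553,700 parameters

# A convolutional layer that slides 100 small 3x3 filters over the image:
conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=3)
# 100 * (1 * 3 * 3) weights + 100 biases = 1,000 parameters

print(sum(p.numel() for p in fc.parameters()))    # 6553700
print(sum(p.numel() for p in conv.parameters()))  # 1000
```

The convolutional layer achieves this saving because the same small set of weights is reused at every location in the image.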

In a convolutional neural network, feature maps are the result of convolving a single filter across our input, and they provide a way to visualize a model's internal workings. They allow us to see how our network responds to a particular image in ways that are not always apparent when we only examine the raw filter weights.
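As a hypothetical PyTorch sketch, each output channel of a convolutional layer is one feature map; indexing into the output gives a 2D array that can be plotted like a grayscale image:

```python
import torch
import torch.nn as nn

# 8 filters -> 8 feature maps per input image.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 1, 256, 256)   # (batch, channels, height, width)
feature_maps = conv(image)
print(feature_maps.shape)             # torch.Size([1, 8, 256, 256])

# feature_maps[0, k] is the feature map produced by the k-th filter;
# plotting it as an image shows where in the input that filter responds.
```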

Watch this video to understand how kernels are used to extract features (like edges).

Configuring a Convolutional Layer - Stride and Padding

Watch these videos to understand the concepts of padding and stride in convolution.

Padding: extra (typically zero-valued) pixels added around the border of the input, so the filter can be applied at the edges and the output size can be controlled.

Stride: the number of pixels the filter moves between applications; a stride greater than 1 downsamples the output.
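These two settings determine the output size: for an n x n input, an f x f filter, padding p, and stride s, each output side has size floor((n + 2p - f) / s) + 1. A small sketch (PyTorch assumed) that checks the formula:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # a toy 8x8 single-channel input

# "Same" padding with stride 1: the output keeps the 8x8 spatial size.
same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
print(same(x).shape)     # torch.Size([1, 1, 8, 8])  -> (8 + 2 - 3)/1 + 1 = 8

# No padding ("valid") with stride 1: the output shrinks to 6x6.
valid = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
print(valid(x).shape)    # torch.Size([1, 1, 6, 6])  -> (8 - 3)/1 + 1 = 6

# Stride 2: the filter skips every other position, halving the output to 4x4.
strided = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)  # torch.Size([1, 1, 4, 4])  -> floor((8 + 2 - 3)/2) + 1 = 4
```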

Convolution is not limited to 2D; it extends to higher dimensions as well. Watch this video to see how convolution can be applied in 3D.
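One common case worth sketching (our own example; the video's specifics may differ) is an RGB input, where each filter is itself a small 3D volume spanning all of the color channels, even though it only slides over height and width:

```python
import torch
import torch.nn as nn

# 16 filters over a 3-channel input: each filter is a 3x3x3 volume.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

rgb = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)
print(conv(rgb).shape)           # torch.Size([1, 16, 64, 64])
```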

Configuring a Simple Convolutional Network

Watch this video to see how the concepts above are applied to design and build a convolutional network.
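Here is a minimal sketch of such a network (PyTorch assumed; the layer sizes are illustrative choices of ours), using stride-2 convolutions to shrink the spatial size while growing the number of feature maps:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),   # 256x256 -> 256x256, 16 maps
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 256x256 -> 128x128, 32 maps
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 128x128 -> 64x64, 64 maps
    nn.ReLU(),
)

x = torch.randn(1, 1, 256, 256)
print(net(x).shape)  # torch.Size([1, 64, 64, 64])
```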

Pooling Layers

Pooling layers are used to reduce the size of representations, which speeds up computation and makes feature detection more robust. Watch this video to learn more about how pooling layers work in CNNs.
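A tiny sketch (PyTorch assumed) of max pooling, which keeps the largest value in each 2x2 window, halving the height and width with no learned parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 1., 0., 2.],
                    [5., 6., 3., 4.]]]])
print(pool(x))
# tensor([[[[4., 8.],
#           [9., 4.]]]])
```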

[Optional] - For more on convolution arithmetic, you can read Chapters 1, 2, and 3 of this paper.

CNN

Watch this video to understand how all the concepts above come together with a fully connected neural network.
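An end-to-end sketch (PyTorch assumed; all layer sizes are illustrative) in which convolution and pooling stages extract features, and a fully connected head maps them to class scores:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),               # 256x256 -> 128x128
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),               # 128x128 -> 64x64
    nn.Flatten(),                  # 32 * 64 * 64 = 131072 features
    nn.Linear(32 * 64 * 64, 100),  # fully connected classifier head
)

x = torch.randn(1, 1, 256, 256)
print(model(x).shape)  # torch.Size([1, 100])
```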

Why do convolution-based approaches work well for image data?

Convolution can reduce the size of an input image using only a few parameters. Filters compute new features by only combining features that are near each other in the image. This operation encourages the model to look for local patterns (e.g., edges and objects).

Convolutional layers produce similar outputs even when the objects in an image are translated (for example, whether a cat appears at the top or the bottom of the frame). This is because the same filters are applied across the entire image.
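This translation property can be checked directly. In the sketch below (PyTorch assumed), shifting the input image sideways shifts the convolution output by the same amount, up to effects at the image borders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)
shifted = torch.roll(x, shifts=3, dims=3)  # slide the image 3 pixels to the right

out, out_shifted = conv(x), conv(shifted)

# Away from the borders, shifting the output equals the output of the shift:
interior = slice(4, -4)
print(torch.allclose(torch.roll(out, 3, dims=3)[..., interior],
                     out_shifted[..., interior]))
# True
```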

Before deep nets, researchers in computer vision would hand-design these filters to capture specific information. For example, a 3x3 filter could be hard-coded to activate when convolved over pixels along a vertical or horizontal edge.
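As an illustrative sketch (PyTorch assumed), here is a classic hand-designed vertical-edge filter, the Sobel kernel, loaded into a convolutional layer instead of being learned:

```python
import torch
import torch.nn as nn

# A hand-designed 3x3 filter that responds to left-to-right brightness changes.
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(sobel_x.view(1, 1, 3, 3))

# An image that is dark on the left and bright on the right:
image = torch.zeros(1, 1, 5, 5)
image[..., 2:] = 1.0

response = conv(image)
print(response[0, 0])  # strong responses around the dark-to-bright boundary
```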

Watch this video to get a more detailed understanding of why convolutions work well.