Quick Introduction to Convolutional Neural Networks (CNN) | by Salar

Convolution Operation

In mathematics, convolution is an operation that combines two functions to produce a third function that expresses how one of the original functions modifies the other. For functions of a continuous variable, it is defined as follows:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

Here, f(t) and g(t) are the two functions that we want to convolve, τ is the variable of integration, and f * g is the resulting convolution function. The symbol * denotes the convolution operation, and the integral sign represents integration over the entire real line.
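For sampled data the integral becomes a sum, (f * g)[n] = Σ_k f[k] g[n − k]. The following is a minimal sketch of that discrete form using NumPy; the function name `convolve_1d` and the sample values are made up for illustration and are not part of the original article.

```python
import numpy as np

def convolve_1d(f, g):
    """Full discrete convolution: (f * g)[n] = sum over k of f[k] * g[n - k]."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for k in range(len(f)):
            # Only accumulate terms where the index n - k falls inside g.
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out

# Illustrative values only.
signal = [1.0, 2.0, 3.0]
kernel = [0.0, 1.0, 0.5]
result = convolve_1d(signal, kernel)
print(result)  # [0.  1.  2.5 4.  1.5]
```

The explicit loops are there for readability. NumPy's built-in `np.convolve(signal, kernel)` computes the same full convolution and should be preferred in practice.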

In the context of image processing, we can think of f as the image we want to analyze and g as a filter, or kernel, that we use to extract features from the image. Because an image is a discrete grid of pixels, the integral becomes a sum over pixel positions. The resulting convolution represents how the filter responds to the image at each point, and it can be used to highlight certain features or structures in the image.

[Figure: illustration of a filter applied to an image]
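As a rough sketch of what this looks like for an image, the code below slides a small kernel over a 2D array and sums the element-wise products at each position. The helper name `convolve_2d`, the toy image, and the two-pixel edge kernel are all assumptions chosen for illustration. No padding is used, so the output is slightly smaller than the input.

```python
import numpy as np

def convolve_2d(image, kernel):
    """2D convolution without padding ("valid" mode)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # True convolution flips the kernel along both axes before sliding it.
    flipped = kernel[::-1, ::-1]
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * flipped)
    return out

# Toy image: dark left half, bright right half (a vertical edge).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hypothetical kernel that responds to horizontal changes in brightness.
edge_kernel = np.array([[1.0, -1.0]])

edges = convolve_2d(image, edge_kernel)
print(edges)  # each row is [0. 1. 0.]: the filter fires only at the edge
```

Note that most deep learning libraries skip the kernel flip and compute cross-correlation instead. Since the filter weights are learned, the distinction makes no practical difference there.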

In a CNN, convolution is used as a building block to analyze images. The network is typically composed of several layers, each of which applies a convolution operation followed by a non-linear activation function. The output of each layer is then passed to the next layer, and so on, until the final output is produced.

More formally, let’s say we have an input image X with dimensions W x H x C, where W and H are the width and height of the image, and C is the number of channels. A convolutional layer in a CNN can be represented as a function that takes the input image X and produces an output feature map Y with dimensions W’ x H’ x K, where W’ and H’ are the width and height of the output feature map, and K is the number of filters.
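The article does not say how W’ and H’ are obtained. Under the common assumption of a square filter of size F, stride S, and zero padding P (none of which appear in the text above), the output size along each axis is floor((W − F + 2P) / S) + 1. A small helper, hypothetical and for illustration only, makes the arithmetic explicit:

```python
def conv_output_shape(width, height, kernel_size, num_filters, stride=1, padding=0):
    """Return (W', H', K) for a convolutional layer.

    Assumes a square kernel and the same stride and padding on both axes.
    """
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    return out_w, out_h, num_filters

# 32 x 32 input, sixteen 5 x 5 filters, no padding.
print(conv_output_shape(32, 32, kernel_size=5, num_filters=16))            # (28, 28, 16)
# 28 x 28 input, eight 3 x 3 filters, padding of 1 keeps the spatial size.
print(conv_output_shape(28, 28, kernel_size=3, num_filters=8, padding=1))  # (28, 28, 8)
```

Note that the number of input channels C does not appear in the result: each filter spans all C channels and produces a single output channel, so the output depth is determined by K alone.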

In a CNN, convolutional layers are used to extract features from an input image. Each layer is composed of several filters, each of which is responsible for detecting a specific feature in the image. For example, one filter might learn to detect edges, while another might learn to detect corners. Applied to a handwritten digit, different filters pick out different features of the stroke.

[Figure: features detected by different filters on a handwritten number]
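To see how several filters yield several feature maps, here is a minimal sketch of a single convolutional layer followed by a ReLU activation. The function name `conv_layer`, the array layout (height, width, channels), the choice of ReLU, and the random weights are all assumptions made for this example. In a real network the filter values would be learned during training.

```python
import numpy as np

def conv_layer(x, filters, bias):
    """Apply K filters to an input of shape (H, W, C), then ReLU.

    filters has shape (K, F_h, F_w, C); the result has shape (H', W', K).
    No padding, stride 1. Uses cross-correlation (no kernel flip), as is
    conventional in CNN implementations.
    """
    h, w, c = x.shape
    k, fh, fw, fc = filters.shape
    if fc != c:
        raise ValueError("filter depth must match the number of input channels")
    out = np.zeros((h - fh + 1, w - fw + 1, k))
    for n in range(k):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + fh, j:j + fw, :]
                out[i, j, n] = np.sum(patch * filters[n]) + bias[n]
    return np.maximum(out, 0.0)  # ReLU: keep positive responses only

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))                    # stand-in for an 8 x 8 RGB image
filters = rng.standard_normal((4, 3, 3, 3))  # K = 4 random 3 x 3 filters
bias = np.zeros(4)

y = conv_layer(x, filters, bias)
print(y.shape)  # (6, 6, 4): one 6 x 6 feature map per filter
```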

Pooling Layer

After applying the convolutional layer, the output is typically passed through a pooling layer. The purpose of the pooling layer is to reduce the spatial dimensions of the feature maps while preserving the important features. Pooling is done by dividing each feature map into regions, typically non-overlapping, and computing a summary statistic, such as the maximum or average, for each region.

For example, in max pooling, the maximum value in each region is selected as the representative value for that region. The output of the pooling layer is a set of smaller feature maps, which are fed into the next convolutional layer. The process of applying convolution and pooling layers is repeated multiple times, resulting in a hierarchical set of features that captures increasingly complex patterns in the input image.
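Max pooling over non-overlapping regions can be sketched in a few lines of NumPy. The function name `max_pool_2d`, the 2 x 2 region size, and the sample feature map are illustrative assumptions. Rows and columns that do not fill a complete region are simply dropped here, which is one of several possible conventions.

```python
import numpy as np

def max_pool_2d(feature_map, size=2):
    """Max pooling over non-overlapping size x size regions."""
    h, w = feature_map.shape
    # Trim so the map divides evenly into regions.
    h_trim, w_trim = h - h % size, w - w % size
    trimmed = feature_map[:h_trim, :w_trim]
    # Split each axis into (number of regions, position within region).
    blocks = trimmed.reshape(h_trim // size, size, w_trim // size, size)
    # Take the maximum over the two within-region axes.
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
], dtype=float)

pooled = max_pool_2d(feature_map)
print(pooled)
# [[4. 2.]
#  [2. 8.]]
```

Replacing `.max` with `.mean` in the last line of the function gives average pooling instead.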

Fully Connected Layer

The final layer of a CNN is typically a fully connected layer, which takes the output of the convolutional and pooling layers and produces a set of probabilities, indicating the likelihood of the input image belonging to each class in the classification task. The operation of a fully connected layer can be represented by the following formula:

$$Y = \sigma(WX + b)$$

Here, X is the input vector, W is the weight matrix, b is the bias vector, and σ is the activation function. The output of this operation is the vector Y, which contains the probabilities for each class.
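The article does not name the activation function σ. Since the output is described as a set of class probabilities, the sketch below assumes softmax, which is a common choice for that purpose. The function names, layer sizes, and random weights are placeholders for illustration.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def fully_connected(x, weights, bias, activation=softmax):
    """Compute Y = activation(W X + b) for a single input vector."""
    return activation(weights @ x + bias)

rng = np.random.default_rng(0)
x = rng.random(8)                      # flattened features from earlier layers
weights = rng.standard_normal((3, 8))  # 3 classes, 8 input features
bias = np.zeros(3)

probs = fully_connected(x, weights, bias)
print(probs.shape)  # (3,)
print(probs.sum())  # sums to 1, up to floating-point rounding
```

The predicted class would then be the index of the largest entry, for example `int(np.argmax(probs))`.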

Conclusion

In summary, a CNN is a type of neural network that is particularly good at image classification tasks. It uses convolutional layers to extract features from an input image, and pooling layers to reduce the dimensionality of the feature maps. The output of the final layer is a set of probabilities, indicating the likelihood of the input image belonging to each class in the classification task.