![](https://crypto4nerd.com/wp-content/uploads/2024/04/1Zt-6kLyfRUraBsS-Iy2E2Q-1024x723.png)
Understanding our journey from logistic regression through MLP, CNN, RNN, LSTM, Transformer, LLM to AGI (someday, maybe)
In the last part, we saw how we can capture the pattern between input and output using mathematical equations, and we ran into the limitations of that approach with unstructured data. Here we will see how neural networks adjust their matrix weights during training to capture the patterns in unstructured data and predict the next output with high accuracy. We will also see how we can capture the spatial patterns of images by performing a mathematical operation called convolution on the matrices of a deep neural network, our first hint of working with unstructured data. We will see how connecting the output back to the input and working with sequences in time steps helps create a 'memory' which can be used to capture sequential patterns like text. Finally, after running into vanishing and exploding gradients, we will see how LSTMs overcome them by adding choice into this memory, allowing the network to forget things that are not important and solving the issues faced by RNNs.
So far we have looked at traditional machine learning: how it is possible to model the pattern between input and output data, then use historical data for training and parameter optimisation to accurately predict the next output. We have looked at different kinds of output (continuous, binary and categorical), but our data was always structured.
How do you deal with data that is unstructured? Sure, you can capture the essence of images, videos and audio files by encoding them as 1s and 0s, but knowing that timestamp 't' has frequency 'f' will not help us identify whether it's a Taylor Swift song. Traditional machine learning algorithms depend on each input feature being informative on its own, not on the spatial or sequential relationships between features. Dealing with this kind of data requires an entirely new approach: Deep Learning!
A Basic Neural Network: Deep learning is the process of creating and training neural networks to perform a specific task. A Multi-Layer Perceptron (MLP) is the most basic form of a neural network. Let's break it down into its elements, starting with the multi-linear regression we saw earlier, whose predicted output is:
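In its standard form, with inputs $x_1 \dots x_n$ and weights $w_1 \dots w_n$:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$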
Representing it in a matrix form:
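$$\hat{y} = \mathbf{w}^{T}\mathbf{x}$$

where $\mathbf{w}$ is the vector of weights and $\mathbf{x}$ the vector of inputs.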
If we add a bias so that our output is non-zero when all inputs are 0, the above equation can be represented like this:
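$$\hat{y} = \mathbf{w}^{T}\mathbf{x} + b$$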
which is also the representation of the Multi-Layer Perceptron. As you can imagine, it is multi-linear regression at heart and can only predict linear patterns. For it to be able to work with all kinds of patterns, we have to make some modifications, the first of which is transforming the output to fit the pattern we want. To do this we pass the output through different types of activation functions, each suited to the pattern we want to predict. For binary classification we use a sigmoid activation function, for multiclass classification we use a softmax, and if the context is extracting patterns from images we use a ReLU.
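As a quick illustration (a minimal NumPy sketch, not from the original post), here is what these three activation functions look like in code:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1): useful for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1: multiclass classification
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def relu(z):
    # Keeps positive values, zeroes out the rest: the workhorse of image networks
    return np.maximum(0, z)

print(sigmoid(0.5), softmax(np.array([1.0, 2.0, 3.0])), relu(np.array([-1.0, 2.0])))
```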
How about capturing different levels of complexities? Will the basic form of a neural network with different activation functions be enough?
Researchers have noticed that when we add what we call 'hidden layers' between the input and output and start training, the neural network is able to represent increasingly complex aspects of our input, purely by combining lower-level features from the previous layer. To give an example: if we had 3 hidden layers and trained the network to identify a smiling face, one layer would capture the presence of the edges of the face, the next hidden layer would capture the presence of cheeks, and the final layer would activate in the presence of a smile, the sophistication increasing with each subsequent hidden layer. So stacking the neural network with hidden layers allows it to build complexity. In fact, the deepest model in the original ResNet paper had 152 layers in total!
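To make the idea of stacked hidden layers concrete, here is a minimal PyTorch sketch; the layer sizes (784 inputs, 128-unit hidden layers, 10 output classes) are arbitrary choices for illustration:

```python
import torch.nn as nn

# A small MLP: input -> 3 hidden layers -> output
mlp = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # hidden layer 1: low-level features
    nn.Linear(128, 128), nn.ReLU(),   # hidden layer 2: combinations of those features
    nn.Linear(128, 128), nn.ReLU(),   # hidden layer 3: higher-level concepts
    nn.Linear(128, 10),               # output layer: one score per class
    nn.Softmax(dim=-1),               # turn scores into class probabilities
)
print(mlp)
```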
Training A Neural Network: So how do we train such a network? We need two things: a loss function and an optimiser. The loss function works exactly as it did in traditional machine learning. Starting with random weights for the parameters of the neural network, we predict an output, compare it against the actual output (i.e. evaluate the loss function), and adjust the parameters working backwards from the output so that the prediction is corrected. To be exact, we update the weights based on the gradient of the loss function. As neural networks involve matrix multiplication, the math can get a bit tedious, but the process largely remains the same.
The loss function we use to compare predicted output with real output depends on the type of output. For solving regression problems, we use mean squared error loss:
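$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the true value, $\hat{y}_i$ the prediction and $N$ the number of samples.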
for classification problems with multiple classes we use the categorical cross entropy loss function:
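$$L_{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\left(\hat{y}_{i,c}\right)$$

where $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability for that class.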
for binary classification we use the binary cross entropy:
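$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\,\right]$$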
When it comes to the optimiser algorithm, its primary purpose is to fine-tune the parameters during training, with the goal of minimising the loss function we saw earlier. We have already seen one such optimiser, Gradient Descent, which updates weights based on the learning rate and the first-order derivative. There are many other optimiser algorithms, the most preferred of which is Adam (Adaptive Moment Estimation). Unlike gradient descent, which maintains a single learning rate throughout training, the Adam optimiser dynamically computes individual learning rates based on the past gradients and their uncentered variance. This adaptive learning rate lets it navigate the optimisation landscape efficiently during training, leading to faster convergence.
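Putting the loss function and the Adam optimiser together, a minimal training loop might look like this (a PyTorch sketch with made-up data, shown only to illustrate the predict, compare, back-propagate, update cycle):

```python
import torch
import torch.nn as nn

# Fake data: 256 samples of 20 features, 3 classes (purely illustrative)
X = torch.randn(256, 20)
y = torch.randint(0, 3, (256,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()                             # categorical cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive per-parameter learning rates

for epoch in range(10):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(X), y)  # predict and compare against the actual output
    loss.backward()              # back-propagate: gradient of the loss w.r.t. every weight
    optimizer.step()             # Adam updates each weight with its own adapted step size
    print(epoch, loss.item())
```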
To summarise things to this point: during this whole process of training, the network decides how it wants to arrange its weights, guided only by its desire to minimise the error in its predictions, and the result is a set of parameters that are very good at capturing patterns in any kind of data and complexity. To truly understand neural networks I would highly recommend watching the very visual explainer by 3blue1brown: https://www.youtube.com/watch?v=aircAruvnKk, including the 3 videos he did after this covering back-propagation. I don't think I am even remotely doing justice by trying to explain it all in text.
What we have done so far is build a basic neural network that is suited to advanced tasks in a way that traditional machine learning algorithms were not, like telling apart the image of an animal from that of a ship.
Impressive as it might be, this basic configuration is not very accurate in its predictions. The primary reason is that in this architecture we still 'flatten' the input images into a single vector, since the hidden layers after the input require a flat array rather than a multi-dimensional one. This operation effectively kills all the information we could have learned had our architecture captured spatial relationships. To see this, just imagine the pixels of an image laid out in a single array: would you be able to understand its content? Clearly not! And if we put them back into a multi-dimensional array? We see things clearly. So what we want is a network that captures this spatial awareness and does not flatten the image pixels!
Also, if you want to train a basic neural network on images with even basic clarity, say 32x32x3, with 3 hidden layers of 128 neurons each, you already need around 430,000 parameters, and at realistic resolutions the count quickly runs into the billions. Not only would training that many parameters be computationally expensive, the model would overfit to the point where it memorises the training data as-is and produces perfect scores in training!
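A quick back-of-the-envelope count for that 32x32x3 example (my own arithmetic, assuming a 10-class output layer):

```python
# Fully connected network on a flattened 32x32x3 image
inputs = 32 * 32 * 3        # 3,072 values after flattening
h1 = inputs * 128 + 128     # weights + biases into hidden layer 1
h2 = 128 * 128 + 128        # hidden layer 2
h3 = 128 * 128 + 128        # hidden layer 3
out = 128 * 10 + 10         # assumed output layer with 10 classes
print(h1 + h2 + h3 + out)   # ~427,000 parameters for a tiny 32x32 image
```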
Convolutional Neural Network: To overcome this parameter explosion and the lack of spatial awareness, we apply the concept of convolution to a neural network, calling it a Convolutional Neural Network (CNN). At the heart of the CNN is the convolution operation, performed by multiplying a convolution filter element-wise with a patch of the image and summing the result. You can think of the convolution filter as a small matrix of parameters that yields a large positive number if it finds the thing it was looking for in that patch of the image. In a way filters behave like human eyes, but trained for one specific task. For example, if the convolution filter is trained to find a nose in an image, it will output a positive number when it finds one.
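A tiny NumPy illustration (an invented example) of one filter applied to a single 3x3 patch:

```python
import numpy as np

patch = np.array([[0, 255, 0],
                  [0, 255, 0],
                  [0, 255, 0]], dtype=float)   # a 3x3 image patch containing a vertical line

filt = np.array([[-1, 2, -1],
                 [-1, 2, -1],
                 [-1, 2, -1]], dtype=float)    # a filter that "looks for" vertical lines

# Convolution at this position: element-wise multiply and sum
response = (patch * filt).sum()
print(response)   # large positive value (1530): the filter found what it was looking for
```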
If we move the convolution filter across an entire image from left to right and top to bottom, recording the output as we go, we obtain a new array that picks out a particular feature of the input image, depending on the values in the filter. And if a new convolution filter is applied to the output of the first, it picks out a different, higher-level feature of the input. So we stack convolution layers one after the other, each learning different features of our image. In practice, the convolution layer closest to the image has the most zoomed-in view and identifies the nitty-gritty details, and as we move further in, the filters effectively zoom out and identify broader structures across the entire image. Up to this point we are just learning the features of the image with the convolution operation; we still need a fully connected layer with an activation function on top of the convolution output to perform the actual image classification task. So, in summary, this is what the overall architecture of a CNN looks like:
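A rough PyTorch sketch of such an architecture (the filter counts, kernel sizes and the 32x32 input are illustrative assumptions, not the exact model from the post):

```python
import torch.nn as nn

# Convolution layers learn the features, the fully connected layer does the classification
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # zoomed-in filters: edges, textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: halve the spatial size
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # broader filters built on the first layer's output
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),                    # fully connected head for a 32x32 input, 10 classes
    nn.Softmax(dim=-1),
)
print(cnn)
```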
The term pooling used above refers to a form of downsizing that uses a 2D sliding window. The window passes over the output of the convolution operation according to a configurable parameter called the stride, the number of pixels the window moves across the input slice from one position to the next. For example, with a stride of 2, the height and width of the output tensor will be half the size of the input tensor. This is useful for reducing the spatial size of the tensor as it passes through the network, while the convolution layers increase the number of channels. We have 2 kinds of pooling options, max and average pooling:
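A small NumPy example of 2x2 max pooling with a stride of 2 (illustrative only):

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]], dtype=float)

# 2x2 max pooling with stride 2: keep the maximum of each non-overlapping 2x2 block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 8.]
                #  [9. 6.]]  -> a 4x4 input becomes a 2x2 output
```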
Another key thing to think about while building a CNN is batch normalisation. During training there is a possibility for the weights of a neural network to grow too large; this is known as the exploding gradient problem, and when it happens the loss function suddenly returns NaN after hours and hours of training. To overcome this, we use a batch normalisation layer that calculates the mean and standard deviation of each of its input channels across the batch and normalises them by subtracting the mean and dividing by the standard deviation. There are then two learned parameters for each channel, the scale (gamma) and the shift (beta). The output is simply the normalised input, scaled by gamma and shifted by beta.
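Written out (the standard batch normalisation formula, with a small constant $\epsilon$ added for numerical stability):

$$\hat{x} = \frac{x - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the channel across the batch.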
We have seen in the case of an MLP how and why it can quickly start overfitting to the training data. In general if an algorithm performs well on the training data, but not the test data, we say that it is overfitting. To reduce this problem, we implement regularisation techniques, just like with traditional machine learning, which ensure that the model is penalised if it starts to overfit. In machine learning we have many ways of doing this but in deep learning we use dropout layers. During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to 0. Incredibly, this simple addition drastically reduces overfitting by ensuring that the network doesn’t become overdependent on certain units or groups of units that, in effect, just remember observations from the training set. If we use dropout layers, the network cannot rely too much on any one unit and therefore knowledge is more evenly spread across the whole network, which is exactly what we want!
So, in summary, CNNs resolved 2 of the biggest drawbacks of a basic neural network:
- Capturing spatial patterns in unstructured data
- Bringing down the number of parameters to be trained from billions to 50K–150K.
But not all unstructured datasets have a spatial pattern. Data like text, audio and time series have a sequential pattern, where the order of the data is most important, and for these CNNs would be a terrible choice as they only see the input they are fed in that moment.
With this we dive into the realm of 'autoregressive models':
Recurrent Neural Networks (RNNs), which are in reality a very basic form of the now notorious 'generative models', are built to capture the sequential nature of input data. Let's dive into their architecture and understand how:
The most important challenge that an RNN has to solve is capturing the context of sequential data. It is the context that defines what the next word in a sentence should be. If I say:
Excellent restaurant! Great food, great ______!
What comes next is most likely the word 'service'. We can say this because we know the context: it's a restaurant, I am happy with it, I have commented on the food, and I will most probably talk about the service next. Let's see how an RNN would preserve the context in this example, process the sentence and predict the word at the end. A basic form of an RNN takes 2 inputs and generates 2 outputs at every time step. The 2 inputs are: the current word of the sentence (one word per time step) and the hidden state from the previous time step, a(t-1). The 2 outputs are: the hidden state for the current time step, a(t), and the predicted output, y(t). In different configurations an RNN can be:
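- one-to-one: a single input to a single output, like a standard network
- one-to-many: one input to a sequence of outputs, e.g. image captioning
- many-to-one: a sequence of inputs to one output, e.g. sentiment classification
- many-to-many: a sequence to a sequence, e.g. translation or predicting the next word at every step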
In the most basic form, this is how an RNN is configured:
Here:

- a(t) = the hidden state for time step t
- y(t) = the predicted output for time step t
- x(t) = the input for time step t
- Wa, Wx, Wy = the parameter matrices applied to the previous hidden state, the current input, and the hidden state that produces the output, respectively
The hidden state is what makes an RNN compatible with sequences; you can think of the hidden state at time t as the network's current understanding of the sequence at that point. With each passing time step and each new input, the hidden state is updated. So the hidden state at time step t depends on the input at time step t and the previous hidden state a(t-1). Putting it into an equation:
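Using a tanh activation for the hidden state and a softmax for the output (the usual choices in a vanilla RNN, assumed here):

$$a^{(t)} = \tanh\left(W_a\,a^{(t-1)} + W_x\,x^{(t)} + b_a\right)$$

$$y^{(t)} = \text{softmax}\left(W_y\,a^{(t)} + b_y\right)$$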
Substituting the hidden state from the first equation into the second:
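$$y^{(t)} = \text{softmax}\left(W_y\,\tanh\left(W_a\,a^{(t-1)} + W_x\,x^{(t)} + b_a\right) + b_y\right)$$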
Now that we know the architecture, let's understand how it works. We start at time step t=0 with the first word of the sentence as the input, x(0) = 'Excellent'. The other input, the hidden state from the previous time step, is 0 as this is the first time step. We assign random initial weights to Wa, Wx and Wy and calculate both outputs: the hidden state to pass forward, a(0), and the predicted word, y(0). We then use a(0) as an input for the next time step t=1, where the other input is the next word in the sequence, 'restaurant'. We can already see that when working with the second word, information about the first word is also passed in, so the network has the entire sequence up to that point as context. We move forward through the next time steps in the same way until we reach time step t=4, where we predict the final word.
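As a rough sketch (NumPy, with made-up dimensions, random weights and one-hot word vectors as placeholders), the forward pass over the sentence looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 16, 5                     # hidden state size and toy vocabulary size (assumptions)
Wa = rng.normal(size=(hidden, hidden))    # previous hidden state -> hidden state
Wx = rng.normal(size=(hidden, vocab))     # current input word    -> hidden state
Wy = rng.normal(size=(vocab, hidden))     # hidden state          -> predicted next word

words = ["Excellent", "restaurant", "Great", "food", "great"]
x = np.eye(vocab)                         # one-hot vector per word, for illustration

a = np.zeros(hidden)                      # no context before the first word
for t, word in enumerate(words):
    a = np.tanh(Wa @ a + Wx @ x[t])       # update the hidden state: new word + previous context
    scores = Wy @ a
    y = np.exp(scores) / np.exp(scores).sum()   # softmax over the toy vocabulary
    print(t, word, "-> predicted word index:", y.argmax())
```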
A major question arises at this point. We used random weights to get the outputs at each time step, so our predictions are bound to be random too. How can we train the network to get better weights and make accurate predictions? Can we use the same back-propagation method that we used to train the CNN? Yes, we can! But the training takes place across the time steps. Training the early time steps is easy, but it gets complicated as the number of time steps grows, resulting in the single biggest problem of using RNNs at scale: the vanishing/exploding gradient. We have 3 parameter matrices to optimise here, so we calculate the gradient with respect to all of them. The gradient with respect to the output weights (Wy) is computed quickly, as it is done at each time step without many dependencies, but the gradients with respect to the hidden state weights (Wa) and the input weights (Wx) have to flow back through all the previous time steps, which is what can potentially cause issues.
Every single hidden layer that participated in the calculation of the final output should have its weights updated in order to minimise that error. As changing the weights of the first hidden layer affects the output of even the final layer, we have to optimise its weights too:
Adding the cost function:
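$$L = \mathcal{L}\left(y^{(4)},\, y_{\text{true}}\right)$$

for example a cross-entropy between the predicted word at the final time step and the word that actually follows (the exact choice of loss does not change the argument below).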
Applying the chain rule to get the derivative of the loss w.r.t. Wa:
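$$\frac{\partial L}{\partial W_a} = \frac{\partial L}{\partial y^{(4)}} \cdot \frac{\partial y^{(4)}}{\partial a^{(4)}} \cdot \frac{\partial a^{(4)}}{\partial W_a}$$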
Here, in the last fraction ∂a4/∂Wa, a4 is a function of a3 and Wa, both of which are variables and cannot be treated as constants. So we apply the multivariable chain rule: if in a function f(x, y) each of x and y is again a function of two variables u and v (i.e. x = x(u, v) and y = y(u, v)), then:
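$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$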
Applying this to our equation, it becomes:
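$$\frac{\partial a^{(4)}}{\partial W_a} = \sum_{k=1}^{4}\frac{\partial a^{(4)}}{\partial a^{(k)}}\,\frac{\partial a^{(k)}}{\partial W_a}$$

where $\frac{\partial a^{(k)}}{\partial W_a}$ is the immediate derivative at step k (treating $a^{(k-1)}$ as constant), and the term for k = 1 contains the partial derivative of a4 with respect to a1.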
Here we have to apply chain rule to get the partial derivative of a4 w.r.t a1:
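$$\frac{\partial a^{(4)}}{\partial a^{(1)}} = \frac{\partial a^{(4)}}{\partial a^{(3)}}\cdot\frac{\partial a^{(3)}}{\partial a^{(2)}}\cdot\frac{\partial a^{(2)}}{\partial a^{(1)}}$$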
Now we have one of 2 situations: the value of each of these partial derivatives is either more than 1, or, if there is a batch normalisation step or a squashing activation function, less than 1. Let's assume first that they are more than 1, say 2; the gradient quickly becomes:
Gradient = 2 × 2 × 2 × 2 × 2 × 2 = 64!
Now when we try to use this gradient value to actually back propagate through the network by changing the parameter weights using:
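$$W_{\text{new}} = W_{\text{old}} - \alpha \cdot \text{gradient}$$

where $\alpha$ is the learning rate.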
It becomes harder and harder to control the process as the sentence grows longer. This is the exploding gradient problem. In reality we can cap the value and work around it (gradient clipping). But there is a problem that we cannot work around, and that's the vanishing gradient problem. If instead the partial derivatives are squashed to values between -1 and 1, we face a different issue. For a value of say 0.8, the gradient will now be:
Gradient = 0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.8 ≈ 0.26!
Now we wouldn't be able to change the weights much at the early time steps, because the gradient has almost vanished (hence the name: vanishing gradient problem). In a typical sentence with 10–20 words, we can imagine these problems becoming much worse. A better application of an RNN is auto-fill, as the characters in a word form a short, finite sequence; this is why RNNs were brought to production for that use case much earlier.
So, in summary: RNNs are able to handle sequential data much better because of their 'memory', but they suffer from vanishing and exploding gradient problems, especially when the sequences are long, which makes them very hard to train. RNNs are also only good at remembering recent inputs; the influence of earlier inputs decays quickly as we move further along the sequence. In language it is very important to remember specific inputs from much earlier, as they often set the context for what comes next more than the most recent input does. So how can we overcome these issues with an RNN?
Long Short-Term Memory (LSTM) uses many short-term memory cells to create a long-term memory. It's a type of RNN that uses a memory cell and gates to control the flow of information, so it can selectively retain or discard information, avoiding the vanishing gradient problem of RNNs. For this reason almost no one uses a plain recurrent neural network anymore; instead, some variant of the LSTM is used. Let's look at the architecture of the base variant: