![](https://crypto4nerd.com/wp-content/uploads/2024/02/1I7FqcFOJpYRVXDatqJBj8Q-1024x512.png)
In recurrent neural networks, processing long sequences can pose significant challenges. In our previous blog post, we explored the difficulties revealed by examining the gradient term of the RNN architecture, namely the occurrence of vanishing or exploding gradients. Fortunately, the LSTM architecture was developed to overcome these obstacles, using an approach similar to skip connections. So, let's look deeper into LSTMs and how they address these issues.
LSTM is an architecture that was introduced to reproduce the concept of skip connections in RNNs. The model itself consists of a chain of identical "cells", and each cell has a few components, each in charge of a different task. Let us break the cell down into its main components:
The Cell State
The cell state, denoted $c_t$, is the analog of the skip connection, and should be thought of as the "memory" passed along from previous cells. With that intuition in mind, notice that $c_t$ can be multiplied by a number in $(0,1)$ (the output of some sigmoid), which allows us to control how much of past iterations we wish to "remember" ($0$ forgets the history entirely, $1$ lets it pass fully). Finally, the result of processing $h_{t-1}$ and $x_t$ is added to $c_t$ with "$+$". This matters because the derivative of addition is the identity: the gradient flowing along the cell state is never multiplied by a weight matrix, so $c_t$ does not contribute to vanishing gradients.
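To make the gating concrete, here is a tiny NumPy sketch (the values and names are hypothetical, chosen only to illustrate the multiply-then-add structure):

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])   # "memory" arriving from the previous cell
f      = np.array([0.99, 0.01, 0.5])  # sigmoid outputs in (0, 1): keep / forget / halve

new_content = np.array([0.1, 0.3, -0.2])  # what the current step wants to add
c = f * c_prev + new_content              # scale the old memory, then add
print(c)  # ~[2.08, 0.29, 0.05]
```

Note that the gradient through the "$+$" is the identity, which is exactly why the additive path behaves like a skip connection.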
The Forget State
Both the current input $x_t$ and the output of the previous cell $h_{t-1}$ are fed into a layer that outputs a value $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \in (0,1)^n$ (the notation $[\cdot\,,\cdot]$ means concatenation). This value, as described above, is what chooses whether to "forget" or "remember" $c_t$.
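As a minimal sketch of this layer (NumPy, with hypothetical shapes and names, not any particular library's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 4, 3                            # hidden size and input size (arbitrary)
rng = np.random.default_rng(0)
W_f = rng.standard_normal((n, n + d))  # acts on the concatenation [h_{t-1}, x_t]
b_f = np.zeros(n)

h_prev = rng.standard_normal(n)        # h_{t-1}, output of the previous cell
x_t    = rng.standard_normal(d)        # current input

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry lies in (0, 1): a per-coordinate "remember" fraction
```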
The Update State
The current input and previous output can update the previous memory by adding new memory $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$. Before $\tilde{C}_t$ is added to the previous cell state, we also calculate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, which provides a percentage of how much of $\tilde{C}_t$ we wish to transfer. We use $\tanh$ because it allows an update that can both add and subtract (as $\tanh(x) \in [-1,1]$).
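In the same sketched setup as above (hypothetical names again), the candidate memory and its gate look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d = 4, 3
rng = np.random.default_rng(0)
W_C, b_C = rng.standard_normal((n, n + d)), np.zeros(n)
W_i, b_i = rng.standard_normal((n, n + d)), np.zeros(n)

h_prev, x_t = rng.standard_normal(n), rng.standard_normal(d)
z = np.concatenate([h_prev, x_t])

C_tilde = np.tanh(W_C @ z + b_C)  # candidate memory in [-1, 1]: can add or subtract
i_t     = sigmoid(W_i @ z + b_i)  # in (0, 1): how much of the candidate to admit
```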
Forgetting and Updating Cell State
This state is responsible for updating the memory given the outputs of the Update state and the Forget state; that is, it sets $c_t = f_t \times c_{t-1} + i_t \times \tilde{C}_t$, where $\times$ is element-wise multiplication.
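The update itself is one line of element-wise arithmetic; the placeholder values below stand in for the gate outputs computed above:

```python
import numpy as np

f_t     = np.array([0.9, 0.1, 0.5])   # forget gate output
i_t     = np.array([0.2, 0.8, 0.5])   # update gate output
C_tilde = np.array([0.5, -1.0, 0.0])  # candidate memory
c_prev  = np.array([2.0, 1.0, -3.0])  # previous cell state

c_t = f_t * c_prev + i_t * C_tilde    # keep a fraction, add a fraction
print(c_t)  # [ 1.9 -0.7 -1.5]
```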
The Output State
The last state is responsible for generating the output of the current cell, the hidden state $h_t$, and for propagating $h_t$ to the next cell alongside $c_t$. Specifically, we have $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ and $h_t = o_t \times \tanh(c_t)$, which means the hidden state is a gated fraction of the cell's memory.
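Putting the four components together, here is a complete single-step LSTM cell as a self-contained NumPy sketch (parameter names and shapes are my own choices, not a specific library's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # update gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # candidate memory
    c_t = f_t * c_prev + i_t * C_tilde           # forget, then add new memory
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                     # gated view of the memory
    return h_t, c_t

# Toy usage: run a length-5 sequence through the cell.
n, d = 4, 3
rng = np.random.default_rng(0)
p = {f"W_{g}": rng.standard_normal((n, n + d)) for g in "fiCo"}
p.update({f"b_{g}": np.zeros(n) for g in "fiCo"})

h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((5, d)):
    h, c = lstm_cell(x, h, c, p)
```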
With an LSTM architecture, the problem of vanishing gradients is greatly reduced, as the cell state can pass either long or short memories along without changing them, which is what a skip connection is all about. This allows us to build a deep model with long time dependencies. Do notice, though, that there are still assignments of the LSTM's parameters that will result in a vanishing gradient, and therefore the method isn't bulletproof.
While the LSTM architecture has been a popular choice due to its ability to capture long-term dependencies and mitigate the vanishing-gradient problem, the Gated Recurrent Unit (GRU) architecture has gained attention as a simpler yet effective alternative. The GRU has fewer parameters and can be faster to train, while still being able to capture long-term dependencies and handle variable-length sequences.
The GRU can be summarized neatly in the following figure:
The first thing to notice is that we do not use $c_t$ anymore: the hidden state $h_t$ holds the information that was previously attributed to $c_t$. More formally, $h_t$ can be either updated (using $z_t$) or reset (using $r_t$) in the following manner: if we choose to reset, $r_t$ is set to $(0,0,\dots,0)$ and $z_t$ is set to $(1,1,\dots,1)$, which results in an updated hidden state $h_t$ that is affected only by the input at the current time step, $x_t$. Eventually, the hidden state is a combination of the candidate state $\tilde{h}_t$ and the previous state $h_{t-1}$, summed with weights determined by $z_t$: if $z_t$ is closer to $1$, $h_t$ will be less similar to the previous $h_{t-1}$, and vice versa (see figure). Also notice that the candidate update uses $\tanh$, so as to incorporate both addition and subtraction.
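The post does not spell out the GRU equations, so for concreteness here is the standard formulation as a NumPy sketch, written to match the convention above where $z_t$ close to $1$ means "replace" (some libraries swap the roles of $z_t$ and $1-z_t$); all names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, p):
    """One GRU step: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t."""
    z = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["W_z"] @ z + p["b_z"])  # update gate: blend old vs. new
    r_t = sigmoid(p["W_r"] @ z + p["b_r"])  # reset gate: r_t = 0 wipes h_{t-1}
    # The reset gate scales h_{t-1} before the tanh candidate is formed.
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + p["b_h"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage, mirroring the LSTM sketch above.
n, d = 4, 3
rng = np.random.default_rng(0)
p = {f"W_{g}": rng.standard_normal((n, n + d)) for g in "zrh"}
p.update({f"b_{g}": np.zeros(n) for g in "zrh"})

h = np.zeros(n)
for x in rng.standard_normal((5, d)):
    h = gru_cell(x, h, p)
```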