![](https://crypto4nerd.com/wp-content/uploads/2023/11/10AKHNL0I72kKaJPk3BvR3g-1024x576.jpeg)
Transformers
The transformer architecture was introduced in a 2017 paper by Vaswani et al. from Google Brain. The paper, titled “Attention Is All You Need,”¹ proposed the transformer to overcome a key limitation of sequence-to-sequence models: their difficulty retaining information from the first elements of a sequence as new elements are incorporated.
Initially, it was created for the machine translation task, but its effectiveness on this task has since been extended to a wide range of natural language processing applications, including language modeling, text classification, and question answering. It has also been successfully applied to computer vision tasks, such as image captioning and visual question answering.
To truly grasp the transformer architecture, it’s essential to understand the attention mechanism, which is its cornerstone. Attention mechanisms in neural networks were inspired by the intuitive way humans focus on different parts of an image or passage of text when trying to understand or analyze it. In a similar fashion, attention in neural networks allows the model to weigh the importance of different parts of the input data when making predictions.
The Mechanics of Attention
For computers, all data must be represented numerically. While humans intuitively grasp the meaning of the word “jam,” computers must convert it into a number or a sequence of numbers. This presents a challenge because the word “jam” carries distinctly different meanings in the sentences: “The bread has a delicious strawberry jam” and “I spent two hours in the traffic jam.” Basically, it is necessary to pay attention to the context in which “jam” is used, recognizing that in one instance, it refers to strawberry preserves, while in the other, it denotes a traffic delay.
To address this, natural language processing employs word embeddings, which map words into a high-dimensional space where the distance and direction between vectors help capture context. However, embeddings alone are not enough to resolve ambiguity. This is where a specific attention mechanism called self-attention comes into play. It enables the model to dynamically focus on different parts of a sentence to derive meaning. In our examples, an attention mechanism would likely associate “strawberry” and “delicious” closely with “jam” in the first sentence, linking it to food. In contrast, in the second sentence, “traffic” and “two hours” would inform the model that “jam” relates to a congestion scenario.
Embeddings and self-attention allow machines to perform interpretations of words with multiple meanings based on their context. This is important for tasks such as machine translation, sentiment analysis, and information extraction, where understanding the meaning of words in a context (e.g., “jam” in our example) can significantly change the result.
To better understand, let’s begin with an example. The initial phase of the transformer model, known as “Input Embedding,” converts all words in a sentence into their respective embedding representations. Typically, each word is represented by a vector containing a predefined number of elements, often 512. Take the sentence “I spent two hours in the traffic jam,” for instance; this would result in eight vectors, one for each word, with each vector comprising 512 elements. Notably, the initial embedding vector for the word “jam” is the same in both sentences: “I spent two hours in the traffic jam” and “The bread has a delicious strawberry jam.” However, these are not the final representations of the words within the model. Through subsequent layers that incorporate self-attention, the transformer refines these embeddings by considering the surrounding context. Consequently, this yields a nuanced representation of “jam” — in the first sentence, it’s linked with “traffic,” and in the second, with “strawberry.”
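A minimal sketch of this lookup step, using random placeholder vectors (a real model learns the embedding table during training; the vocabulary here is a hypothetical fragment covering just our two example sentences):

```python
import numpy as np

# Toy vocabulary and a random embedding table: one 512-dim vector per word.
# The values are random placeholders; a real model learns them in training.
rng = np.random.default_rng(0)
vocab = ["I", "spent", "two", "hours", "in", "the", "traffic", "jam",
         "bread", "has", "a", "delicious", "strawberry"]
embedding_table = rng.normal(size=(len(vocab), 512))

def embed(sentence):
    """Map each word to its 512-dim embedding vector."""
    return np.stack([embedding_table[vocab.index(w)] for w in sentence])

s1 = embed("I spent two hours in the traffic jam".split())
s2 = embed("the bread has a delicious strawberry jam".split())
print(s1.shape)  # (8, 512): eight words, 512 elements each
# Before any attention layer, "jam" has the same vector in both sentences:
print(np.array_equal(s1[7], s2[6]))  # True
```

The equality check makes the key point concrete: at the input-embedding stage, “jam” is identical in both sentences; only the later self-attention layers differentiate it.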
As I have already mentioned, the word representation vector typically contains a large number of elements; for instance, the original paper refers to 512 elements. However, if we consider a highly simplified two-dimensional representation of word embeddings, we can visualize words as points on a plane. For example, suppose “traffic” is represented by the vector [4, 10], “jam” by [7, 7], and “strawberry” by [10, 4]. If “jam” is equidistant from “traffic” and “strawberry” in this representation, it could imply that, within our simplified model, “jam” shares some abstract relationship with both “traffic” and “strawberry.” It is important to remember that this is a significant oversimplification; actual word embeddings are multi-dimensional and capture a complex network of relationships that two dimensions alone cannot fully represent.
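This toy geometry is easy to verify in code, using the hypothetical two-dimensional vectors from the example above:

```python
import numpy as np

# The toy two-dimensional embeddings from the text (hypothetical values).
traffic = np.array([4.0, 10.0])
jam = np.array([7.0, 7.0])
strawberry = np.array([10.0, 4.0])

# Euclidean distance from "jam" to each neighbour.
d_traffic = np.linalg.norm(jam - traffic)
d_strawberry = np.linalg.norm(jam - strawberry)
print(d_traffic, d_strawberry)  # both sqrt(18) ≈ 4.24: "jam" is equidistant
```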
To apply the attention mechanism, a score is calculated for each word in relation to every other word in the sequence. This score determines how much focus should be placed on other parts of the input when encoding a particular word. In our simplified two-dimensional space, imagine calculating the influence scores between “jam” and “traffic,” and “jam” and “strawberry.” The attention mechanism would assign a higher score to “traffic” when “jam” appears in a context related to congestion, and a higher score to “strawberry” when “jam” is used in a culinary context.
The result is a set of output vectors that are deeply contextualized, reflecting not just the standalone meanings of words, but also how the surrounding words influence those meanings in the sentence. In our simplified model, the vector representing “jam” would shift closer to “traffic” in the context of a “traffic jam,” and closer to “strawberry” in the context of “strawberry jam.” But remember, this is a simplified two-dimensional scenario to understand the idea behind the attention mechanism better. In the real high-dimensional space scenario, the concept of ‘closeness’ is much more abstract than physical proximity on a two-dimensional plane.
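This shift can be sketched with the toy two-dimensional vectors from above and plain dot-product attention. Note the simplifying assumptions: real transformers use learned query/key/value projections and hundreds of dimensions, while here each word attends using its raw embedding.

```python
import numpy as np

# Toy 2-D embeddings from the text (hypothetical values).
E = {"traffic": np.array([4.0, 10.0]),
     "jam": np.array([7.0, 7.0]),
     "strawberry": np.array([10.0, 4.0])}

def self_attention(words):
    """Simplified dot-product self-attention (no learned projections)."""
    X = np.stack([E[w] for w in words])
    scores = X @ X.T / np.sqrt(X.shape[1])          # pairwise similarity scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over each row
    return w @ X                                    # context-weighted vectors

jam_in_traffic = self_attention(["traffic", "jam"])[1]
jam_in_strawberry = self_attention(["strawberry", "jam"])[1]
# "jam" drifts toward whichever neighbour is present in the sentence:
print(jam_in_traffic)      # pulled toward [4, 10] ("traffic")
print(jam_in_strawberry)   # pulled toward [10, 4] ("strawberry")
```

Even in this stripped-down sketch, the output vector for “jam” is a context-weighted mix, so it lands closer to “traffic” in one sentence and closer to “strawberry” in the other.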
One important point: because transformers do not process text sequentially, they need another way to understand the order of words in a sentence. This is achieved through positional encodings, which are added to the embeddings to provide information about the position of each word in the sentence.
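The sinusoidal positional encoding proposed in the original paper can be sketched as follows (the output matrix is simply added element-wise to the word embeddings):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # word positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]     # index of each sin/cos dimension pair
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions use cosine
    return pe

pe = positional_encoding(8, 512)   # one encoding per word of our 8-word sentence
print(pe.shape)                    # (8, 512)
```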
The Transformer architecture can process and analyze entire text blocks in a single operation, because it attends to all words of a sequence in parallel rather than one at a time; depth comes from stacking many such layers (e.g., the largest GPT-3 model has 96 layers). This parallel processing capability speeds up the learning process and enables the model to capture more complex patterns and relationships.
“you just can’t differentiate between a robot and the very best of humans.”
― Isaac Asimov, I, Robot
The Magic of Understanding Words
At this point, you understand the self-attention mechanism used in the Transformer architecture. It is important to note that the output of self-attention is not merely a way to represent each word in a sentence with context; it is much more. The output consists of word embeddings that carry the semantic meaning of each word within the context and language of the sentence. They transform words into a format the model can process, understand, and relate to one another within the specific context of the sentence.
When computers start to understand words and can represent them mathematically, we can start to establish a mathematical relation between words. For example, we can group “banana,” “apple,” and “orange” as fruits, or see that “King” is to “man” as “Queen” is to “woman.” It also helps guess the next word in a sequence or equivalent sentence in another language. All this comes down to figuring out the right math equation that shows these links. As explored in previous chapters, neural networks are essentially complex equations. Their structure and functionality are defined by the training data they are exposed to.
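The “King is to man as Queen is to woman” relation can be illustrated as vector arithmetic. The embeddings below are hypothetical toy values constructed so the analogy holds; real embeddings learn such regularities from training data.

```python
import numpy as np

# Hypothetical toy embeddings, constructed so the classic analogy holds.
E = {"king":  np.array([0.9, 0.8, 0.1]),
     "man":   np.array([0.5, 0.1, 0.1]),
     "woman": np.array([0.5, 0.1, 0.9]),
     "queen": np.array([0.9, 0.8, 0.9])}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "king" - "man" + "woman" should land nearest "queen".
target = E["king"] - E["man"] + E["woman"]
best = max(E, key=lambda w: cosine(E[w], target))
print(best)  # queen
```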
Encoder vs. Decoder
The Transformer model architecture consists of two main parts: the encoder and the decoder. The first part (the encoder) reads and processes the input text, while the second part (the decoder) uses the output of the first part to generate the output of the model. Each of these parts comprises multiple layers containing self-attention and feedforward neural networks.
In the first step, the encoder takes the input sentence and applies a self-attention mechanism to understand the relevance and relationship of each word to the others. Then, a feedforward neural network (FFN) is applied to each word separately, allowing the model to focus on specific aspects or features relevant to the task at hand, such as language understanding for translation or question answering.
So, the encoder uses a combination of self-attention to understand the context and relationships between words in the input, and a feedforward neural network to refine and adapt these representations for specific tasks or objectives.
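Under those assumptions, one encoder layer can be sketched as self-attention followed by a position-wise feedforward network. This is a deliberately reduced sketch: the residual connections, layer normalization, and learned attention projections of the real architecture are omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def encoder_layer(X, W1, b1, W2, b2):
    """One simplified encoder layer: self-attention, then a per-word FFN."""
    scores = X @ X.T / np.sqrt(X.shape[1])          # pairwise attention scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over each row
    A = w @ X                                       # contextualised word vectors
    return relu(A @ W1 + b1) @ W2 + b2              # FFN applied to each word

# Shape check with eight random 512-dim word vectors and random FFN weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 512))
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
print(encoder_layer(X, W1, b1, W2, b2).shape)  # (8, 512)
```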
The decoder works iteratively, combining the encoder’s output, attention mechanisms (including attention over the encoder output), and feedforward neural networks that depend on the task at hand, to generate the output text.
Let’s explain with a translation example. Imagine you want to translate the sentence “The sky is blue” into the Portuguese sentence “O céu é azul.” The decoder starts with a start-of-sequence token (let’s say “<start>”). The decoder uses the “<start>” token and the output from the encoder to generate the first word of the translation. Let’s assume it correctly predicts “O.” Now, the decoder takes both the “<start>” token and “O” as input to predict the next word. Let’s say it correctly predicts “céu.” The process continues with the decoder taking “<start> O céu” as input for the next word, and so on, until the sentence is fully translated and an end-of-sequence token is generated.
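This iterative loop can be written schematically. Here `encoder` and `decoder_step` are hypothetical stand-ins for the trained model components; the stubs at the bottom just mimic the translation from the example.

```python
def translate(source_tokens, encoder, decoder_step, max_len=50):
    """Greedy autoregressive decoding: feed all words generated so far
    back into the decoder until an end-of-sequence token appears."""
    memory = encoder(source_tokens)               # encoder output, computed once
    output = ["<start>"]
    for _ in range(max_len):
        next_word = decoder_step(output, memory)  # predict from words so far
        if next_word == "<end>":
            break
        output.append(next_word)
    return output[1:]                             # drop the start-of-sequence token

# Stubs that mimic a model translating "The sky is blue" -> "O céu é azul".
target = ["O", "céu", "é", "azul"]
enc = lambda toks: toks
step = lambda out, mem: target[len(out) - 1] if len(out) <= len(target) else "<end>"
print(translate("The sky is blue".split(), enc, step))  # ['O', 'céu', 'é', 'azul']
```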
A Family of Families
I hope that you now have a good understanding of how Transformers work. You’re likely familiar with some of the most popular applications of Transformer technology, such as ChatGPT. However, the world of Transformers is vast and diverse, with several families of models, each with unique characteristics and applications. Let’s explore some of these families:
- GPT (Generative Pre-trained Transformer): Developed by OpenAI, the GPT series (including GPT-3 and the newer GPT-4) is known for its ability to generate human-like text.
- LLaMA (Large Language Model Meta AI): A family of models released by Meta AI, designed to perform well with less computational power, making it more accessible for various applications.
- BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT is designed to understand the context of a word in a sentence by looking at the words that come before and after it.
- LaMDA (Language Model for Dialogue Applications): Created by Google, LaMDA is designed specifically for conversational applications, focusing on producing more sensible and specific responses in a dialogue.
- Chinchilla: A large language model developed by DeepMind that emphasizes the importance of training data size and quality.
- RoBERTa (A Robustly Optimized BERT Pretraining Approach): Developed by Facebook AI, RoBERTa builds upon BERT’s architecture but is optimized with more training data and a longer training process.
Each of these Transformer families has contributed significantly to the field of natural language processing, offering unique approaches and solutions to complex language tasks. As technology continues to evolve, we can expect to see even more innovative applications and developments in this exciting area.
Finalizing …
As we conclude this series on the ABC of Deep Learning, it’s important to recognize that the journey of understanding and utilizing deep learning, especially transformers, is just beginning. The rapid evolution in this field is not just a testament to technological advancement but also to the endless possibilities that these models unlock.
Despite their impressive capabilities, transformers are not without challenges. One significant issue is the computational resources required for training and running these large models. This raises concerns about environmental impact and accessibility for researchers and organizations with limited resources. However, this challenge also opens doors to innovation in developing more efficient models, as seen with LLaMA and other similar initiatives.
Another critical aspect is the ethical use of these technologies. As language models become more advanced, issues like bias in training data, misuse of generated content, and privacy concerns become increasingly important to address. Ensuring responsible and ethical use of these technologies is paramount as we move forward.
The future of transformers and deep learning as a whole is incredibly promising. We are likely to see continued improvements in model efficiency, effectiveness, and versatility. The integration of these models into various sectors will further transform how we interact with technology, making it more intuitive and aligned with our natural communication styles.
Thank you for joining me on the ABC of Deep Learning. I hope these articles have given you a solid foundation and inspire you to delve deeper into the world of deep learning.