In deep learning, especially in natural language processing (NLP) and image processing, three prevalent architectures dominate the discussion: encoder-decoder, encoder-only, and decoder-only models. These architectures are the backbone of applications such as machine translation, text summarisation, and image processing, and almost all new models are transformers now. This article aims to clarify the practical benefits and limitations of the encoder-decoder architecture: to understand when it is the solution of choice and when its use should be avoided.
The encoder-decoder architecture is designed to handle sequence-to-sequence tasks, where the input and output are sequences that can vary in length. This architecture consists of two main components:
- Encoder: The encoder processes the input sequence, transforming it into a fixed-size context vector that represents the input sequence’s information. It essentially compresses the input into a latent-space representation.
- Decoder: The decoder takes the context vector as input and generates the output sequence. It decodes the information contained in the context vector step by step to produce the output.
The encoder computes the context vector as C = f(x1, x2, …, xn), where f is a function implemented by the encoder (e.g., a series of RNN, LSTM, or Transformer blocks) and x1, …, xn are the input tokens.
The decoder then generates the output sequence Y from C, typically step by step: at each step t it produces yt = g(C, y1, y2, …, yt−1), where g is a function implemented by the decoder and y1, y2, …, yt−1 are the previously generated outputs.
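To make this interface concrete, here is a minimal sketch of an encoder-decoder in PyTorch. The vocabulary size, dimensions, and class name are illustrative placeholders rather than a recommended configuration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch (toy sizes, no positional encodings).
# The encoder maps the source sequence X to a latent representation C;
# the decoder generates Y conditioned on C and the previous outputs.
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # C = f(x1, ..., xn): encode the source sequence once
        memory = self.transformer.encoder(self.embed(src_ids))
        # yt = g(C, y1, ..., yt-1): decode with a causal mask
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.transformer.decoder(self.embed(tgt_ids), memory, tgt_mask=causal_mask)
        return self.lm_head(out)

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (1, 7)), torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # (1, 5, 1000): one distribution over the vocabulary per target position
```

During training the decoder sees the shifted target sequence (teacher forcing); at inference time it is called autoregressively, feeding its own predictions back as input.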
Example: Machine Translation
In a machine translation task, the encoder processes a sentence in the source language and compresses it into a context vector. The decoder then uses this vector to generate the translated sentence in the target language, one word at a time. A more sophisticated “translation” example comes from chemistry, where one chemical notation has to be converted into another (e.g., an image of a molecular structure into its IUPAC name). An encoder-decoder fits this task well: the encoder learns to encode the image, and the decoder generates the output sequence (the IUPAC name) from that encoding.
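For a working end-to-end illustration, a pre-trained encoder-decoder translation model can be run in a few lines with the Hugging Face transformers library; the checkpoint below (Helsinki-NLP/opus-mt-en-de) is just one convenient example of such a model.

```python
from transformers import MarianMTModel, MarianTokenizer

# Example encoder-decoder translation: English -> German.
# The checkpoint name is an example; any MarianMT language pair works the same way.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# The encoder reads the source sentence; generate() then runs the decoder
# autoregressively, producing one target token at a time.
batch = tokenizer(["The weather is nice today."], return_tensors="pt")
generated_ids = model.generate(**batch, max_new_tokens=40)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```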
This section presents a collection of research papers highlighting the potential of encoder-decoder models across a variety of tasks.
This paper presents an encoder-decoder model for generating sentence embeddings that can be effectively used in zero-shot cross-lingual transfer tasks, showcasing the encoder-decoder’s versatility in handling multiple languages.
“The authors introduce an innovative architecture designed to learn joint multilingual sentence representations across 93 languages from over 30 different families and written in 28 scripts. Their system employs a single BiLSTM encoder that utilizes a shared byte-pair encoding vocabulary for all languages. This encoder is augmented with an auxiliary decoder and trained on publicly available parallel corpora. This training approach allows for the learning of a classifier using only English annotated data, which can then be transferred to any of the 93 languages without modifications. The authors’ experiments across various datasets — cross-lingual natural language inference (XNLI), cross-lingual document classification (MLDoc), and parallel corpus mining (BUCC) — demonstrate the effectiveness of their method. Additionally, they introduce a new test set comprising aligned sentences in 112 languages, showcasing the strong performance of their sentence embeddings in multilingual similarity search tasks, even for languages considered low-resource. This work marks the first successful attempt at developing general-purpose, massively multilingual sentence representations, highlighting its potential in facilitating cross-lingual understanding and applications.”
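To illustrate the zero-shot transfer recipe described above, the sketch below trains a classifier on English sentence embeddings only and applies it unchanged to another language. Here embed_sentences is a hypothetical placeholder for any multilingual encoder of the kind the paper describes, and the tiny dataset is invented purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_sentences(sentences, lang):
    """Hypothetical placeholder: in practice this would call a multilingual
    encoder (such as the BiLSTM encoder from the paper) that maps sentences
    from any language into one shared embedding space. Random vectors are
    returned here only so the sketch runs."""
    rng = np.random.default_rng(abs(hash(lang)) % 2**32)
    return rng.normal(size=(len(sentences), 1024))

# 1. Train a task classifier on English annotated data only (toy labels).
en_sentences = ["great movie", "terrible plot"]
en_labels = [1, 0]
clf = LogisticRegression().fit(embed_sentences(en_sentences, "en"), en_labels)

# 2. Apply it unchanged to another language: because both languages share
#    the same embedding space, no target-language labels are needed.
de_sentences = ["großartiger Film"]
print(clf.predict(embed_sentences(de_sentences, "de")))
```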
This work introduces DETR, an encoder-decoder model that uses transformers for object detection, demonstrating the applicability of encoder-decoder models beyond NLP to computer vision tasks.
“The authors introduced DETR (DEtection TRansformer), a novel design for object detection systems that leverages transformers and a bipartite matching loss for direct set prediction. This approach achieved results comparable to those of an optimised Faster R-CNN baseline on the COCO dataset, a benchmark for object detection tasks. DETR is noted for its straightforward implementation and flexible architecture, which can be easily extended to tasks such as panoptic segmentation, where it also delivers competitive results. A notable advantage of DETR is its significantly improved performance in detecting large objects, attributed to the global information processing capabilities of self-attention mechanisms inherent in transformers. However, DETR brings new challenges, particularly in training, optimisation, and performance on small objects. These are areas where current detectors have evolved over several years to address similar challenges. The authors express optimism that future work will successfully overcome these challenges for DETR, indicating a promising direction for further advancements in object detection technology.”
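For reference, DETR checkpoints are available through the Hugging Face transformers library; the sketch below shows one plausible way to run inference (the checkpoint name, image path, and confidence threshold are examples, not the paper’s exact evaluation setup).

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Example checkpoint; other DETR variants from the hub work similarly.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street_scene.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # CNN backbone + transformer encoder-decoder set prediction

# Convert the set predictions into boxes/labels above a confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```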
T5 (Text-to-Text Transfer Transformer) uses an encoder-decoder architecture for a wide range of text-based tasks, illustrating the flexibility of this model structure in adapting to different NLP challenges.
“In their comprehensive study, the authors explore a text-to-text framework that trains a single model across various text tasks, achieving comparable or state-of-the-art results without task-specific architectures. They found the original encoder-decoder Transformer architecture to be most effective within this framework, even when parameter sharing reduced the total count by half. Their introduction of the “Colossal Clean Crawled Corpus” (C4) facilitates improved performance in language understanding tasks across a vast dataset.
Key findings include the efficiency of denoising objectives for pre-training, the advantage of updating all model parameters during fine-tuning, and the significant benefits of scaling up model size and employing model ensembles for enhanced performance. Despite exploring multi-task learning and various scaling strategies, they highlight the straightforward approach of unsupervised pre-training followed by supervised fine-tuning as the most effective.
Looking ahead, the paper emphasizes the potential of smaller, more efficient models for low-resource applications, the exploration of more effective knowledge extraction methods, and the development of language-agnostic models to extend the reach of NLP tasks across languages. The release of their code, dataset, and model weights encourages further research in these directions, contributing to the field’s advancement towards general language understanding.”
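As a quick illustration of the text-to-text framing, the snippet below runs a public T5 checkpoint with a task prefix; the prefix and checkpoint follow common usage and are not the paper’s exact experimental setup.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

# t5-small is used here only to keep the example lightweight.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as text-to-text: a task prefix tells the model what
# to do, and the answer is generated as plain text by the decoder.
prompt = "translate English to German: The book is on the table."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```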
This paper uses encoder-decoder models for voice conversion tasks, leveraging the transformer architecture’s strengths in handling sequential data.
“The study introduces a pioneering approach to voice conversion (VC) by leveraging transfer learning from a text-to-speech (TTS) synthesis system, termed TTS-VC transfer learning (TTL-VC). A multi-speaker TTS system with a sequence-to-sequence encoder-decoder architecture is developed. The encoder processes input text to extract linguistic representations, while the decoder, informed by a target speaker embedding, utilizes context vectors and the output of an attention recurrent network cell to produce target acoustic features. This TTS system’s ability to map input text to speaker-independent context vectors supervises the training of latent representations in a voice conversion system, where the encoder inputs speech and the decoder mirrors the TTS decoder’s functionality. By conditioning the decoder on a speaker embedding, the system enables training on non-parallel data for any-to-any voice conversion. At runtime, the VC network operates independently of text input. Experimental results indicate that the TTL-VC system surpasses existing voice conversion baselines, including phonetic posteriorgram and AutoVC, in speech quality, naturalness, and speaker similarity.”
This paper extends the transformer model to graph-structured data, using an encoder-decoder framework to handle graph representation learning, which is pivotal for bioinformatics, social network analysis, and other tasks.
“The survey presents an in-depth examination of Graph Transformer models, focusing on their architectural designs. The authors analyze existing models and identify three primary methods for integrating graph information into the standard Transformer architecture: utilizing Graph Neural Networks (GNNs) as auxiliary modules, enhancing positional embeddings with graph-derived information, and refining the attention matrix using graph characteristics. They implement key components from each of these categories and evaluate them across multiple well-known graph data benchmarks. This rigorous testing aims to quantify the actual performance improvements attributed to each graph-specific module. The findings affirm the value of incorporating graph-specific modules into Transformer models, highlighting their effectiveness in various graph-related tasks. This comprehensive review sheds light on how different approaches to integrating graph information can optimize Transformer models for a wide range of applications in graph data analysis.”
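As one concrete instance of the second category (graph-derived positional embeddings), the sketch below computes Laplacian eigenvector positional encodings for a toy graph. The graph, the number of eigenvectors kept, and the way they are combined with node features are illustrative assumptions rather than details of any single surveyed model.

```python
import numpy as np

def laplacian_positional_encoding(adj, k=2):
    """Return k Laplacian eigenvector positional encodings for each node."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                        # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap) # eigenvectors sorted by eigenvalue
    # Skip the trivial constant eigenvector, keep the next k as coordinates.
    return eigvecs[:, 1:k + 1]

# Toy 4-node graph: a triangle (0-1-2) with node 3 attached to node 2.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
pos_enc = laplacian_positional_encoding(adj, k=2)
# These encodings would be added to (or concatenated with) node feature
# embeddings before feeding the nodes as tokens into a standard Transformer.
print(pos_enc.shape)  # (4, 2)
```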
Cognitive approximations for complex inflectional systems
This study investigates the challenges that encoder-decoder (ED) architectures face in modeling the German number inflection system, particularly regarding plural suffixes. Contrary to some claims, the experimental speaker data suggests that the suffix /-s/ is not the sole ‘default’ plural marker for phonologically unfamiliar words; the suffix /-(e)n/, which is more frequent, exhibits similar trends. Despite this, the German plural system poses difficulties for ED models, especially in accurately predicting the distribution of the /-s/ suffix for existing German nouns.
When applied to novel nouns, the neural model tended to generalize using the contextually most frequent plural marker /-e/, leading to predictions that were less variable than actual speaker productions. These predictions also varied between phonologically typical (Rhymes) and atypical (Non-Rhymes) words, indicating a discrepancy in how the model handles different types of input compared to human speakers.
The findings suggest that, irrespective of the debate over the ‘minority-default’ status of certain plural markers, ED models may not serve as effective cognitive approximations for complex inflectional systems like German number, where no single class dominates. This highlights the need for further research to enhance the capability of neural models to more accurately reflect the nuances of human language processing and inflectional variability.
High inference latency
The study addresses the issue of high inference latency in encoder-decoder models caused by autoregressive decoding, where the generation of each output token is conditioned on previously generated tokens, necessitating sequential token generation and repeated feedforward processes in the decoder. This mechanism significantly contributes to inference latency, predominantly within the decoder component.
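To see where this latency comes from, the sketch below spells out a plain greedy decoding loop for a small encoder-decoder checkpoint: the encoder runs once, but the decoder is invoked again for every generated token. The checkpoint, prompt, and length limit are arbitrary, and KV caching and other speed-ups are omitted to keep the mechanism visible.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

enc = tokenizer("translate English to German: How are you?", return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.encoder(**enc)                # one encoder pass
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(20):                                   # one decoder pass per token
        out = model(encoder_outputs=encoder_outputs,
                    attention_mask=enc.attention_mask,
                    decoder_input_ids=decoder_input_ids)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```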
Observations from the research indicate that even with a single decoder layer, encoder-decoder models can achieve reasonable prediction accuracy, suggesting that the extensive computations in deeper decoder layers may not always be necessary for correct predictions. Motivated by this insight, the authors propose a novel approach named Dynamic Early Exit on Decoder (DEED), aimed at enhancing inference speed without compromising accuracy. DEED introduces a multi-exit architecture that dynamically determines the optimal point for exiting at a specific decoder layer during each decoding step, based on confidence levels in the prediction.
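The following toy sketch conveys the core idea of a multi-exit decoder with confidence-based exiting; the layers, heads, and threshold are invented stand-ins for illustration and do not reproduce DEED’s actual architecture or exit criterion.

```python
import torch
import torch.nn as nn

# Toy multi-exit decoder step: a prediction head is attached to every layer,
# and decoding for this step stops as soon as one head is confident enough.
d_model, vocab_size, num_layers = 64, 100, 6
decoder_layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_layers))
exit_heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(num_layers))

def decode_step_with_early_exit(hidden, threshold=0.5):
    token = None
    for i, (layer, head) in enumerate(zip(decoder_layers, exit_heads), start=1):
        hidden = torch.relu(layer(hidden))
        probs = torch.softmax(head(hidden), dim=-1)
        confidence, token = probs.max(dim=-1)
        # If this layer's exit head is confident enough, skip the deeper layers.
        if confidence.item() >= threshold:
            print(f"exited after layer {i} (confidence {confidence.item():.2f})")
            return token
    print(f"used all {num_layers} layers (confidence {confidence.item():.2f})")
    return token

# With untrained weights the exit rarely triggers; in a trained model,
# "easy" decoding steps exit early and save most of the decoder compute.
next_token = decode_step_with_early_exit(torch.randn(1, d_model))
```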
Generalization to other domains
The problem addressed in this study is the challenge of domain shift, a common issue in Deep Neural Networks (DNNs) where the test dataset has a different distribution from the training dataset, leading to decreased model performance. This phenomenon occurs because traditional DNN models are typically trained on a specific domain or dataset and tend to learn features specific to that domain, which may not generalize well to new, unseen domains.
To tackle the problem of domain shift and enhance the generalization ability of DNNs across diverse domains, the authors introduce a novel approach centered on the explicit removal of domain-specific features from the training data. The proposed framework, named Learning and Removing Domain-specific features for Generalization (LRDG), aims to create a domain-invariant model capable of performing accurately on unseen domains. LRDG strategically identifies and eliminates features that are specific to the source domains from the input images. This is achieved through a two-step process involving a specially designed classifier to identify domain-specific features and an encoder-decoder network that transforms the input images into a new image space, effectively stripping out these identified features. The transformed images are then used to train another classifier focused on learning domain-invariant features, enabling the model to classify images without being biased by the specific characteristics of the training domain.
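For intuition, the sketch below shows the kind of convolutional encoder-decoder image-to-image network such a pipeline relies on: it maps an input image to a latent representation and reconstructs an image of the same size in a new image space. Channel counts, depths, and the class name are illustrative assumptions, not the LRDG paper’s exact configuration.

```python
import torch
import torch.nn as nn

# Minimal convolutional encoder-decoder that maps an image to a new image of
# the same size. In a domain-generalization pipeline like the one described
# above, such a network would be trained (together with the domain-specific
# classifier) so that its output no longer carries domain-specific features.
class ImageTranslator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # H/2
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # H/4
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),  # H/2
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),   # H
            nn.Sigmoid(),  # keep pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ImageTranslator()
out = model(torch.rand(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 3, 224, 224])
```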
The significance of this problem lies in its widespread implications for the application of DNNs in real-world scenarios, where data from different domains often exhibit substantial variations. By addressing the issue of domain shift through the removal of domain-specific features, the LRDG framework offers a promising solution to enhance the robustness and applicability of DNNs across a variety of domains, as supported by the superior performance of the LRDG framework in extensive experiments compared to state-of-the-art methods.
In conclusion, encoder-decoder architectures are more universal and flexible than their decoder-only counterparts. They can be applied to a wide range of text-to-text tasks and even to multi-modal setups, from text and images to graphs and chemical formats. The encoder is usually a bidirectional transformer trained on discriminative tasks, which is very beneficial for information-extraction tasks that require deeper contextual understanding. The decoder can be a much smaller network, which makes encoder-decoder models more flexible and efficient. However, these architectures are more complex, and we still lack a proper understanding of the best choice of pre-training task for encoder-decoders. Moreover, the rapid progress of decoder-only models has left encoder-decoders somewhat stuck in time: there are no pre-trained encoder-decoder models with newer attention mechanisms such as FlashAttention, and quantization support is very limited. Still, we believe this article has shown enough evidence to make encoder-decoders more interesting and useful to the community, and we will keep pushing their development with our own resources.