![](https://crypto4nerd.com/wp-content/uploads/2023/06/19WHtYXGGIY_jI1k18JbhGw-1024x700.png)
Note: The full video can be found here. It is the ICASSP 2022 tutorial on “Transformer Architectures for Multimodal Signal Processing and Decision Making”, given by two instructors: Chen Sun and Boqing Gong.
The tutorial aims to provide the audience with knowledge of Transformer neural architectures and the related learning algorithms.
Transformer architectures have become the preferred models for natural language processing (NLP). In computer vision, there has been a recent surge of interest in end-to-end Transformers, driving efforts to replace hand-crafted feature engineering and inductive biases with general-purpose neural architectures trained on data. Transformer architectures have also achieved state-of-the-art performance in areas as diverse as multimodal learning, protein structure prediction, and decision making.
These results demonstrate the Transformer architectures’ significant potential beyond the mentioned domains, particularly in the signal processing (SP) community.
Next, we introduce several multimodal models related to signal processing that operate across modalities.
In recent research, “Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization” [1], Whisper has been prompted to perform audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) tasks using unseen language pairs.
In this work, they study a broader variation of audio-visual speech recognition (AVSR), the task of recognizing speech audio while simultaneously considering the accompanying video of the speaker’s facial or lip movements.
To provide Whisper with a visually-conditioned prompt, they employ the vision-and-language CLIP [2] model as the image encoder, together with an external vocabulary of common object words. This allows them to convert the visual stream into a sequence of word tokens. By constructing sentences using the template “This is a photo of a { }” for each word/phrase in the external vocabulary, they pre-compute embedding vectors with the CLIP text encoder in an offline manner.
During inference, they sample three equally-spaced RGB image frames from each video and use the CLIP image encoder to embed them. They calculate the similarity between the image embeddings and the pre-computed text embeddings. Based on the highest similarity scores, they select the top K objects whose embeddings correspond to the image prompt. These selected object names are concatenated into a comma-separated list of words, which is inserted into the previous text slot of the prompt.
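As a rough illustration of this prompt-construction pipeline (not the authors' code), the sketch below uses the Hugging Face CLIP checkpoint openai/clip-vit-base-patch32, a toy four-word vocabulary, dummy frames, and K = 2, and averages the frame-text similarities over the three frames; all of these choices are assumptions made for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Toy external vocabulary and top-K value; the real vocabulary in the paper is
# much larger, and these values are placeholders for illustration only.
vocab = ["dog", "guitar", "kitchen", "car"]
K = 2

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Offline step: embed "This is a photo of a { }" sentences with the CLIP text encoder.
templates = [f"This is a photo of a {w}" for w in vocab]
text_inputs = processor(text=templates, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Inference step: embed three (here dummy) equally-spaced frames with the CLIP image encoder.
frames = [Image.new("RGB", (224, 224)) for _ in range(3)]  # stand-ins for real video frames
image_inputs = processor(images=frames, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

# Pick the top-K vocabulary words by image-text similarity (averaged over the frames)
# and join them into the comma-separated visual prompt.
sim = (img_emb @ text_emb.T).mean(dim=0)
top_words = [vocab[i] for i in sim.topk(K).indices.tolist()]
visual_prompt = ", ".join(top_words)
print(visual_prompt)  # e.g. "guitar, kitchen"; inserted into Whisper's previous-text slot
```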
In addition, they found interesting properties of Whisper: in AVSR, the model is very robust to the length and noisiness of the visual prompt, and the effectiveness of the visual prompt differs considerably between the English and multilingual models.
In “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation” [3], they present two contributions, one from the model perspective and one from the data perspective.
First, a new model architecture called MED (Multimodal Mixture of Encoder-Decoder) is introduced as a means of achieving efficient multi-task pre-training and adaptable transfer learning. MED offers the flexibility to function as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. It encompasses three vision-language objectives: image-text contrastive (ITC) learning, image-text matching (ITM), and image-conditioned language modeling (LM).
(Note: ITC, ITM, and LM were briefly introduced in [LINK].)
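As a quick reminder of what these three objectives compute, here is a minimal, self-contained sketch; the helper name, the temperature value, and the assumption that the feature and logit tensors come from the corresponding encoder and decoder branches are all illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def blip_style_losses(img_feat, txt_feat, itm_logits, itm_labels, lm_logits, lm_labels,
                      temperature=0.07):
    """Hypothetical helper combining the three objectives; inputs are assumed to come
    from the unimodal encoders (img_feat, txt_feat), the image-grounded text encoder
    (itm_logits), and the image-grounded text decoder (lm_logits)."""
    # ITC: symmetric InfoNCE loss over in-batch image-text pairs.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # ITM: binary matched / not-matched classification on fused image-text features.
    itm = F.cross_entropy(itm_logits, itm_labels)

    # LM: autoregressive captioning loss from the image-grounded text decoder.
    lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1),
                         ignore_index=-100)
    return itc + itm + lm
```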
Second, they propose a new dataset bootstrapping method for learning from noisy image-text pairs. The approach fine-tunes a pre-trained MED model into two distinct modules. The first module, called the captioner, generates synthetic captions for web images. The second module, known as the filter, removes noisy captions from both the original web texts and the synthetic texts.
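The bootstrapping loop itself can be summarized in a few lines; the function and callable names below are hypothetical stand-ins for the fine-tuned captioner and filter, not the paper's actual interfaces.

```python
# `captioner(image) -> str` and `filter_model(image, text) -> bool` are hypothetical
# callables standing in for the two fine-tuned MED modules.
def bootstrap_dataset(web_pairs, captioner, filter_model):
    """web_pairs: iterable of (image, web_text) pairs collected from the web."""
    cleaned = []
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)          # captioner: generate a synthetic caption
        for text in (web_text, synthetic_text):
            if filter_model(image, text):          # filter: keep only pairs judged as matching
                cleaned.append((image, text))
    return cleaned
```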
The results indicate that by utilizing bootstrapped captions, the captioner and the filter collaborate to significantly enhance performance in different downstream tasks.
In another work, “ClipCap: CLIP Prefix for Image Captioning” [4], they aim to give the GPT-2 model the ability to comprehend images and produce image captions, similar in spirit to the way Whisper is prompted with visual information above. In detail, they use CLIP as the image encoder to extract image features, exploiting the fact that CLIP embeds images and text in a shared representation space. The extracted image features are fed into a trainable mapping network that generates prefix embeddings, which are then prepended to the input and fed into the language model.
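A minimal sketch of this prefix mechanism, assuming an MLP mapping network, a CLIP feature size of 512, GPT-2's hidden size of 768, and a prefix length of 10 (the paper's exact configuration may differ):

```python
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    """Maps a CLIP image embedding to `prefix_len` pseudo-token embeddings for GPT-2."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len, gpt_dim * prefix_len),
        )

    def forward(self, clip_features):                 # clip_features: (batch, clip_dim)
        prefix = self.mlp(clip_features)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

mapper = ClipPrefixMapper()
prefix = mapper(torch.randn(1, 512))                  # (1, 10, 768) prefix embeddings
# These prefix embeddings are prepended to the caption token embeddings and passed to
# GPT-2 via `inputs_embeds`; the mapping network is the main trainable component.
```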
In “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers” [5], they utilize speech discretization to bridge the gap between speech and text modalities. They train a neural codec language model, called VALL-E, using discrete codes obtained from a neural audio codec model. Additionally, they treat text-to-speech (TTS) as a conditional language modeling task, departing from previous approaches that employed continuous signal regression.
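To make the discretization step concrete, the sketch below uses the EnCodec codec shipped with Hugging Face Transformers as a stand-in for the codec model, encodes a placeholder waveform into discrete codes, and indicates the conditional-LM framing only schematically in comments; the checkpoint and the waveform are assumptions made for illustration.

```python
import torch
from transformers import AutoProcessor, EncodecModel

# EnCodec here is only a stand-in for "a neural audio codec"; the checkpoint name,
# the 1-second silent waveform, and the 24 kHz rate are placeholder assumptions.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform = torch.zeros(24000).numpy()                     # 1 s of "audio" at 24 kHz
inputs = processor(raw_audio=waveform, sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    enc = model.encode(inputs["input_values"], inputs.get("padding_mask"))

codes = enc.audio_codes                                   # discrete codebook indices
print(codes.shape)

# Conditional-LM framing (schematic): training sequences look like
#   [text/phoneme tokens] + [discrete acoustic tokens],
# and a decoder-only Transformer is trained to predict the acoustic tokens, which a
# codec decoder later turns back into a waveform.
```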
In “SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities” [6], they introduce a large language model with intrinsic cross-modal conversational capabilities, capable of perceiving and generating multimodal content. By utilizing a self-supervised trained speech model, they perform speech discretization to bridge the modality gap between speech and text. The discrete speech tokens are subsequently expanded into the language model’s vocabulary, thereby empowering the model with the inherent ability to perceive and generate speech.
This work presents a multi-modal large language model capable of perceiving and generating multi-modal content. SpeechGPT is presented as the first spoken dialogue LLM, showcasing its proficiency in understanding human instructions and engaging in spoken dialogue. Furthermore, they demonstrate the potential of integrating additional modalities into LLMs using discrete representations.
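The vocabulary-expansion idea can be illustrated as follows; GPT-2 as the base model, 1000 discrete units, and the <unit_i> token names are all assumptions made for this sketch rather than details of SpeechGPT.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 and the 1000 "<unit_i>" tokens are illustrative assumptions; the idea is just
# that discrete speech units become ordinary tokens in the LLM's vocabulary.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

unit_tokens = [f"<unit_{i}>" for i in range(1000)]   # one token per discrete speech unit
tokenizer.add_tokens(unit_tokens)
model.resize_token_embeddings(len(tokenizer))        # grow the embedding (and output) matrix

# A spoken utterance is then just another token sequence, e.g.:
example = "[Human]: <unit_12> <unit_873> <unit_5> [SpeechGPT]:"
input_ids = tokenizer(example, return_tensors="pt").input_ids
```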
[1] Peng, P., Yan, B., Watanabe, S., & Harwath, D. (2023). Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization. arXiv preprint arXiv:2305.11095.
[2] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.
[3] Li, J., Li, D., Xiong, C., & Hoi, S. (2022, June). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (pp. 12888–12900). PMLR.
[4] Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734.
[5] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., … & Wei, F. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
[6] Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., & Qiu, X. (2023). Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000.
[7] Sun, C., Myers, A., Vondrick, C., Murphy, K., & Schmid, C. (2019). Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7464–7473).
[8] Akbari, H., Yuan, L., Qian, R., Chuang, W. H., Chang, S. F., Cui, Y., & Gong, B. (2021). Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 24206–24221.
[9] Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., & Cao, Y. (2021). Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
[10] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., & Wu, Y. (2022). Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.