![](https://crypto4nerd.com/wp-content/uploads/2024/03/1y3xWkTXCEFu_L2JHJ-HVRQ.png)
In recent years, there has been a remarkable surge in the development of generative AI, showcasing the potential of models to create novel and imaginative content. While strides have been made in various domains, the realm of video generation stands as a promising frontier.
Recent advancements suggest that scaling up these models could further enhance their capabilities. However, there remains a significant gap between the interactive engagement offered by video generative models and the rich interactions facilitated by language tools like ChatGPT, not to mention more immersive experiences.
In response to this challenge, a research team from Google DeepMind and the University of British Columbia presents, in the new paper Genie: Generative Interactive Environments, the first generative interactive environment: a model that can generate a diverse array of action-controllable virtual worlds from text prompts, synthetic images, photographs, and even sketches.
At the heart of Genie is a set of design choices drawn from state-of-the-art video generation models, chief among them the spatiotemporal (ST) transformer. A novel video tokenizer converts raw frames into discrete tokens, while a causal latent action model extracts latent actions between frames. These latent actions, together with the video tokens, are fed into a dynamics model that predicts subsequent frames using MaskGIT-style masked token prediction.
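MaskGIT generates all of a frame's tokens in parallel over a few refinement passes rather than one token at a time. A minimal NumPy sketch of that iterative masked-decoding loop is below; the function name, the cosine masking schedule, and the toy `logits_fn` interface are illustrative assumptions for this sketch, not the paper's actual code.

```python
import numpy as np

def maskgit_decode(logits_fn, num_tokens, steps=8):
    """MaskGIT-style decoding: start fully masked, predict every masked
    position each step, commit only the most confident predictions, and
    re-mask the rest until all tokens are filled in."""
    MASK = -1
    tokens = np.full(num_tokens, MASK, dtype=int)
    for step in range(steps):
        masked = tokens == MASK
        if not masked.any():
            break
        logits = logits_fn(tokens)                       # (num_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        preds = probs.argmax(-1)                         # best token per slot
        conf = probs.max(-1)                             # its confidence
        # Cosine schedule: how many tokens stay masked after this step.
        keep_masked = int(num_tokens * np.cos(np.pi / 2 * (step + 1) / steps))
        # Rank masked positions by confidence; commit the top ones.
        conf = np.where(masked, conf, -np.inf)
        order = np.argsort(-conf)
        commit = order[: max(masked.sum() - keep_masked, 1)]
        tokens[commit] = preds[commit]
    return tokens
```

On the final step the schedule drops to zero masked tokens, so the loop always terminates with a fully decoded frame.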
Genie comprises three fundamental components: a latent action model, which infers the latent action between each pair of consecutive frames; a video tokenizer, which converts raw video frames into discrete tokens; and a dynamics model, which predicts forthcoming frames from the latent actions and past frame tokens.
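The interfaces of these three components can be sketched as follows. The class names, shapes, and random/placeholder outputs are assumptions for illustration; the real components are ST-transformers, stubbed out here purely to show how the pieces connect.

```python
import numpy as np

class VideoTokenizer:
    """Maps raw frames to discrete token ids (VQ-style codebook lookup)."""
    def __init__(self, codebook_size=1024, tokens_per_frame=16):
        self.codebook_size = codebook_size
        self.tokens_per_frame = tokens_per_frame

    def encode(self, frames):  # frames: (T, H, W, C) -> (T, N) token ids
        T = frames.shape[0]
        return np.random.default_rng(0).integers(
            0, self.codebook_size, (T, self.tokens_per_frame))

class LatentActionModel:
    """Infers a discrete latent action between each consecutive frame pair."""
    def __init__(self, num_actions=8):
        self.num_actions = num_actions

    def infer(self, frames):  # (T, H, W, C) -> (T-1,) action ids
        return np.random.default_rng(1).integers(
            0, self.num_actions, frames.shape[0] - 1)

class DynamicsModel:
    """Predicts the next frame's tokens from past tokens and latent actions."""
    def predict(self, tokens, actions):  # (T, N), (T-1,) -> (N,)
        return tokens[-1]  # placeholder: real model runs MaskGIT decoding
```

A rollout then alternates `DynamicsModel.predict` with a user-chosen (or inferred) latent action per step.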
The model is trained in two phases, following the standard autoregressive video generation pipeline. First, the team trains the video tokenizer, whose discrete tokens later feed the dynamics model. They then co-train the latent action model (trained directly from pixels) together with the dynamics model (trained on video tokens).
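The two-phase schedule above can be sketched as a training driver. The `Stub` module, its method names, and the step counts are hypothetical scaffolding to show the phase ordering, not the paper's training code.

```python
import numpy as np

class Stub:
    """Hypothetical stand-in for a trainable module: counts update calls."""
    def __init__(self):
        self.updates = 0
    def update(self, *args):
        self.updates += 1          # a real module would take a gradient step
    def encode(self, batch):
        return np.zeros((len(batch), 16), dtype=int)
    def infer(self, batch):
        return np.zeros(len(batch) - 1, dtype=int)

def train_genie(videos, tokenizer, lam, dynamics,
                phase1_steps=3, phase2_steps=3):
    # Phase 1: train the video tokenizer alone (frame reconstruction).
    for _ in range(phase1_steps):
        tokenizer.update(videos)
    # Phase 2: co-train the latent action model (from pixels) and the
    # dynamics model (on the tokenizer's discrete tokens).
    for _ in range(phase2_steps):
        actions = lam.infer(videos)        # latent actions between frame pairs
        tokens = tokenizer.encode(videos)  # discrete video tokens
        dynamics.update(tokens, actions)   # masked next-frame token prediction
        lam.update(videos)                 # pixel-level objective
```

Keeping the tokenizer's phase separate means the dynamics model trains on a stable token vocabulary in phase two.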
Empirical results show that Genie generates high-quality, controllable videos across diverse domains from video data alone. Moreover, the learned latent action space makes it possible to train agents that imitate behaviors from previously unseen videos, paving the way for versatile, generalist agents in the future.
The paper Genie: Generative Interactive Environments is on arXiv.