3 October 2023 | Research
Scaling GAIA-1: 9-billion parameter generative world model for autonomous driving
In June 2023, we unveiled GAIA-1 as the first proof of concept of a cutting-edge generative model for autonomous driving. We’ve spent the last few months optimising GAIA-1 to generate videos efficiently at higher resolution and improving the world model’s quality through larger-scale training. In this blog post, we are excited to release the technical report of GAIA-1 and the results of scaling GAIA-1 to over 9 billion parameters.
Montage of images from driving scenarios generated by GAIA-1
GAIA-1 is a cutting-edge generative world model built for autonomous driving. A world model learns representations of the environment and its future dynamics, providing a structured understanding of the surroundings that can be leveraged for making informed decisions when driving. Predicting future events is a fundamental and critical aspect of autonomous systems. Accurate future prediction enables autonomous vehicles to anticipate and plan their actions, enhancing safety and efficiency on the road. Incorporating world models into driving models could enable them to understand human decisions better and ultimately generalise to more real-world situations.
GAIA-1 is a model that leverages video, text and action inputs to generate realistic driving videos and offers fine-grained control over ego-vehicle behaviour and scene features. Due to its multi-modal nature, GAIA-1 can generate videos from many prompt modalities and combinations.
GAIA-1 can generate realistic videos of driving scenes with high levels of controllability. In the example below, we see a short video generated by GAIA-1 where the model generates night-time driving data with snow on the road.
Model architecture and training
The figure below depicts the general architecture of GAIA-1.
First, GAIA-1 encodes all inputs through specialised encoders for each modality (video, text, and action). These encoders project the diverse input sources into a shared representation. Text and video encoders discretise and embed the input, while scalars representing actions are independently projected to the shared representation. These encoded representations are temporally aligned to ensure they share a coherent timeline.
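As a rough illustration of the encoding step described above, the sketch below discretises a toy image feature against a small codebook, embeds the resulting token, and linearly projects a scalar action into the same width. The codebook, embedding table, dimensions, and function names are all hypothetical and chosen for readability; this is a sketch under those assumptions, not GAIA-1's implementation.

```python
import random

D_MODEL = 4  # hypothetical width of the shared representation

def quantise(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))

def embed_token(token_id, table):
    """Look up the embedding vector for a discrete token."""
    return table[token_id]

def project_action(value, weight, bias):
    """Linearly project one scalar action to the shared width."""
    return [w * value + b for w, b in zip(weight, bias)]

rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D codes
image_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in codebook]
action_w = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
action_b = [0.0] * D_MODEL

# One timestep: a single toy image feature plus one scalar action
image_feature = [0.9, 0.2]
token_id = quantise(image_feature, codebook)
image_emb = embed_token(token_id, image_table)
action_emb = project_action(3.5, action_w, action_b)

# Both modalities now share the same D_MODEL-dimensional space.
# The ordering within a timestep is arbitrary in this sketch.
timestep_sequence = [image_emb, action_emb]
```

In a real system the codebook, embeddings, and projection weights would be learned rather than randomly initialised.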
Following this alignment, the model’s core component, the world model, comes into play. The world model is an autoregressive transformer. This transformer predicts the next set of image tokens in the sequence. It achieves this by considering not only the past image tokens but also the contextual information provided by text and action tokens. This holistic approach allows the model to generate image tokens that are not only visually coherent but also aligned with the intended textual and action-based guidance. GAIA-1’s world model has 6.5 billion parameters and was trained for 15 days on 64 NVIDIA A100s.
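The next-token loop described above can be sketched in a few lines. Here `toy_logits` is a hypothetical stand-in for the transformer, and the vocabulary size and token values are invented for illustration; only the control flow (condition on text, action, and past image tokens, sample one token, append, repeat) reflects the text.

```python
import math
import random

VOCAB_SIZE = 8  # hypothetical image-token vocabulary size

def toy_logits(context):
    """Stand-in for the transformer: favours the token after the last one seen."""
    last = context[-1] if context else 0
    return [2.0 if t == (last + 1) % VOCAB_SIZE else 0.0 for t in range(VOCAB_SIZE)]

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(conditioning, past_image_tokens, n_new, rng):
    """Autoregressively sample n_new image tokens, one at a time."""
    tokens = list(past_image_tokens)
    for _ in range(n_new):
        # The context holds the conditioning tokens and every image token so far.
        probs = softmax(toy_logits(conditioning + tokens))
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens[len(past_image_tokens):]

new_tokens = generate([5], [0, 1, 2], n_new=4, rng=random.Random(0))
```

Each sampled token is fed back into the context before the next prediction, which is what makes the process autoregressive.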
Finally, the video decoder, a video diffusion model, is employed to translate these predicted image tokens back into the pixel space. The video diffusion model plays a crucial role in ensuring that the generated videos are semantically meaningful, visually accurate, and temporally consistent, enhancing the overall quality of the generated content. GAIA-1’s video decoder has 2.6 billion parameters and was trained for 15 days on 32 NVIDIA A100s.
GAIA-1 has more than 9 billion trainable parameters (compared with 1 billion parameters in the June version of GAIA-1). GAIA-1’s training dataset consists of 4,700 hours of proprietary driving data collected in London, UK, between 2019 and 2023.
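The headline figure follows from the component sizes quoted above, as the short calculation below shows. The GPU-day totals are a back-of-the-envelope estimate that assumes every GPU was in use for the full 15 days, which the post does not state.

```python
# Component sizes quoted in this post, in millions of parameters
WORLD_MODEL_PARAMS_M = 6_500
VIDEO_DECODER_PARAMS_M = 2_600

total_params_m = WORLD_MODEL_PARAMS_M + VIDEO_DECODER_PARAMS_M  # 9_100, i.e. 9.1 billion

# Assumption: all GPUs were busy for the whole training period
world_model_gpu_days = 15 * 64    # 960
video_decoder_gpu_days = 15 * 32  # 480
total_gpu_days = world_model_gpu_days + video_decoder_gpu_days  # 1_440
```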
The world modelling task in GAIA-1 is formulated much as it is in large language models (LLMs): in both cases, the task is reduced to predicting the next token. Although GAIA-1 applies this methodology to video modelling rather than language, scaling laws akin to those observed in LLMs also apply to GAIA-1.
The plot below shows the scaling curve of the performance of GAIA-1’s world model. The blue points are measured validation cross-entropy of smaller variants. The total compute of the model is measured in FLOPs. By fitting a power-law curve to the data points, we managed to extrapolate the validation performance for greater compute budgets. The orange dot represents the final validation cross-entropy for the GAIA-1 world model. For comparison, the smaller version of GAIA-1 from June 2023 would lie between the final two blue points used for extrapolation. Interestingly, from our extrapolation, we conclude that there is still significant room for improvement that can be obtained by scaling data and compute.
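To make the extrapolation step concrete, here is a minimal sketch of fitting a power law by least squares in log-log space and evaluating it at a larger compute budget. The data points are synthetic (generated from a known power law), not GAIA-1 measurements, and the functional form `loss = a * compute**b` with no irreducible-loss offset is a simplifying assumption.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**b, done as a line in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)  # intercept = log of the prefactor
    return a, b

def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b

# Synthetic points, NOT GAIA-1 data: generated from loss = 20 * C**-0.05
compute_flops = [1e17, 1e18, 1e19, 1e20]
val_loss = [20.0 * c ** -0.05 for c in compute_flops]

a, b = fit_power_law(compute_flops, val_loss)
extrapolated_loss = predict(a, b, 1e22)  # prediction beyond the fitted range
```

Fitting in log space turns the power law into a straight line, so an ordinary linear regression recovers the exponent as the slope.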
These scaling laws have been identified as a characteristic pattern in the performance and capabilities of large language models. Despite the shift in the domain from text-based language tasks to video modelling, GAIA-1 exhibits analogous trends. This suggests that as GAIA-1’s model size and training data scale up, its proficiency and performance in video generation tasks continue to improve, mirroring the scalability trends observed in large language models when applied to their respective domains.
In essence, GAIA-1’s world modelling task, framed as next-token prediction over video, shares the scaling behaviour that has become a hallmark of large language models on text and language tasks. This underscores the broader applicability of scaling principles in modern AI models across diverse domains, including autonomous driving.
Improved video generation
Scaling GAIA-1 along several axes (higher image resolution, more parameters, longer training, and a larger dataset) has substantially improved video generation quality.
The example below shows the differences between a video generated by GAIA-1 in June 2023 and now. The videos have been generated by prompting the model to perform future video rollouts from the same video context.
We can appreciate that (i) the traffic lights are now visible in the conditioning frames, (ii) details of predicted scene elements such as vehicles and trees are much sharper, and (iii) the temporal consistency has improved.
Predicting diverse futures
The world model can predict different plausible futures from past context video frames. In this scene, we are approaching a roundabout. The first future predicted by the world model is to go straight.
In this second alternative future obtained by sampling, the model predicts a right turn.
The world model can also predict different traffic levels, including pedestrians, cyclists, motorcyclists, and oncoming traffic.
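Different futures arise because the next token is sampled rather than chosen deterministically. The sketch below uses top-k sampling as one possible strategy; the post only says the futures are "obtained by sampling", so the choice of top-k, the logits, and the value of k are assumptions made for illustration.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample an index from the k highest-logit entries, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Hypothetical scores for the next token given the same past context
logits = [2.0, 1.9, 0.1, -3.0]

# Same context, different random seeds: each draw is one candidate continuation
futures = [top_k_sample(logits, k=2, rng=random.Random(seed)) for seed in range(5)]
```

With k=2, only the two most likely tokens can ever be chosen, which keeps samples plausible while still allowing more than one outcome.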
Interacting with another dynamic agent
The following example demonstrates that the world model can reason about interacting with other road users. In the first future (left video), the white vehicle reverses to give way to us. In the second future (right video), we give way to that vehicle and let it execute its right turn.
Conditioning the world model on actions
The world model can interact with its environment depending on the actions that are executed. The following video shows different possible futures conditioned on different ego trajectories.
On the left-hand side, we include mild to strong left steering followed by recovery. On the right-hand side, we include mild to strong right steering followed by recovery. These trajectories were not part of the training data and highlight the generalisation capabilities of the world model.
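One way to picture such action inputs is as a sequence of scalar steering values per timestep. The sketch below builds hypothetical curvature profiles of increasing strength: the sine shape, the peak values, the units, and the sign convention (positive means left) are all assumptions for illustration, not the trajectories used in the video.

```python
import math

def steer_and_recover(peak_curvature, n_steps):
    """Curvature profile over one sine period: steer one way, then counter-steer."""
    return [peak_curvature * math.sin(2 * math.pi * t / n_steps)
            for t in range(n_steps + 1)]

# Hypothetical peak curvatures in 1/m; positive = left is an assumed convention
MILD, MODERATE, STRONG = 0.01, 0.02, 0.04

left_profiles = [steer_and_recover(p, n_steps=20) for p in (MILD, MODERATE, STRONG)]
# Mirror each profile to obtain the corresponding right-steering trajectory
right_profiles = [[-c for c in profile] for profile in left_profiles]
```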
In this example, we force the ego-vehicle to veer off its lane by doing a right steer followed by recovery. Interestingly, we observe the oncoming vehicle reacting and making a manoeuvre to avoid a collision.
Controlling generation with text
We can control aspects of the environment by prompting the world model with text.
Weather
In the following example, we prompt the world model to generate driving scenes with the text prompt “It is” followed by either “sunny”, “rainy”, “foggy”, or “snowy”.
Time of day/illumination
In the following video, we prompt the model to generate scenes with different illuminations. The prompts are “It’s daytime, we are in direct sunlight”, “The sky is grey”, “It’s twilight”, and “It’s night”. These examples illustrate the diversity of scenes GAIA-1 can generate.
Text and action-conditioned generation
It is also possible to specify the scene with text and input the actions to navigate that particular scene. In the following example, we imagine driving behind a red bus and executing the actions to overtake it.
Generating long, diverse driving scenes
GAIA-1 can generate long, diverse driving scenes entirely from imagination, as shown in the videos below.
GAIA-1 introduces a novel approach to generative world models in the context of autonomous driving. Our research showcases the potential of multi-modal learning, integrating video, text, and action inputs to create diverse driving scenarios. GAIA-1 stands out for its ability to provide fine-grained control over ego-vehicle behaviour and scene elements, enhancing its versatility in autonomous system development. GAIA-1 uses vector quantised representations to reframe the future prediction task as a next-token prediction problem, a technique commonly employed in large language models (LLMs). GAIA-1 has shown promise in its ability to comprehend various aspects of the world, such as distinguishing between scene elements like cars, trucks, buses, pedestrians, cyclists, road layouts, buildings, and traffic lights. Additionally, GAIA-1 utilises video diffusion models to generate more visually realistic driving scenes.
The application of GAIA-1 to autonomous driving has the potential to improve how we build autonomous systems, though it is important to be aware of its current limitations. To begin, our autoregressive generation process, while highly effective, requires significant processing time, making long video generation computationally intensive. That said, this method lends itself well to parallelisation, allowing us to generate multiple samples simultaneously and thereby improve overall efficiency. Additionally, our current model is primarily centred around predicting single-camera outputs, whereas a comprehensive view from all surrounding angles is crucial for autonomous driving. Our future work will extend the model’s capabilities to encompass this broader perspective and optimise its inference efficiency, making our technology even more applicable and effective.
By incorporating world models into driving models, we can enable them to better understand their own decisions and ultimately generalise to more real-world situations. Furthermore, GAIA-1 can also serve as a valuable neural simulator, allowing us to generate unlimited data for training and validating autonomous driving systems.
Qualitative comparison with prior work
Below we present a qualitative comparison between GAIA-1 and an array of other methods that are based on either GANs, diffusion models, or transformer architectures.
DriveGAN
S. W. Kim, J. Philion, A. Torralba, and S. Fidler. “DriveGAN: Towards a Controllable High-Quality Neural Simulation”, CVPR 2021.
Video Diffusion Models
J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. “Video Diffusion Models”, arXiv preprint 2022.
Flexible Diffusion Models
W. Harvey, S. Naderiparizi, V. Masrani, C. Weilbach, and F. Wood. “Flexible Diffusion Modeling of Long Videos”, NeurIPS 2022.
Imagen Video
J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans. “Imagen Video: High Definition Video Generation with Diffusion Models”, arXiv preprint 2022.
MAGVIT
L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, and L. Jiang. “MAGVIT: Masked Generative Video Transformer”, CVPR 2023.
Phenaki
R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. “Phenaki: Variable Length Video Generation From Open Domain Textual Description”, ICLR 2023.
Align your latents
A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. “Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models”, CVPR 2023.
DriveDreamer
X. Wang, Z. Zhu, G. Huang, X. Chen, and J. Lu. “DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving”, arXiv preprint 2023.
Gen-2
Runway. “Gen-2: The Next Step Forward for Generative AI”, 2023. https://research.runwayml.com/gen2.