Learning a World Model and a Driving Policy
From an early age, humans and other animals start building internal representations, or models, of the world through observation and interaction. This accumulated knowledge of the world, part of what we often refer to as ‘common sense’, allows us to navigate effectively in unfamiliar situations. These models that describe the evolution of the world around us (or world models) are paramount to how we act in our everyday lives.
Imitation learning allows machine learning models to mimic human behaviour on a given task. An example is learning to drive a vehicle in an urban environment from expert demonstrations collected by human drivers. Data collected by human drivers implicitly contains common knowledge of the world, and to account for this knowledge, we believe it is essential to build world models. Incorporating world models into our driving models is key to enabling them to properly understand the human decisions they are learning from and, ultimately, to generalise to more real-world situations.
Building on top of our work in future prediction and dreaming about driving, in our paper “Model-Based Imitation Learning for Urban Driving”, we present an approach, named MILE, that combines world modelling with imitation learning. MILE jointly learns a model of the world and a driving policy from an offline corpus of driving data. It can imagine and visualise diverse, plausible futures and use this ability to plan its future actions. Our model achieves state-of-the-art performance in the CARLA driving simulator. The biggest improvement over previous methods (a 35% increase in driving score) was observed when deploying our approach in new towns and under previously unseen weather conditions, highlighting how MILE generalises better than previous methods.
MILE’s main components are:
- Observation encoder. Since autonomous driving is a geometric problem where it is necessary to reason in 3D about the static environment and dynamic agents, we first lift the image features to 3D. The 3D feature voxels are then sum-pooled to Bird’s-Eye View (BEV) on a predefined grid. Even after this pooling, the dimensionality of the BEV features is prohibitive for a probabilistic world model, so we further compress them to a one-dimensional vector using a convolutional backbone.
- Probabilistic modelling. The world model is trained to match the prior distribution (a guess of what will happen after the executed action) to the posterior distribution (the evidence of what actually happened).
- Decoders. The observation decoder and the BEV decoder have an architecture similar to StyleGAN [Karras et al. 2019]. The prediction starts as a learned constant tensor and is progressively upsampled to the final resolution. At each resolution, the latent state is injected into the network with adaptive instance normalisation, allowing the latent states to modulate the predictions at different resolutions. An additional decoder, the driving policy, outputs the vehicle control.
- Temporal modelling. Time is handled by a recurrent network that models the latent dynamics, predicting the next latent state from the previous one.
- Imagination. From this observed past context, the model can imagine future latent states and use them to plan and predict actions using the driving policy. Future states can also be visualised and interpreted through the decoders.
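To make the shape of this computation concrete, here is a minimal sketch of the recurrent latent dynamics, the prior/posterior matching, and an imagination rollout. Everything here is an illustrative stand-in, not the actual MILE architecture: the "networks" are random linear maps in NumPy, and the dimensions, module names and stand-in policy are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE, OBS, ACT = 8, 16, 2  # latent, observation-feature and action dims (illustrative)

# Random linear maps standing in for learned modules.
W_dyn = rng.normal(scale=0.1, size=(STATE, STATE + ACT))        # recurrent latent dynamics
W_prior = rng.normal(scale=0.1, size=(2 * STATE, STATE))        # prior head: mean and log-std
W_post = rng.normal(scale=0.1, size=(2 * STATE, STATE + OBS))   # posterior head: also sees the observation

def split_gaussian(params):
    """Split a head's output into the mean and std of a diagonal Gaussian."""
    mean, log_std = params[:STATE], params[STATE:]
    return mean, np.exp(log_std)

def kl_diag_gaussian(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    return np.sum(np.log(sd_p / sd_q) + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5)

# One observed step: advance the latent state with the executed action, then
# compare the prior's guess to the posterior that has seen the new observation.
h = np.zeros(STATE)
obs_feat = rng.normal(size=OBS)   # the 1-D vector compressed from the BEV features
action = rng.normal(size=ACT)

h = np.tanh(W_dyn @ np.concatenate([h, action]))                      # recurrent update
mu_p, sd_p = split_gaussian(W_prior @ h)                              # prior: guess before evidence
mu_q, sd_q = split_gaussian(W_post @ np.concatenate([h, obs_feat]))   # posterior: with evidence

kl = kl_diag_gaussian(mu_q, sd_q, mu_p, sd_p)   # training pulls the prior toward the posterior
sample = mu_q + sd_q * rng.normal(size=STATE)   # stochastic latent state used by the decoders

# Imagination: with no new observations, sample latents from the prior instead,
# and pick actions with a (stand-in) driving policy.
def policy(latent):
    return np.tanh(latent[:ACT])  # invented toy policy, just to close the loop

for _ in range(3):
    mu_p, sd_p = split_gaussian(W_prior @ h)
    z = mu_p + sd_p * rng.normal(size=STATE)
    a = policy(z)
    h = np.tanh(W_dyn @ np.concatenate([h, a]))
```

The point of the sketch is the asymmetry between the two modes: during training the posterior has access to the observation and the KL term teaches the prior to predict it; during imagination the posterior is unavailable, so futures are rolled out by sampling from the prior alone.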
From past observations, our model can imagine plausible diverse futures and plan different actions based on the predicted future. We demonstrate this with an example of MILE approaching an intersection. The traffic light is green, and we are following a vehicle.
Predicting diverse and plausible futures
In Predicted scenario 1, the traffic light stays green. MILE correctly decides to continue driving and follows the vehicle in front.
Predicted scenario 1 (green light)
In the visualisation above, we show
- the camera observation (input to the model),
- the ground truth bird’s-eye view segmentation (not an input to the model),
- the predicted bird’s-eye view segmentation,
- the vehicle controls (outputs from the model).
In the segmentation visualisations, the colour grey indicates the drivable area and the lane markings, black represents the ego-vehicle, blue represents the other vehicles and light blue indicates pedestrians. The stop line of a traffic light is represented by a line perpendicular to the traffic direction, and the colour of the line represents the traffic light state. When the camera observation and ground truth semantic segmentation freeze, it indicates that the model started to plan control actions (based on the observed past) and to imagine the respective future states. Ten seconds of predicted actions and future bird’s-eye view segmentation are displayed.
In Predicted scenario 2, our model imagines that the light will switch from green to amber and then to red. Although this is an alternative future, the model still responds correctly to the traffic light state. We can also observe that the vehicle in front of the ego-agent manages to pass the stop line in time and continues through the intersection.
Predicted scenario 2 (red light)
Interestingly, we notice that MILE understands the logic of the traffic lights at the intersection. In the first example, while the traffic light for the ego-agent is green, the one at the other branch of the intersection is red. In the second example, we can see that the light at the other branch turns green after the ego-agent’s traffic light has switched to red.
Controlling the vehicle
MILE has the ability to imagine potential futures: the movement of other vehicles and pedestrians, the consequences of our own actions and the state of the world. This ability allows our model to reason about the possibilities and select the safest, most effective action. We found that it improves the basic driving performance of our model and enables it to successfully tackle advanced situations, as shown in the examples below.
Stopping for pedestrians
Our model stops for pedestrians twice: once as the traffic light turns green, and a second time when making the right turn.
Making sharp turns
Slowing for traffic and stopping for red light
Note that the model was never exposed to night-time driving during training. Yet it is able to generalise and handle urban traffic scenarios such as slowing for traffic and stopping for a red light.
Driving in imagination
The ability of MILE to imagine plausible futures and plan actions accordingly allows it to control the vehicle in imagination. This means that the model can successfully control the vehicle without having access to the most recent observations of the world. This is similar to what happens in the real world when, for example, a driver temporarily loses sight of the road due to sun glare or a sneeze. In both cases, the human driver loses the ability to observe the world but can nevertheless still control the vehicle. To further demonstrate the power and accuracy of the predictions, we force the model to drive not with camera observations but only with its imagined future states. Even in this extreme pressure test, our model can drive safely and effectively, executing manoeuvres like swerving to avoid a motorcycle, stopping behind other vehicles and exiting a roundabout. These examples demonstrate that the imagined futures are not just realistic and plausible but safe, effective and eminently helpful for actual driving.
In the examples below, we simulate this loss of perception by withholding any new observations from MILE. We alternate between this mode and the default mode every two seconds. When MILE is executing the actions planned in imagination, the word “IMAGINING” appears on the video and the camera observation becomes sepia coloured.
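The alternation used in these videos can be sketched as a simple fixed schedule. This is a hypothetical evaluation loop, not Wayve's code; the control frequency is an invented figure, and only the two-second switching period comes from the text.

```python
FPS = 10          # illustrative control frequency (frames per second)
PERIOD_S = 2.0    # switch between modes every two seconds, as in the videos

def mode_at(frame):
    """Return the driving mode for a given frame index."""
    phase = int(frame / (FPS * PERIOD_S))
    return "observing" if phase % 2 == 0 else "imagining"

# First two seconds use camera observations; the next two run on imagined states.
schedule = [mode_at(f) for f in range(int(4 * FPS))]
```

In "observing" frames the model would update its latent state from the posterior (camera observations available); in "imagining" frames it would roll the prior forward and execute the planned actions.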
Stopping behind other agents
Our model anticipates the need to slow down and stop for stationary agents and executes this plan from imagined states and actions.
Starting at a green light
In this example, we are stopped at a red light. As it turns green, we switch to the imagining mode and successfully navigate the intersection from an imagined plan.
Negotiating a roundabout
Our model demonstrates complex driving manoeuvres in this roundabout.
Marking a pause at a stop line
Here, MILE marks a pause at the stop line and resumes driving afterwards, showing evidence of its temporal memory.
Swerving to avoid a motorcyclist
In this last example, our model predicts a manoeuvre to successfully avoid a motorcyclist that suddenly drives into our lane. This is further evidence that MILE can reason on the future movements of dynamic agents and plan accordingly.
World modelling is one of the ways Wayve is reimagining the problem of self-driving using machine learning. What makes our technology unique is how we leverage AI, and more specifically end-to-end deep learning, to develop the on-board ‘driving intelligence’ that enables any electric vehicle to drive autonomously, even in places it has never previously been, much like how people learn to drive, but with the benefits of greater reliability and safety. Our technology is trained on vast amounts of driving data to continually improve driving performance and enable our system to adapt to changes on the road. We are showcasing the promise of our AI-powered technology through daily testing on public roads across the country on electric cars and light commercial vans. We are designing our technology to be scalable, and it is our ambition to be the first to deploy in 100 cities globally.
- If you’d like to work on these ideas with us, we are hiring!
- Read the full research paper on arXiv
- Check out the source code on GitHub
- Follow us on Twitter and LinkedIn