20 April 2021 | Research
Predicting the future from monocular cameras in bird’s-eye view
Driving often presents situations where the right decision is not obvious from a single snapshot of the world. For example, what should a driver do in the following situations?

Is the vehicle on the left yielding?

Can we drive through the intersection?

In this second scenario, should we yield to the van and the cyclist as we’re supposed to?
We can confidently predict from their motion that they are turning off the road, meaning we can make the turn.
To drive safely with other vehicles on the road, it is not only necessary to localise where they are in the present, but also to predict where they will be in the future based on how they have moved so far. Making these kinds of predictions is intuitive for humans, but it remains very challenging for autonomous systems. Prediction is a key part of being able to drive safely and effectively with other road users.
This blog post covers some of our recent developments in prediction, demonstrating, for the first time, an algorithm capable of predicting the future in bird’s-eye view from monocular video cameras. Prediction as a task isn’t new: it’s been worked on for well over a decade in pursuit of autonomous driving. It’s still regarded as one of the most challenging aspects of an autonomous driving system, as it has a significant impact on behavioural planning. Our belief is that a fully learned driver will ultimately be what gets us to a self-driving reality, and end-to-end learned prediction is a part of this. At Wayve, we build our driving system around a learned algorithm that estimates where to drive from input sensor data; prediction is one of the inductive biases we use to learn to drive effectively and efficiently.
We’ve published a paper and released the source code. This blog post follows our previous work.
Our approach
In autonomous driving, the goal is to navigate a vehicle safely and correctly through 3D space. As such, an orthographic bird’s-eye view (BEV) perspective is commonly used for motion planning and prediction based on LiDAR sensing.
Over recent years, we’ve seen camera-based perception advance to rival LiDAR-based perception, and we anticipate that this will also be possible for wider monocular vision tasks, including prediction. Building a perception and prediction system on cameras would enable a leaner, cheaper and higher-resolution visual recognition system than LiDAR sensing.
We present FIERY: a future instance prediction model in bird’s-eye view from monocular cameras only. Our model predicts future instance segmentation and motion of dynamic agents that can be transformed into non-parametric future trajectories.
Our approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack end-to-end, by estimating bird’s-eye-view predictions directly from surround monocular RGB camera inputs. We favour an end-to-end approach as it allows us to directly optimise our representation, rather than decoupling those modules in a multi-stage discrete pipeline of tasks, which is prone to cascading errors and high latency.
Further, classical autonomous driving stacks tackle future prediction by extrapolating the current behaviour of dynamic agents, without taking into account possible interactions. They rely on HD maps and use road connectivity to generate a set of future trajectories. In contrast, FIERY learns to predict future motion of dynamic agents directly from camera driving data in an end-to-end manner, without relying on HD maps or LiDAR sensing. It can reason about the inherent stochastic nature of the future, and predicts multimodal future trajectories as shown in the video below.
Future predictions
Intersection
Our model predicts multimodal future trajectories for road agents corresponding to different plausible futures (continuing straight, turning left, turning right, changing lane). Watch how the ego-vehicle and multi-agent future trajectories are holistically predicted, with multiple paths representing the distribution of possible futures.
Overtaking static vehicles
At the beginning of the sequence, our model predicts that the vehicle in front will either return to the left lane (left-side driving) or swerve to the right. A few moments later, it becomes clear that this vehicle in fact has to overtake static vehicles parked in the left lane.
U-turn
Our model can predict complex behaviours in urban environments, such as the U-turn of the vehicle in front of us.
Model architecture

- At each past timestep {1,…,t}, we lift camera inputs to 3D by predicting a depth probability distribution over pixels and using known camera intrinsics and extrinsics (see the sketch after this list).
- These features are projected to bird’s-eye view. Using past ego-motion, we transform the bird’s-eye view features into the present reference frame (time t) with a Spatial Transformer module S.
- A 3D convolutional temporal model learns a spatio-temporal state s_t.
- We parametrise two probability distributions: the present and the future distribution. The present distribution is conditioned on the current state s_t, and the future distribution is conditioned on both the current state s_t and future labels.
- We sample a latent code from the future distribution during training, and from the present distribution during inference. The current state s_t and the latent code are the inputs to the future prediction model that recursively predicts future states.
- The states are decoded into future instance segmentation and future motion in bird’s-eye view.
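To make the first step concrete, here is a minimal PyTorch sketch of lifting image features to 3D by predicting a categorical depth distribution per pixel and taking its outer product with the per-pixel features. The module name, channel sizes and shapes are illustrative assumptions rather than the released FIERY code, and the geometric splat of the resulting frustum into the bird’s-eye-view grid (which uses the camera intrinsics and extrinsics) is omitted for brevity.

```python
# Minimal sketch of the lifting step, not the released FIERY code.
# Channel sizes, shapes and the name LiftModule are illustrative assumptions.
import torch
import torch.nn as nn


class LiftModule(nn.Module):
    """Lift 2D image features to a frustum of 3D features by predicting
    a categorical depth distribution per pixel."""

    def __init__(self, in_channels: int, feat_channels: int, depth_bins: int):
        super().__init__()
        self.depth_bins = depth_bins
        # A single 1x1 convolution predicts depth logits and features jointly.
        self.head = nn.Conv2d(in_channels, depth_bins + feat_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_channels, H, W) image features from a CNN encoder
        out = self.head(x)
        depth_logits = out[:, : self.depth_bins]       # (B, D, H, W)
        features = out[:, self.depth_bins :]           # (B, C, H, W)
        depth_prob = depth_logits.softmax(dim=1)       # distribution over depth bins
        # Outer product: each pixel's feature is spread along its camera ray,
        # weighted by the probability of each depth bin.
        return depth_prob.unsqueeze(1) * features.unsqueeze(2)  # (B, C, D, H, W)


lift = LiftModule(in_channels=64, feat_channels=64, depth_bins=48)
frustum = lift(torch.randn(2, 64, 28, 60))  # shape (2, 64, 48, 28, 60)
```

In the full model, each voxel of this frustum would then be placed into its bird’s-eye-view cell using the known camera geometry, and the features from the different cameras pooled into a single top-down grid.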
The secret recipe

Learning a spatio-temporal state
Predicting the future requires understanding the past. However, learning correspondences and motion from past camera inputs can be tricky, as the ego-vehicle is also moving. As shown in the figure above, the two largest performance gains come from (i) having a temporal model to incorporate past context, and (ii) lifting past features to the common reference frame of the present. When past features are in a common reference frame (and ego-motion is therefore factored out), the task of learning correspondences and motion of dynamic agents becomes much simpler.
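This alignment can be pictured as a rigid 2D transform of each past bird’s-eye-view feature map. Below is a hedged sketch using PyTorch’s affine_grid and grid_sample; the function name, and the convention of passing the ego-motion as a yaw angle plus a translation in normalised grid coordinates, are assumptions for illustration rather than the exact Spatial Transformer used in FIERY.

```python
# Hedged sketch of warping a past bird's-eye-view feature map into the present
# reference frame given a 2D ego-motion (yaw + translation). The function name
# and the normalised-coordinate convention are assumptions for illustration.
import math

import torch
import torch.nn.functional as F


def warp_to_present(bev_feats: torch.Tensor, yaw: float, tx: float, ty: float) -> torch.Tensor:
    """bev_feats: (B, C, H, W) features from a past timestep.
    yaw: rotation between that timestep and the present, in radians.
    tx, ty: translation in normalised grid coordinates ([-1, 1])."""
    b, c, h, w = bev_feats.shape
    cos, sin = math.cos(yaw), math.sin(yaw)
    # Rigid transform in the bird's-eye-view plane as a 2x3 affine matrix.
    theta = torch.tensor([[cos, -sin, tx],
                          [sin,  cos, ty]], dtype=bev_feats.dtype)
    theta = theta.unsqueeze(0).repeat(b, 1, 1)
    grid = F.affine_grid(theta, size=(b, c, h, w), align_corners=False)
    # Bilinear resampling plays the role of the Spatial Transformer S above.
    return F.grid_sample(bev_feats, grid, align_corners=False)


past_bev = torch.randn(1, 64, 200, 200)
aligned_bev = warp_to_present(past_bev, yaw=0.05, tx=0.0, ty=-0.1)
```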
Predicting future states
When predicting the future, it is important to model its sequential nature, i.e. the prediction at time t+1 should be conditioned on the prediction at time t.
The “no unrolling” variant, which directly predicts all future instance segmentations and motions from the current state s_t, results in a large performance drop. This is because the sequential constraint is no longer enforced, in contrast to our approach, which predicts future states recursively.
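To make the contrast concrete, the sketch below unrolls the future recursively, so the prediction at time t+1 is computed from the state predicted at time t. Using a plain GRUCell on flattened state vectors is a simplification assumed here for brevity; FIERY operates on spatial bird’s-eye-view states, and the class name and dimensions are illustrative.

```python
# Minimal sketch of recursive future prediction: each state is predicted from
# the previous one, so step t+1 is conditioned on step t. A plain GRUCell on
# flattened states is assumed here for brevity; FIERY uses spatial BEV states.
import torch
import torch.nn as nn


class RecursiveFuturePredictor(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int, n_future: int):
        super().__init__()
        self.n_future = n_future
        self.cell = nn.GRUCell(input_size=latent_dim, hidden_size=state_dim)

    def forward(self, s_t: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # s_t: (B, state_dim) current spatio-temporal state
        # latent: (B, latent_dim) code sampled from the present/future distribution
        states, state = [], s_t
        for _ in range(self.n_future):
            # The hidden state carries the previous prediction forward,
            # enforcing the sequential constraint described in the text.
            state = self.cell(latent, state)
            states.append(state)
        return torch.stack(states, dim=1)  # (B, n_future, state_dim)


predictor = RecursiveFuturePredictor(state_dim=256, latent_dim=32, n_future=4)
future_states = predictor(torch.randn(2, 256), torch.randn(2, 32))
```

A “no unrolling” baseline would instead map s_t and the latent code to all future timesteps in a single forward pass, dropping the dependence of each step on the previous one.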
Probabilistic modelling
The future is inherently uncertain, and different outcomes are probable given a single, deterministic past. By naively penalising the model against the ground-truth future labels, which correspond to only one of the possible futures, we do not encourage it to learn the different modes of the future.
We introduce two probability distributions:
- The future distribution is used during training, and is conditioned on the current state s_t as well as future labels. We sample a latent code from the future distribution to guide the future prediction module to the observed future of the training sequence. The future prediction module cannot cheat and simply copy the future labels because the future distribution heavily bottlenecks the information to a low-dimensional vector.
- The present distribution is only conditioned on the current state s_t. It is encouraged to capture the different modes of the future by a mode-covering Kullback-Leibler divergence loss of the future distribution with respect to the present distribution. During inference, different latent codes sampled from the present distribution correspond to different plausible futures (see the sketch after this list).
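As a rough illustration of this setup, the sketch below parametrises both distributions as diagonal Gaussians and trains with a mode-covering KL(future ∥ present) term. The class name, layer sizes and feature dimensions are assumptions, not the released implementation.

```python
# Hedged sketch of the present and future distributions as diagonal Gaussians,
# with a mode-covering KL(future || present) loss. Names and sizes are
# illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn
import torch.distributions as D


class DistributionHead(nn.Module):
    """Maps an input vector to a diagonal Gaussian over the latent code."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent_dim))

    def forward(self, x: torch.Tensor) -> D.Normal:
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return D.Normal(mean, log_std.exp())


latent_dim = 32
present_head = DistributionHead(in_dim=256, latent_dim=latent_dim)       # sees s_t only
future_head = DistributionHead(in_dim=256 + 64, latent_dim=latent_dim)   # sees s_t and future labels

s_t = torch.randn(2, 256)                 # current spatio-temporal state
future_label_feats = torch.randn(2, 64)   # encoded future labels (training only)

present_dist = present_head(s_t)
future_dist = future_head(torch.cat([s_t, future_label_feats], dim=-1))

# Training: sample the latent code from the future distribution (rsample keeps
# gradients) and pull the present distribution towards it with KL(future || present).
latent_code = future_dist.rsample()
kl_loss = D.kl_divergence(future_dist, present_dist).sum(dim=-1).mean()

# Inference: different samples from the present distribution correspond to
# different plausible futures.
latent_code_test = present_dist.sample()
```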
Future work
Autonomous driving requires decision making in multimodal scenarios, where the present state of the world alone is not always sufficient to reason correctly. Predictive models estimating the future state of the world – particularly of other dynamic agents – are therefore a key component of robust driving. FIERY is a probabilistic future prediction model in bird’s-eye view from monocular cameras that predicts the multimodal future trajectories of road agents in an end-to-end manner.
In future work, we will be jointly training a driving policy to condition the future prediction model on future actions. Such a framework would enable effective motion planning in a model-based reinforcement learning setting.