20 April 2021  |  Research

Predicting the future from monocular cameras in bird’s-eye view

Driving often presents situations where the right decision is not obvious from a single snapshot of the world. For example, what should a driver do in the following situations?


Is the vehicle on the left yielding?

Is the vehicle on the left yielding? This situation is ambiguous because even if we have the right of way, people do periodically run those intersections instead of yielding.
With more temporal context, it becomes obvious that the vehicle was indeed giving way, and we can safely continue to drive.

Can we drive through the intersection?

In this second scenario, should we yield to the van and the cyclist as we’re supposed to?

We can confidently predict from their motion that they are turning off the road, meaning we can make the turn.

To drive safely with other vehicles on the road, it is not only necessary to localise where they are in the present, but also to predict where they will be in the future based on how they have moved so far. Making these kinds of predictions is intuitive for humans, but it remains very challenging for autonomous systems. Prediction is a key part of being able to drive safely and effectively with other road users.

This blog post covers some of our recent developments in prediction, demonstrating, for the first time, an algorithm capable of predicting the future in bird’s-eye view from monocular video cameras. Prediction as a task isn’t new: it has been worked on for well over a decade in pursuit of autonomous driving, and it is still regarded as one of the most challenging aspects of an autonomous driving system, as it has a significant impact on behavioural planning. Our belief is that a fully learned driver will ultimately be what gets us to a self-driving reality, and end-to-end learned prediction is part of this. At Wayve, we build our driving system around a learned algorithm that estimates where to drive based on input sensor data, with prediction as one of the inductive biases we use to learn driving effectively and efficiently.

We’ve published a paper and released the source code. This blog post follows our previous work.

Our approach

In autonomous driving, the goal is to navigate a vehicle safely and correctly through 3D space. As such, an orthographic bird’s-eye view (BEV) perspective is commonly used for motion planning and prediction based on LiDAR sensing.

Over recent years, we’ve seen advances in camera-based perception rival LiDAR-based perception, and we anticipate that this will also be possible for wider monocular vision tasks, including prediction. Building a perception and prediction system based on cameras would enable a leaner, cheaper and higher resolution visual recognition system over LiDAR sensing.

We present FIERY: a future instance prediction model in bird’s-eye view from monocular cameras only. Our model predicts future instance segmentation and motion of dynamic agents that can be transformed into non-parametric future trajectories.

Our approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack end-to-end, by estimating bird’s-eye-view predictions directly from surround monocular RGB camera inputs. We favour an end-to-end approach as it allows us to directly optimise our representation, rather than decoupling those modules in a multi-stage discrete pipeline of tasks, which is prone to cascading errors and high latency.

Further, classical autonomous driving stacks tackle future prediction by extrapolating the current behaviour of dynamic agents, without taking into account possible interactions. They rely on HD maps and use road connectivity to generate a set of future trajectories. In contrast, FIERY learns to predict future motion of dynamic agents directly from camera driving data in an end-to-end manner, without relying on HD maps or LiDAR sensing. It can reason about the inherent stochastic nature of the future, and predicts multimodal future trajectories as shown in the video below.

Multimodal future predictions by our bird’s-eye view network. Top two rows: RGB camera inputs. The predicted future trajectories and segmentations are projected to the ground plane in the images. Bottom row: future instance prediction in bird’s-eye view in a 100 m × 100 m area around the ego-vehicle, which is indicated by a black rectangle in the centre.

Future predictions


Our model predicts multimodal future trajectories for road agents corresponding to different plausible futures (continuing straight, turning left, turning right, changing lane). Watch how the ego-vehicle and multi-agent future trajectories are holistically predicted, with multiple paths representing the distribution of possible futures.

Overtaking static vehicles

At the beginning of the sequence, our model predicts that the vehicle in front will either return to the left lane (in left-side driving) or swerve to the right. In the following moments, it becomes clear that this vehicle in fact has to overtake static vehicles parked in the left lane.


Our model can predict complex behaviours in urban environments, such as the U-turn of the vehicle in front of us.

Model architecture

The 6 building blocks of FIERY:
  1. At each past timestep {1,…,t}, we lift camera inputs to 3D by predicting a depth probability distribution over pixels and using known camera intrinsics and extrinsics.
  2. These features are projected to bird’s-eye view. Using past ego-motion, we transform the bird’s-eye view features into the present reference frame (time t) with a Spatial Transformer module S.
  3. A 3D convolutional temporal model learns a spatio-temporal state s_t.
  4. We parametrise two probability distributions: the present and the future distribution. The present distribution is conditioned on the current state s_t, and the future distribution is conditioned on both the current state s_t and future labels.
  5. We sample a latent code from the future distribution during training, and from the present distribution during inference. The current state s_t and the latent code are the inputs to the future prediction model that recursively predicts future states.
  6. The states are decoded into future instance segmentation and future motion in bird’s-eye view.
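Step 1 above, the depth-based lifting, can be sketched as follows. This is an illustrative numpy toy, not Wayve’s implementation; the shapes `(C, D, H, W)` and the random feature tensors are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

C, D, H, W = 8, 4, 6, 10          # channels, depth bins, image height/width
features = rng.standard_normal((C, H, W))       # per-pixel image features
depth_logits = rng.standard_normal((D, H, W))   # per-pixel depth scores

# Softmax over the depth bins gives a depth probability distribution per pixel.
depth_probs = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
depth_probs /= depth_probs.sum(axis=0, keepdims=True)

# Outer product: each pixel's feature vector is spread along its camera ray,
# weighted by how likely each depth bin is. The result is a (D, C, H, W)
# frustum of 3D features that can then be splatted into bird's-eye view
# using the known camera intrinsics and extrinsics.
frustum = depth_probs[:, None] * features[None]

assert frustum.shape == (D, C, H, W)
# Probability mass is conserved: summing over depth recovers the 2D features.
assert np.allclose(frustum.sum(axis=0), features)
```

Because the depth weighting is a proper probability distribution, the lifting is differentiable and the network can learn depth implicitly from the prediction loss alone.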

The secret recipe

Future prediction performance for different ablations of our model, reporting the future Video Panoptic Quality metric.

Learning a spatio-temporal state

Predicting the future requires understanding the past. However, learning correspondences and motion from past camera inputs can be tricky as the ego-vehicle is also moving. As shown in the figure above, the two largest performance gains come from (i) having a temporal model to incorporate past context, and (ii) lifting past features to the common reference frame of the present. When past features are in a common reference frame (and therefore ego-motion is factored out) the task of learning correspondences and motion of dynamic agents becomes much simpler.
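The ego-motion compensation in (ii) can be illustrated with a toy warp. The function below is a hypothetical stand-in for the Spatial Transformer module S, using nearest-neighbour sampling on a small grid (a real implementation would use differentiable bilinear sampling):

```python
import numpy as np

def warp_to_present(bev, dx, dy, yaw):
    """Warp a past bird's-eye-view feature map into the present frame.
    Toy stand-in for the Spatial Transformer module S: dx, dy are the
    ego-motion in grid cells, yaw in radians; nearest-neighbour sampling."""
    H, W = bev.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Centre the coordinates, apply the inverse rigid motion, un-centre.
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    y, x = ys - cy, xs - cx
    cos, sin = np.cos(-yaw), np.sin(-yaw)
    src_y = np.rint(cos * y - sin * x + dy + cy).astype(int)
    src_x = np.rint(sin * y + cos * x + dx + cx).astype(int)
    valid = (src_y >= 0) & (src_y < H) & (src_x >= 0) & (src_x < W)
    out = np.zeros_like(bev)
    out[valid] = bev[src_y[valid], src_x[valid]]
    return out

# A feature at the centre of a past frame; after the ego-vehicle moves
# two cells, the same feature sits two cells away in the present frame.
bev = np.zeros((9, 9))
bev[4, 4] = 1.0
warped = warp_to_present(bev, dx=0, dy=2, yaw=0.0)
assert warped[2, 4] == 1.0
```

Once all past feature maps are expressed in this common present-frame grid, any remaining motion between frames is due to the dynamic agents themselves, which is exactly what the temporal model needs to learn.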

Predicting future states

When predicting the future, it is important to model its sequential nature, i.e. the prediction at time t+1 should be conditioned on the prediction at time t.

The “no unrolling” variant, which directly predicts all future instance segmentations and motions from the current state s_t, results in a large performance drop. This is because the sequential constraint is no longer enforced, in contrast to our approach, which predicts future states recursively.
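The recursive structure can be sketched as a simple recurrence. This numpy toy uses random weights in place of FIERY’s learned recurrent model; only the unrolling pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM, HORIZON = 16, 4, 5

# Random weights stand in for a trained recurrent cell (illustrative only).
W_s = 0.1 * rng.standard_normal((STATE_DIM, STATE_DIM))
W_z = 0.1 * rng.standard_normal((STATE_DIM, LATENT_DIM))

def step(state, latent):
    """One recurrent step: the next state is a function of the previous
    predicted state, so the sequential constraint is enforced."""
    return np.tanh(W_s @ state + W_z @ latent)

s_t = rng.standard_normal(STATE_DIM)   # present spatio-temporal state
z = rng.standard_normal(LATENT_DIM)    # sampled latent code (one future)

future_states = []
state = s_t
for _ in range(HORIZON):
    state = step(state, z)   # recursion: conditioned on the last prediction
    future_states.append(state)

assert len(future_states) == HORIZON
```

The “no unrolling” ablation would instead map s_t to all HORIZON outputs in one shot, so an error at time t+1 cannot propagate consistently into t+2 and beyond.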

Probabilistic modelling

The future is inherently uncertain: different outcomes are plausible given the same deterministic past. Naively penalising the model against the ground-truth future labels, which correspond to only one of the possible futures, does not encourage it to learn the different modes of the future.

We introduce two probability distributions:

  1. The future distribution is used during training, and is conditioned on the current state s_t as well as future labels. We sample a latent code from the future distribution to guide the future prediction module to the observed future of the training sequence. The future prediction module cannot cheat and simply copy the future labels because the future distribution heavily bottlenecks the information to a low-dimensional vector.
  2. The present distribution is only conditioned on the current state s_t. It is encouraged to capture the different modes of the future with a mode-covering Kullback-Leibler divergence loss of the future distribution with respect to the present distribution. During inference, different latent code samples from the present distribution will correspond to different plausible futures.
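As a sketch, assuming both distributions are diagonal Gaussians over a low-dimensional latent (a common choice in conditional variational models, not necessarily FIERY’s exact parametrisation), the training loss and the two sampling modes look like this:

```python
import numpy as np

def kl_diagonal_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians. Taking q = future
    distribution and p = present distribution gives the mode-covering
    direction: the present distribution must place mass wherever the
    observed future does."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Toy 2-dimensional latent (the low-dimensional bottleneck mentioned above).
mu_future, logvar_future = np.array([1.0, -0.5]), np.array([0.0, 0.0])
mu_present, logvar_present = np.array([0.0, 0.0]), np.array([0.1, 0.1])

loss = kl_diagonal_gaussians(mu_future, logvar_future,
                             mu_present, logvar_present)
assert loss >= 0.0

rng = np.random.default_rng(0)
# Training: sample the latent code from the future distribution.
z_train = mu_future + np.exp(0.5 * logvar_future) * rng.standard_normal(2)
# Inference: sample from the present distribution instead; different
# samples correspond to different plausible futures.
z_infer = mu_present + np.exp(0.5 * logvar_present) * rng.standard_normal(2)
```

The KL term is what transfers the multimodality learned at training time into the present distribution, so that at inference the model can propose several distinct futures without ever seeing the labels.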

Future work

Autonomous driving requires decision making in multimodal scenarios, where the present state of the world is not always sufficient to reason correctly alone. Predictive models estimating the future state of the world – particularly other dynamic agents – are therefore a key component to robust driving. FIERY is a probabilistic future prediction model in bird’s-eye view from monocular cameras that predicts multimodal future trajectories of road agents in an end-to-end manner.

In future work, we will be jointly training a driving policy to condition the future prediction model on future actions. Such a framework would enable effective motion planning in a model-based reinforcement learning setting.

Read the full research paper

View our source code
