1 March 2021 | Engineering
Emerging behaviour of our driving intelligence with end-to-end deep learning
Wayve is building the most adaptable driver: a Driving Intelligence that can learn to understand the world and adapt and scale intelligently to different driving domains, be they new cities, vehicle platforms or mobility use-cases. We’ve worked hard to build the necessary foundations and are excited to be deploying a solution to this problem with end-to-end deep learning.

This video shows our Driving Intelligence completing an unprotected right turn through an intersection near our London King’s Cross HQ. This is one of the hardest manoeuvres in autonomous driving, and a behaviour Wayve has been able to learn with end-to-end deep learning. Unlike other approaches, we learn to drive from data using camera-first sensing, without needing an HD-map.
We train our system to understand the world around it with computer vision and learn to drive with imitation and reinforcement learning. In this example, our Driving Intelligence is able to navigate the complex lane layout, avoiding the car that runs the red light and passing the pedestrians with human-like confidence. We have not mapped out this road or explicitly programmed our system to produce this behaviour: it emerges from the behaviour present in our natural driving data. This is what deep learning is capable of achieving.
What do we mean by end-to-end deep learning?
We learn to drive with end-to-end deep learning, from sensory input to motion-plan output. We form hierarchical layers of abstraction and can use gradient signals throughout to optimise the entire neural network. The input to our system is a video stream from 6 monocular cameras, along with supporting sensor data and ordinary sat-nav information. Our neural network contains tens of millions of parameters and learns to regress a motion plan, which a controller then actuates on the vehicle.
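As a rough illustration of this pipeline, the sketch below maps multi-camera images and route information to a regressed motion plan. It is a minimal PyTorch example under assumed shapes and module choices; the architecture, dimensions and waypoint-based plan format are illustrative, not our production system.

```python
# Hypothetical sketch of an end-to-end driving network: cameras + route -> motion plan.
# All module names, dimensions and the waypoint-based plan format are assumptions.
import torch
import torch.nn as nn


class DrivingPolicy(nn.Module):
    def __init__(self, num_cameras: int = 6, latent_dim: int = 512, plan_steps: int = 10):
        super().__init__()
        # Shared image encoder applied to each monocular camera stream.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse per-camera features with route information from the sat nav.
        self.fuse = nn.Linear(num_cameras * 64 + 2, latent_dim)
        # Regress a motion plan: (x, y, speed) for each future step.
        self.plan_head = nn.Linear(latent_dim, plan_steps * 3)
        self.plan_steps = plan_steps

    def forward(self, images: torch.Tensor, route: torch.Tensor) -> torch.Tensor:
        # images: (batch, cameras, 3, H, W); route: (batch, 2), e.g. distance/angle to next turn.
        b = images.shape[0]
        feats = self.encoder(images.flatten(0, 1)).view(b, -1)
        latent = torch.relu(self.fuse(torch.cat([feats, route], dim=-1)))
        return self.plan_head(latent).view(b, self.plan_steps, 3)


# The regressed plan would then be tracked by a downstream controller on the vehicle.
model = DrivingPolicy()
plan = model(torch.rand(1, 6, 3, 128, 256), torch.rand(1, 2))  # -> (1, 10, 3)
```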

Additionally, we produce a number of intermediate outputs that we use for development, interpretability and safety verification. These intermediate representations are not features consumed directly by the model; rather, they are decoded from intermediate latent states as auxiliary outputs and training targets. This preserves the flexibility of high-dimensional representations while improving performance by providing additional learning signals and semantic inductive biases. Some of these auxiliary outputs are learned from labelled data (semantic segmentation and traffic light state detection), while most are self-supervised and obtained directly from the vast quantities of unlabelled data available (e.g. depth, geometry, motion and future prediction).
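To make the idea of auxiliary decoding concrete, here is a hypothetical sketch of two heads decoded from a shared latent state: a supervised segmentation head and a self-supervised depth head. The shapes and head designs are assumptions for illustration; the key point is that gradients from these heads flow back into the shared representation, while the driving policy itself consumes the latent rather than these decoded outputs.

```python
# Hypothetical auxiliary heads decoded from an intermediate latent state.
# They provide extra learning signal and interpretable outputs for debugging,
# but the policy consumes the latent itself, not these decodings.
import torch
import torch.nn as nn


class AuxiliaryHeads(nn.Module):
    def __init__(self, latent_dim: int = 512, num_classes: int = 20, out_hw=(32, 64)):
        super().__init__()
        h, w = out_hw
        # Supervised auxiliary output: coarse semantic segmentation logits.
        self.segmentation = nn.Linear(latent_dim, num_classes * h * w)
        # Self-supervised auxiliary output: coarse inverse-depth map.
        self.depth = nn.Linear(latent_dim, h * w)
        self.num_classes, self.h, self.w = num_classes, h, w

    def forward(self, latent: torch.Tensor):
        b = latent.shape[0]
        seg = self.segmentation(latent).view(b, self.num_classes, self.h, self.w)
        depth = torch.sigmoid(self.depth(latent)).view(b, 1, self.h, self.w)
        return seg, depth
```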
We learn from a number of different signals and data sources with multi-task training (a simple combination of these objectives is sketched after this list):
- Imitation learning from expert policy data
- Reinforcement learning from safety driver intervention during on-policy testing
- Safety driver corrective action following an intervention
- Modelling dynamics and future state prediction from off-policy data
- Intermediate computer vision representations of semantics, motion and geometry using supervised and self-supervised learning
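The sketch below shows one simple way such objectives could be combined into a weighted multi-task loss. The individual terms and weights are illustrative placeholders standing in for the imitation, corrective-action and auxiliary vision objectives above, not our actual training objective.

```python
# Illustrative multi-task objective: imitation + supervised and self-supervised auxiliary terms.
import torch.nn.functional as F


def multi_task_loss(pred_plan, expert_plan,
                    seg_logits, seg_labels,
                    pred_depth, depth_target,
                    w_imitation=1.0, w_seg=0.1, w_depth=0.1):
    # Imitation learning: regress the expert's motion plan.
    imitation = F.l1_loss(pred_plan, expert_plan)
    # Supervised auxiliary task: semantic segmentation from labelled frames.
    segmentation = F.cross_entropy(seg_logits, seg_labels)
    # Self-supervised auxiliary task: e.g. depth consistent with a photometric target.
    depth = F.l1_loss(pred_depth, depth_target)
    return w_imitation * imitation + w_seg * segmentation + w_depth * depth
```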
Combining these learning objectives and learning from petabytes of driving data is what we refer to as Fleet Learning. From here, we intend to get even more out of the abundant unlabelled data by deploying more self-supervised approaches, for example methods for unsupervised object discovery or contrastive similarity learning that reduce the need for labelled segmentation data.
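As an illustration of the contrastive direction, the following is a standard InfoNCE-style objective on unlabelled frames: embeddings of two augmented views of the same frame are pulled together, while other frames in the batch act as negatives. This is a generic formulation, not a description of our pipeline.

```python
# Generic InfoNCE contrastive loss between two augmented views of the same frames.
import torch
import torch.nn.functional as F


def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.1):
    # view_a, view_b: (batch, dim) embeddings of two augmentations of the same frames.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                        # pairwise similarities within the batch
    targets = torch.arange(a.shape[0], device=a.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```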
Why is end-to-end deep learning a good approach for autonomous driving?
Autonomous driving, especially in complex urban environments, is a hugely challenging task. The input is complex and high-dimensional, the demands for safety require robustness, and rich, varied urban environments require strong generalisation in both perception and behaviour. We are not aware of any technology in the world better suited to this task than deep learning.
Historically, deep learning has been challenging to debug and hard to scale, requiring enormous amounts of data. However, thanks to our partnership with Microsoft Azure and a strategy to amass fleet scale data, along with rapid and plentiful advances in the field of machine learning, we feel that now is the time to tackle this. We believe fully learning-driven methods will be the future of autonomous mobility and the progress we share here is strong evidence towards their near-term commercial viability.
In particular, the advantages of this method include:
- Scalable unit economics: lean compute and sensor requirements for model inference on the vehicle, with most of the cost shifted to large-scale training of deep learning models.
- Ability to operate without the cost of an HD-map, giving a step-change in scalability and adaptability. We can operate in any UK city despite our experience predominantly being within central London.
- No requirement for human labelling or hand-crafting of motion plans: the behaviours are learned. For example, in the videos below, we learnt to pass double-parked cars without ever designing a double-parked-vehicle detection system.
- Increasing performance with scale and complexity. As we get more data or experience more edge cases, our system does not become more brittle, as a rule-based system might. Rather, it optimises a representation that abstracts more general concepts, resulting in better generalisation and improved driving intelligence.
What behaviours can we learn?
We train our Driving Intelligence in Microsoft Azure over hundreds of millions of data-points using distributed training with many GPU compute nodes. With this scale of data, we are now seeing reliable behaviour of our system in some driving tasks that many humans find stressful! What’s particularly exciting is that we haven’t needed to hand-craft individual rules and policies to implement each of these behaviours, such as navigating road works, reacting to traffic lights or avoiding double-parked-vehicles. Rather, our end-to-end learning approach has learnt to generalise to these scenarios from our training data.
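For context on what distributed training of this kind typically looks like, the snippet below sketches the generic multi-GPU pattern with PyTorch’s DistributedDataParallel. The model, dataset and loss are placeholders, and the launcher, data pipeline and Azure cluster configuration are deliberately omitted; this is not our training stack.

```python
# Generic multi-GPU training loop with DistributedDataParallel; model, dataset and
# loss are placeholders, cluster configuration is omitted.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def train(model, dataset, epochs: int = 1, lr: float = 1e-4):
    dist.init_process_group("nccl")                  # one process per GPU, launched e.g. via torchrun
    device = int(os.environ.get("LOCAL_RANK", 0))    # torchrun sets LOCAL_RANK for each process
    model = DDP(model.cuda(device), device_ids=[device])
    sampler = DistributedSampler(dataset)            # each process sees a disjoint shard of the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                     # reshuffle shards every epoch
        for images, route, expert_plan in loader:
            pred_plan = model(images.cuda(device), route.cuda(device))
            loss = torch.nn.functional.l1_loss(pred_plan, expert_plan.cuda(device))
            optimiser.zero_grad()
            loss.backward()                          # gradients are all-reduced across all GPUs here
            optimiser.step()
    dist.destroy_process_group()
```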
Here’s our system driving through a few roundabouts in North London.
Here we pass two double-parked vehicles on a road with frequent traffic islands. This is a very common occurrence in central London.
Here’s some of the narrow and tight navigation that is often required when driving on London’s busy streets. We navigate with precision past these vehicles using only monocular cameras; techniques for learning metric geometric representations from monocular computer vision have developed substantially in recent years.
Often roadworks change the way we need to navigate and make it difficult to rely on a pre-existing HD-map. Here we understand and navigate through a roadworks scenario using online scene understanding from our computer vision.
And finally an example of safe navigation past pedestrian crossings.
Of course, autonomous driving requires more than demonstrations of intelligence; it also needs statistical evidence of robust, repeatable performance. To read more about how we test and measure the performance of our autonomous vehicle, see our blog series here. How do we ensure our system is acceptably safe? We’ll be sharing more about our strategy for building a safety case soon.