2 December 2019 | Research
Learned urban driving
Earlier this year, in February, we demonstrated the first end-to-end learned driving system with full vehicle control, following a user-defined route. This was an exciting step for the development of AI for mobile robotics, as we showed that a different approach to the huge challenge of autonomous driving is possible: one which does not depend on infrastructure such as HD maps or costly sensor payloads, yet can handle complex driving on a previously unseen route (including interaction with traffic).

In this post, we’d like to share some of the details which led us to this proof point. For the full technical picture, we have released a preprint on arXiv, which we will be presenting at the ML4AD workshop at NeurIPS.
Since showing this technical world first, we have continued growing as a company (see the news articles in The Times and VentureBeat). We have successfully closed a fundraising round, moved our headquarters from Cambridge to London, rebased our technology development from a modified Renault Twizy onto a fleet of Jaguar I-PACE electric vehicles, and continued to push the technical envelope for state-of-the-art learned autonomous driving.

How we structured the problem
Machine learning methods can broadly be categorised as supervised learning, unsupervised learning, or reinforcement learning. In the context of learning a control policy, the primary learning signal can come either from expert demonstration data or from on-policy feedback and intervention data. In our very first technical demonstration, we showed that we can use reinforcement learning to learn lane following from scratch, using only sparse safety-driver interventions.
Here we focused on learning by copying human driving, an approach often referred to as imitation learning. An extension of this is conditional imitation learning, which trains a neural network to perform different actions depending on a command input provided in addition to the observed state. We use this command to direct the car along a user-defined route.
In short, this means collecting human driving data and training a machine learning model to copy this behaviour. Specifically, we learn a simple motion plan directly from RGB images, and execute that plan with the car.
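To make the conditional imitation idea concrete, below is a minimal PyTorch-style sketch of a command-conditioned policy. Everything in it is illustrative (the layer sizes, the three route commands, a four-step plan of speed/steering values); it is a sketch of the technique, not our production architecture.

```python
import torch
import torch.nn as nn

class ConditionalDrivingPolicy(nn.Module):
    """Command-conditioned imitation policy: a shared image encoder
    feeds one small head per route command, each predicting a short
    motion plan (here, future speed/steering pairs)."""

    def __init__(self, num_commands: int = 3, plan_horizon: int = 4):
        super().__init__()
        # Shared convolutional encoder producing a compact representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One branch per route command (e.g. turn left / go straight / turn right).
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, plan_horizon * 2))
            for _ in range(num_commands)
        ])

    def forward(self, image: torch.Tensor, command: torch.Tensor) -> torch.Tensor:
        z = self.encoder(image)                                    # (B, 64)
        plans = torch.stack([b(z) for b in self.branches], dim=1)  # (B, commands, horizon*2)
        # Select the branch matching each sample's route command.
        idx = command.view(-1, 1, 1).expand(-1, 1, plans.size(-1))
        return plans.gather(1, idx).squeeze(1)                     # (B, horizon*2)

model = ConditionalDrivingPolicy()
plan = model(torch.randn(2, 3, 128, 256), torch.tensor([0, 2]))    # two images, two commands
```

The key design choice is that the route command selects which output branch is used, so a single network can behave differently at the same intersection depending on the requested route.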

We trained the system outlined in the figure above with just 30 hours of driving data, and ran 35 km of autonomous testing on two urban routes which did not feature in our training data. We also ran a suite of driving tests with an adversarial pace car, which forced the test vehicle to stop on demand while driving the test route.
This system is the first to demonstrate learned full vehicle control (i.e., both lateral and longitudinal) while following a user-prescribed route in urban environments and interacting with other traffic.
We were inspired by a number of previous pieces of work:
- Nvidia’s demonstration of steering-only control and interpretability with end-to-end deep learning on novel roads without traffic,
- Demonstration of full vehicle control with conditional imitation learning in simulation and constrained environments with a toy RC car,
- Demonstration of learning steering-only control and localisation prediction with a richer learned route embedding,
- Learned approaches to parts of typical autonomous vehicle software stacks such as motion planning,
…and more.
What did we learn?
We learned a huge amount about this fundamentally new approach to autonomous driving. We took away three key conclusions, and a whole suite of ideas for improvements.
1) The data distribution is hugely important
Driving data is not uniformly distributed, and naive training will collapse to the dominant mode in the data: the net result is a model which drives straight, at the average speed. The distribution of the 30 hours of training data we used is visualised in the heat map and histograms below: the amount of time we spend driving straight and sitting in stationary traffic is substantial!


We addressed this issue by applying data-driven balancing to the training set, ensuring that rare and common manoeuvres contribute comparably during training. This removes the need for the data augmentation used in previous work (offset cameras during data collection to synthesise corrective behaviour).
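One simple way to implement this kind of balancing is inverse-frequency sampling over binned control labels. The sketch below is an assumption about how such balancing could work, not necessarily the scheme we used:

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def balancing_weights(steering: np.ndarray, speed: np.ndarray,
                      num_bins: int = 16) -> np.ndarray:
    """Weight each example inversely to the population of its
    (steering, speed) bin, so rare manoeuvres are sampled as often
    as driving straight at the average speed."""
    s = np.digitize(steering, np.linspace(steering.min(), steering.max(), num_bins))
    v = np.digitize(speed, np.linspace(speed.min(), speed.max(), num_bins))
    joint = s * (num_bins + 2) + v                   # unique id per 2-D bin
    _, inverse, counts = np.unique(joint, return_inverse=True, return_counts=True)
    return 1.0 / counts[inverse]

# Hypothetical labels; in practice these come from the driving logs.
steering = np.random.randn(10_000) * 0.1
speed = np.abs(np.random.randn(10_000)) * 5.0
weights = balancing_weights(steering, speed)
sampler = WeightedRandomSampler(weights.tolist(), num_samples=len(weights))
# Pass sampler=sampler to the training DataLoader.
```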
2) Robust computer vision representations matter
Achieving robust performance in the real world, with real images, is challenging. A large part of this challenge can be attributed to appearance change: the same scene can look very different depending on time of day, weather, and many other factors. This diversity challenges data-driven approaches, as it is very easy for a learned model to overfit to the training set and fail to generalise to the test set. In robotics, the ‘test’ data is the real world, not a static dataset as in most ML problems. Every time our cars go out, the world is new and unique.
It is extremely important that the models we train generalise to the real world. In principle, with infinite data we could learn a driving system that generalises well. This is the approach we see in DeepMind’s work on AlphaStar in simulation, and in OpenAI’s work on learning to manipulate objects with a robot that can be left running for hours unattended: in these settings it is possible to generate massive amounts of training data for the task at hand. Pragmatically, this does not apply to most real-world problems.
We address this problem by turning to computer vision.

We encourage robustness in the high-dimensional representation for driving (Z in the first figure) by jointly training a shared primary encoder and independent auxiliary decoders for a number of computer vision tasks, each of which provides information that we know is useful for driving (a sketch of this layout follows the list):
- What is around me? Semantic segmentation.
- What is the geometry of the scene? Monocular depth estimation.
- How are different parts of the scene moving? Optical flow estimation.
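A rough illustration of this shared-encoder, multi-decoder layout in PyTorch; channel sizes, decoder depths, and the class count are placeholders, not our actual network:

```python
import torch.nn as nn

class PerceptionBackbone(nn.Module):
    """Shared encoder with independent decoders for segmentation,
    depth, and optical flow. Shapes and depths are illustrative."""

    def __init__(self, num_classes: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(                # produces the driving representation Z
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )

        def decoder(out_channels: int) -> nn.Module:
            return nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1),
            )

        self.segmentation = decoder(num_classes)     # what is around me?
        self.depth = decoder(1)                      # scene geometry
        self.flow = decoder(2)                       # per-pixel motion (a real flow head
                                                     # would consume two consecutive frames)

    def forward(self, image):
        z = self.encoder(image)
        return z, self.segmentation(z), self.depth(z), self.flow(z)
```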
Our real-world testing indicated that this learned vision representation is critical to achieving real-world driving efficiently.
3) Use multiple learning signals for data efficiency
Field robotics is extremely time consuming, requiring a large amount of forward planning to ensure testing is efficient and effective. Part of this problem can be mitigated by effective field-test logistics; we like Prof. Tim Barfoot’s principles for field trials, described in this blog post. In an end-to-end context, field trials include both data collection and closed-loop evaluation.
In order to get the most out of our vehicle time in both training and testing, we prioritise learning efficiently. This means making judicious use of the vehicles, being rigorous in testing, and working to get the most out of the data we do have. Prior work in this area has been trained fully end-to-end, mapping RGB pixels to vehicle actions. RGB data is high-entropy, which makes it difficult to learn a robust control policy data-efficiently this way. In these situations, some authors heavily crop images, or constrain the problem to simulation.
In particular, we found that data efficiency can be improved with multiple learning signals: the computer vision tasks we train are one example, but the idea is not limited to a single task. Examples of learning signals available to us include (combined into a single objective in the sketch after this list):
- Imitation of human expert drivers (supervised learning),
- Safety driver intervention data (negative reinforcement learning) and corrective action (supervised learning),
- Geometry, dynamics, motion and future prediction (self-supervised learning),
- Labelled semantic computer vision data (supervised learning),
- Simulation (supervised learning).
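A hypothetical combined objective over several of these signals might look like the following; the `model.perception` / `model.policy` interface and the loss weights `w` are assumptions for illustration, not our actual implementation:

```python
import torch.nn.functional as F

def total_loss(model, batch, w):
    """Hypothetical combined objective: an imitation term on the expert
    motion plan, plus auxiliary vision terms that share the encoder.
    The interface and weights are illustrative assumptions."""
    z, seg, depth, flow = model.perception(batch["image"])
    plan = model.policy(z, batch["command"])
    return (w["plan"] * F.mse_loss(plan, batch["expert_plan"])        # imitation (supervised)
            + w["seg"] * F.cross_entropy(seg, batch["seg_labels"])    # semantics (supervised)
            + w["depth"] * F.l1_loss(depth, batch["depth_target"])    # geometry (self-supervised in practice)
            + w["flow"] * F.l1_loss(flow, batch["flow_target"]))      # motion (self-supervised in practice)
```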
As a result of these additional learning signals, we were able to learn a useful driving policy for simple urban driving with 30 hours of labelled control data. This is a comparatively small amount of driving data, and we attribute this partly to pretraining a learned computer vision representation. Humans build visual prediction models over 16-17 years before learning to drive, yet typically need only ~30 hours of driver training before taking a driving test.
What was exciting to us was considering the impact of data on performance. We deliberately degraded the model by reducing the training corpus, and looked at the influence of data volume on intervention rates. Deep learning is known to scale very well with data, often with a logarithmic relationship, and we observed a very strong correlation in our data too.
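Checking for such a log-linear trend is straightforward; a minimal sketch with made-up placeholder numbers (not our measured results):

```python
import numpy as np

# Hypothetical (hours of training data, interventions per km) pairs;
# placeholder numbers only, not our measured results.
hours = np.array([5.0, 10.0, 20.0, 30.0])
interventions_per_km = np.array([4.0, 2.6, 1.5, 1.0])

# Fit interventions ~ a * log(hours) + b and report the correlation.
a, b = np.polyfit(np.log(hours), interventions_per_km, deg=1)
r = np.corrcoef(np.log(hours), interventions_per_km)[0, 1]
print(f"slope={a:.2f}, intercept={b:.2f}, correlation={r:.2f}")
```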

This is exciting, as it gives us a clear signal that increased data diversity is likely to improve performance. We’re not at performance levels near a human driver today, nor do we know the performance asymptote, but this indicates that we have two levers to pull on for increasing performance: 1) algorithmic innovation, and 2) data.
What new challenges are we focusing on?
The system we tested here provided a key proof point: these learned systems can indeed work. There are many interesting challenges ahead for this technology. In particular, we believe the following are needed to get learned driving systems to the point where they could rival (or exceed) human driving performance.
1. Predictive and temporal model capacity
Applying temporal prediction to multimodal scene understanding is essential for driving, for example, assessing potential traffic behaviour at an intersection, or driving safely with an occluded cyclist.
2. Efficient offline evaluation of driving performance
Currently, open-loop metrics (i.e., evaluation on an offline dataset) are considered to be poor indicators of closed-loop performance (i.e., evaluation with the control loop closed). The huge progress we have seen in machine learning and computer vision over the past few years is largely attributable to the availability of datasets and effective benchmarks: with a benchmark in place, the research community is extremely good at solving a challenge. Examples include ImageNet, Cityscapes, and others. Currently, no such performance benchmarks exist for learned visual control for autonomous driving, or for any similar problem in visual control. The CARLA simulator is making great progress towards this, but as a research community we do not yet have a standardised benchmark which is fully representative of the real-world driving problem. We see this as a key gap that needs to be addressed.
3. Learning from both demonstration and mistakes
Here we focused solely on learning from human demonstrations, whereas previously we demonstrated the ability to learn solely from intervention data. In practice, both approaches are required to efficiently learn a robust control policy.
Stay tuned for further updates on our work in these areas!