25 June 2018  |  Research

Learning to drive in a day

The first example of deep reinforcement learning on-board an autonomous car

Do you remember learning to ride a bicycle as a child? Excited and mildly anxious, you probably sat on a bicycle for the first time and pedalled while an adult hovered over you, ready to catch you if you lost balance. After some wobbly attempts, you perhaps managed to balance for a few metres. Several hours in, you were probably zipping around the park on gravel and grass alike.

The adult would have only given you brief tips along the way. You did not need a dense 3D map of the park nor a high fidelity laser on your head. You did not need a long list of rules to follow to be able to balance on the bicycle. The adult simply gave you a safe environment for you to learn how to map what you see to what you should do, to successfully ride a bicycle.

Today’s self-driving cars are packed with a large array of sensors and are told how to drive through long lists of carefully hand-engineered rules, developed over slow iteration cycles. In this blog post, we go back to basics and let a car learn to follow a lane from scratch, through clever trial and error, much like how you learnt to ride a bicycle. Have a look at what we did:

Video: our car learning to follow a lane from scratch, available on YouTube.

In just 15-20 minutes, we were able to teach a car to follow a lane from scratch, with safety-driver takeovers as the only training feedback.

No dense 3D map. No hand-written rules.

This is the first example of an autonomous car learning online, getting better with every trial. So, how did we do it?

We adapted a popular model-free deep reinforcement learning algorithm (deep deterministic policy gradients, DDPG) to solve the lane following task. Our model input was a single monocular camera image. Our system iterated through 3 processes: exploration, optimisation and evaluation.
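For the curious, here is a minimal sketch of a DDPG update step, assuming PyTorch; the layer sizes, learning rates and names are illustrative placeholders rather than our production configuration, and a small feature vector stands in for the monocular image to keep the sketch short.

```python
# Minimal DDPG update sketch (illustrative, assuming PyTorch).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 64, 2   # e.g. image features -> (steering, speed)
GAMMA, TAU = 0.99, 0.005        # discount factor and soft-update rate

class Actor(nn.Module):
    """Deterministic policy: state -> action in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function: (state, action) -> scalar Q."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    """One optimisation step on a replay batch of (s, a, r, s', done)."""
    s, a, r, s2, done = batch
    r, done = r.view(-1, 1), done.view(-1, 1)   # broadcast-safe shapes
    # Critic: regress Q(s, a) towards the bootstrapped one-step target.
    with torch.no_grad():
        target_q = r + GAMMA * (1 - done) * critic_target(s2, actor_target(s2))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend the critic's estimate of the policy's value.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft-update the target networks towards the online networks.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)
```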

Our network architecture was a deep network with 4 convolutional layers and 3 fully connected layers, with just under 10k parameters in total. For comparison, state-of-the-art image classification architectures have tens of millions of parameters.
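For a sense of scale, here is an illustrative network in that family; the input resolution and channel counts below are assumptions chosen to land under 10k parameters, not our published architecture.

```python
# Illustrative 4-conv + 3-FC policy network at roughly this scale (PyTorch).
import torch
import torch.nn as nn

class LanePolicy(nn.Module):
    """4 convolutional + 3 fully connected layers; 8,282 parameters here."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                              # 3x48x48 image in
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),    # -> 8x24x24
            nn.Conv2d(8, 12, 3, stride=2, padding=1), nn.ReLU(),   # -> 12x12x12
            nn.Conv2d(12, 12, 3, stride=2, padding=1), nn.ReLU(),  # -> 12x6x6
            nn.Conv2d(12, 16, 3, stride=2, padding=1), nn.ReLU())  # -> 16x3x3
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 3 * 3, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, 2), nn.Tanh())   # steering and speed in [-1, 1]
    def forward(self, img):
        return self.head(self.encoder(img))

model = LanePolicy()
print(sum(p.numel() for p in model.parameters()))  # 8282 in this configuration
```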

Diagram: Wayve’s exploration, optimisation and evaluation loop.

All processing was performed on one graphics processing unit (GPU) on-board the car.

Working with a real robot in a dangerous, real-world environment poses many new problems. To better understand the task at hand and to find suitable model architectures and hyperparameters, we did a lot of testing in simulation.

Above is an example of our simulated lane-following environment, shown from different angles. The algorithm only sees the driver’s perspective, i.e. the image with the teal border. At every episode, we randomly generate a curved lane to follow, along with the road texture and lane markings. The agent explores until it leaves the lane, at which point the episode terminates; the policy is then optimised on the collected data, and we repeat.
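In code, that loop looks roughly like the sketch below. Here `env` stands in for a hypothetical gym-style wrapper around our simulator (in the real car, episodes end at safety-driver takeover rather than lane departure), and `ddpg_update` is the optimisation step sketched earlier; all interfaces are illustrative assumptions.

```python
# Sketch of the explore-then-optimise episode loop (illustrative interfaces).
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=50_000):
        self.data = deque(maxlen=capacity)

    def add(self, *transition):
        self.data.append(transition)

    def sample(self, batch_size):
        columns = zip(*random.sample(self.data, batch_size))
        # Stack each column into a (batch, ...) float32 tensor.
        return [torch.as_tensor(np.stack(c), dtype=torch.float32) for c in columns]

def run_training(env, actor, buffer, episodes=20, updates_per_episode=100):
    for _ in range(episodes):
        obs, done = env.reset(), False             # fresh randomly generated lane
        while not done:                            # explore until lane departure
            with torch.no_grad():
                x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
                action = actor(x).squeeze(0).numpy()
            action += np.random.normal(0.0, 0.1, size=action.shape)  # exploration noise
            next_obs, reward, done, _ = env.step(action)
            buffer.add(obs, action, reward, next_obs, float(done))
            obs = next_obs
        for _ in range(updates_per_episode):       # optimise on the collected data
            ddpg_update(buffer.sample(batch_size=64))
```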

Chart: distance travelled by the car before a safety-driver takeover, against the number of exploration episodes.

We used simulated tests to try out different neural network architectures and hyperparameters until we found settings which consistently solved the lane-following task in very few training episodes, i.e. with little data. For example, one of our findings was that training the convolutional layers with an auto-encoder reconstruction loss significantly improved the stability and data-efficiency of training. See our full technical report for more details.
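As an illustration of how such an auxiliary reconstruction loss can be wired in, here is a sketch that reuses the illustrative encoder from the network above; the decoder layout and loss weighting are assumptions, not the published configuration.

```python
# Auxiliary auto-encoder reconstruction loss on the conv features (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Mirror of the encoder: 16x3x3 features back to a 3x48x48 image in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(16, 12, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(12, 12, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(12, 8, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(8, 3, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid())
    def forward(self, z):
        return self.net(z)

decoder = Decoder()

def reconstruction_loss(policy, images, weight=1.0):
    """Auxiliary term pushing the shared conv features to preserve the image."""
    features = policy.encoder(images)   # 16x3x3 latent from the policy network
    return weight * F.mse_loss(decoder(features), images)
```

Adding this term to the policy’s training loss keeps the shared convolutional features informative even when the reinforcement signal alone is sparse, which is one plausible reason it helps stability and data-efficiency.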

The potential implications of our approach are huge

Imagine deploying a fleet of autonomous cars with a driving algorithm that is initially 95% as good as a human driver. Such a system would not be wobbly like the randomly initialised model in our demonstration video; rather, it would be almost capable of dealing with traffic lights, roundabouts, intersections, and so on. After a full day of driving and online improvement from safety-driver takeovers, perhaps the system would improve to 96%. After a week, 98%. After a month, 99%. After a few months, the system may be super-human, having benefited from the feedback of many different safety drivers.

Today’s self-driving cars are stuck at good-but-not-good-enough performance levels. Here, we have provided evidence for the first viable framework for quickly improving driving algorithms from mediocre to roadworthy. The ability to quickly learn to solve tasks through clever trial and error is what has made humans incredibly versatile machines, capable of evolution and survival. We learn through a mixture of imitation and lots of trial and error, for everything from riding a bicycle to learning how to cook.

DeepMind have shown us that deep reinforcement learning methods can lead to super-human performance in many games, including Go, chess and a range of computer games, almost always outperforming any rule-based system. Here we show that a similar philosophy is also possible in the real world, and in particular in autonomous vehicles. A crucial point to note is that DeepMind’s Atari-playing algorithms required millions of trials to solve a task. It is remarkable that we consistently learnt to follow a lane in under 20 trials.

We learnt to follow lanes from scratch in 20 minutes.

Imagine what we could learn to do in a day…?

Wayve’s philosophy is that to build robotic intelligence, we do not need massive models, fancy sensors and endless data. What we need is a clever training process that learns rapidly and efficiently, as in our video above. Hand-engineered approaches to the self-driving problem have hit an unsatisfactory glass ceiling in performance. Wayve is attempting to unlock autonomous driving capabilities with smarter machine learning.

With special thanks

We would like to thank StreetDrone for building us an awesome robotic vehicle, Admiral for insuring our vehicle trials and the Cambridge Polo Club for granting us access to their private land for our lane-following research.

 

Download the full paper