10 January 2023  |  Research

Evaluating driving performance in diverse simulated worlds

Model evaluation is a big unsolved problem in the end-to-end self-driving field, commonly referred to as AV2.0. We present our latest research on using multi-agent reinforcement learning to simulate the diversity of real-world dynamic behaviour on our streets, in order to improve our ability to evaluate models.

Wayve's Jaguar I-Pace driving through the streets of London

At the heart of the autonomous driving challenge is the ability to understand and act safely across the full complexity of the “open world”. Where does this complexity come from? Us. Humans can be unpredictable, inconsistent, and sometimes unsafe as we drive, cycle, walk, or scoot. Even without noticing it, we constantly interact with other people as we move around. Our actions are shaped by our own goals, conventions, laws, safety, and other considerations. The complex behaviour that emerges from these interactions makes the real world such a challenging environment for autonomous vehicles to operate in: we need our automated driving system to perform well in all situations, especially safety-critical ones.

In this blog post, we’ll explore some early results of our latest research into the exciting area of multi-agent reinforcement learning and how that allows us to populate our Wayve Infinity Simulator with the emergent complexity of behaviour that is critical to making it representative of the real world. This, in turn, makes simulation an ever more powerful tool to accelerate our journey by evaluating the performance of our driving intelligence models.

This behaviour emerges from hundreds of dynamic actors or “agents”, each making independent decisions about how to act in a shared world based on reward signals that approximate real-world goals, conventions, laws, and safety considerations.

Like the real world, our simulator needs to be populated with many agents: good drivers and bad; cautious and aggressive; law-abiding and law-breaking; and more.

The video below presents a first-person view of a trained reinforcement learning (RL) agent navigating the environment with a realistic density of other agents, including pedestrians, cars, buses and vulnerable road users. In addition to the ego vehicle, all of the dynamic traffic is powered by a diverse population of RL-trained agents. The video is shown at 2.5x real-time speed.

This next video shows a first-person view of a trained RL agent driving in an environment with an artificially heightened density of pedestrian activity to stress-test the RL agent’s reactivity to pedestrians on the road. The video is shown at 2.5x real-time speed.

Here you can see how we can simulate a busy intersection with a mixture of mild and aggressive RL agents controlling various cars and bicycles. The behaviour shown here does not always adhere to the rules of the road, but it portrays the good and bad driving behaviour we might experience in the real world. Creating scenes like this gives us a more challenging environment to evaluate our driving models. The video is shown at 2.5x real-time speed.

We want our end-to-end learned driving models to behave well in all situations, including challenging ones created by other moving agents. Evaluating models in the real world under the close supervision of our human safety drivers is the final stage of our testing pipeline. But before we get there, we first test in simulation and thus must recreate a diversity of behaviours in simulation. The population of agents is the key to efficiently generating a diverse range of behaviour that covers both good and bad driving and includes the “long tail” of unusual, weird, and rare edge cases that would be infeasible to program by hand.

Wayve diagram showing data volumes v edge cases

Wayve’s approach

We use deep neural networks trained with reinforcement learning to control the majority of agents in our simulations. The agents receive ‘privileged’ information about everything needed to act: perfect information about static obstacles, traffic lights, and the state of nearby agents. This gives substantial computational speed-ups compared to agents that would need the entire scene to be rendered, allowing us to run hundreds of agents per environment efficiently. Each agent’s neural network is built from multiple blocks of cross attention, with an architecture inspired by Perceiver, a performant variant of the transformer architecture. To train this architecture, we use Phasic Policy Gradient, an extension of the widely used Proximal Policy Optimization (PPO) algorithm, as the base reinforcement learning algorithm for each agent. We manually design simple but meaningful rewards depending on the type of agent.
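To make the cross-attention idea concrete, here is a minimal NumPy sketch of a single Perceiver-style cross-attention step, in which a small fixed set of latent vectors attends over a variable number of observed entities (nearby agents, traffic lights, obstacles). This is an illustration under our own assumptions rather than the production architecture: the dimensions, the random weights, and the names `cross_attention`, `latents` and `observation` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(latents, inputs, w_q, w_k, w_v):
    """Single-head cross attention: latents query a variable-length input set.

    latents: (num_latents, d_latent) array of query tokens
    inputs:  (num_inputs, d_input) array, one row per observed entity
    Returns the attended output (num_latents, d_attn) and the attention
    weights (num_latents, num_inputs).
    """
    q = latents @ w_q                         # (num_latents, d_attn)
    k = inputs @ w_k                          # (num_inputs, d_attn)
    v = inputs @ w_v                          # (num_inputs, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each latent's weights sum to 1
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_input, d_latent, d_attn, num_latents = 6, 8, 16, 4
w_q = rng.normal(size=(d_latent, d_attn))
w_k = rng.normal(size=(d_input, d_attn))
w_v = rng.normal(size=(d_input, d_attn))
latents = rng.normal(size=(num_latents, d_latent))

# The output size stays fixed however many entities are nearby.
for num_entities in (3, 40):
    observation = rng.normal(size=(num_entities, d_input))
    summary, weights = cross_attention(latents, observation, w_q, w_k, w_v)
    print(num_entities, summary.shape)
```

The property worth noticing is that the output shape depends only on the number of latents, not on how many entities are in view, which is one reason this family of architectures suits scenes with a changing number of nearby agents.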

Wayve's reinforcement learning algorithm diagram

The video shows a zoomed-out view of a densely populated simulated town with hundreds of agents fluidly navigating the streets using reinforcement learning policies.

We use recent advances in multi-agent reinforcement learning to create simulated agents that achieve diverse, performant and realistic behaviours. Other research groups working on competitive games like StarCraft and Dota discovered that training a network to play against a diverse set of opponents was crucial to achieving superhuman performance. Unlike that previous work, which learned behaviours in pre-existing games designed by humans, our approach also requires us to build the world itself.

Building on multi-agent population-based approaches, we create training and evaluation schemes that intelligently sample populations based on our diversity and realism goals. This allows us to create training curricula with increasing complexity and realism, where we control the ratios of optimal, rare, and adversarial agents.
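As a rough illustration of what controlling these ratios could look like, the sketch below interpolates a population mixture across curriculum stages and samples agent types from it. The specific mixtures, the linear schedule, and the names `curriculum_ratios` and `sample_population` are assumptions made for this example, not a description of our actual sampling scheme.

```python
import random

AGENT_KINDS = ("optimal", "rare", "adversarial")


def curriculum_ratios(stage, num_stages):
    """Shift weight linearly from optimal agents towards rare and adversarial ones.

    The start and end mixtures are made-up illustrative numbers.
    """
    start = {"optimal": 0.90, "rare": 0.08, "adversarial": 0.02}
    end = {"optimal": 0.60, "rare": 0.25, "adversarial": 0.15}
    t = stage / max(num_stages - 1, 1)  # 0.0 at the first stage, 1.0 at the last
    return {kind: (1 - t) * start[kind] + t * end[kind] for kind in AGENT_KINDS}


def sample_population(ratios, num_agents, rng):
    """Draw an agent type for every slot in the world according to the ratios."""
    weights = [ratios[kind] for kind in AGENT_KINDS]
    return rng.choices(AGENT_KINDS, weights=weights, k=num_agents)


rng = random.Random(0)
for stage in range(3):
    ratios = curriculum_ratios(stage, num_stages=3)
    population = sample_population(ratios, num_agents=500, rng=rng)
    counts = {kind: population.count(kind) for kind in AGENT_KINDS}
    print(stage, counts)
```

Later stages place the learner among a larger share of rare and adversarial agents, which is the sense in which the curriculum increases in complexity.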

Image showing training and evaluation schemes that intelligently sample populations based on our diversity and realism goals

We train in 50 different worlds with over 500 agents per environment. We train cars and cyclists jointly by bundling agents’ policies into roughly 20 different neural networks in each world (the “Model Zoo”). Currently, we use non-adaptive but randomised pedestrian behaviours, though the RL approach should extend to pedestrian behaviour in future. We vary the agents’ dynamic, visual and reward characteristics to further boost the overall complexity of our large and rich simulations. To support the complexities of our multi-agent reinforcement learning approaches, we built a large-scale distributed computational infrastructure that allows us to train a diverse set of behaviours from scratch in less than one day.
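One way to picture a small set of networks driving a large number of agents is as a mapping from agents to policies, with agents that share a policy grouped together so that each network can run a single batched forward pass. The sketch below assumes uniform random assignment and uses hypothetical helper names (`assign_policies`, `group_by_policy`); the real assignment logic may well differ.

```python
import random
from collections import defaultdict

NUM_POLICIES = 20  # "roughly 20" networks per world, as in the text
NUM_AGENTS = 500   # the text says "over 500"; 500 is used here for illustration


def assign_policies(agent_ids, num_policies, rng):
    """Map every agent to one of the shared policy networks (uniform random here)."""
    return {agent_id: rng.randrange(num_policies) for agent_id in agent_ids}


def group_by_policy(assignment):
    """Group agent ids by policy so each network can process its agents as one batch."""
    batches = defaultdict(list)
    for agent_id, policy_id in assignment.items():
        batches[policy_id].append(agent_id)
    return dict(batches)


rng = random.Random(0)
assignment = assign_policies(range(NUM_AGENTS), NUM_POLICIES, rng)
batches = group_by_policy(assignment)
for policy_id in sorted(batches)[:3]:
    print(policy_id, len(batches[policy_id]))
```

With this layout, the number of forward passes per simulation step scales with the number of networks rather than the number of agents.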

This results in a significantly more interesting simulated world in which to drive, populated by a multitude of driving skills and styles that result in emergent complexity more representative of the real world than previously possible.

Diagram showing single networks controlling multiple agents


Given our ability to rapidly generate huge quantities of complex emergent behaviour, as described above, we can now use it to evaluate the quality of our driving intelligence models. So far, we have described a world where all agents are RL agents with privileged information about the world. For evaluation, we replace one of these agents with our end-to-end neural network. This agent drives around using non-privileged, fully simulated camera and other sensor inputs. The other surrounding agents remain powered by the diverse RL policies described above, giving a rich, challenging and reactive environment in which to test our driving intelligence.

The most interesting part is that we can now inexpensively sample a vast range of ‘initial conditions’ (starting points) for our evaluations. We can easily filter these initial conditions down to particular behavioural competencies we are interested in. For example, the following figure demonstrates the diversity of challenging evaluation cases we can thus generate, filtered down to the left or right turn competency. Each of these then becomes a starting point to evaluate our model.
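As a simplified sketch of this filtering step, suppose each sampled initial condition is stored as a small record tagged with the manoeuvre the evaluated vehicle is about to perform. The record fields, the manoeuvre labels, and the `filter_by_competency` helper below are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InitialCondition:
    """A hypothetical snapshot from which an evaluation episode can start."""
    world_id: int
    timestep: int
    upcoming_manoeuvre: str  # e.g. "left_turn", "right_turn", "straight"
    nearby_agents: int


def filter_by_competency(conditions, manoeuvres, min_nearby_agents=0):
    """Keep starting points that exercise the requested manoeuvres in enough traffic."""
    wanted = set(manoeuvres)
    return [
        c for c in conditions
        if c.upcoming_manoeuvre in wanted and c.nearby_agents >= min_nearby_agents
    ]


candidates = [
    InitialCondition(0, 120, "left_turn", 7),
    InitialCondition(0, 480, "straight", 12),
    InitialCondition(3, 95, "right_turn", 2),
    InitialCondition(7, 310, "left_turn", 1),
]
turns = filter_by_competency(
    candidates, {"left_turn", "right_turn"}, min_nearby_agents=2
)
print([(c.world_id, c.timestep) for c in turns])  # [(0, 120), (3, 95)]
```

Each surviving record would then seed one evaluation episode, with the model under test placed at that point in the simulated world.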

Diagrams showing a series of left and right turns

Future directions

We have barely scratched the surface of what is possible with this approach and are working hard to extend it further. To match the diversity and complexity of the real world, we will need to include buses pulling over to pick up passengers, double-parked delivery vehicles, car doors opening unexpectedly, dogs dashing across the road, delivery scooters, and more. Increasing realism by mixing in real-world behaviour datasets is another research direction we are pursuing. The beauty of the multi-agent reinforcement learning approach is that a single learning framework has the potential to scale efficiently to all these cases.

Beyond the application, we also have fundamental research underway to speed up the convergence of reinforcement learning algorithms, improve the stability and efficiency of training, and measure the realism and diversity of our agents.
