
Measuring Autonomy Performance, Part 2: Scenario-Based Evaluation At Scale


This is Part 2 of a series of posts, in which we go into detail about what it takes to build a scalable evaluation framework for autonomous driving. If you haven’t read Part 1, you can find it here.

Granular Scenario-Based Evaluation

In Part 1 we discussed why we care about granular, consistent and automated metrics, and how this can help us to drive incremental improvements in our overall autonomy performance. So let’s take a look at how we can define our metrics in more detail.

Within our Operational Design Domain (ODD) we have defined three main dimensions of variability:

  • Scenery - the static attributes of the world such as road layout, traffic rules and street decorations,
  • Environment - variable world attributes such as weather and lighting conditions,
  • Dynamic - other agents within the scene, such as other vehicles, pedestrians and cyclists.

Figure: table of operational design domain examples

These labels make intuitive sense when observing a fixed point in time or a static scene. But driving is a continuous activity, so we also need to apply these labels across all of our driving data as it unfolds over time.

Here’s what happens when we label every frame within a recorded driving log.

Figure: scenarios labelled over time
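
One way to turn per-frame labels into time segments is a simple run-length grouping. The sketch below is only an illustration: the `Segment` type, the label strings and the `(timestamp, label)` frame layout are all assumptions, not the format used in the actual pipeline.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Segment:
    label: str      # hypothetical scenario label, e.g. "left_turn"
    start_s: float  # timestamp of the first frame in the run
    end_s: float    # timestamp of the last frame in the run


def frames_to_segments(frames):
    """Collapse time-sorted (timestamp_s, label) frames into contiguous segments."""
    segments = []
    for label, run in groupby(frames, key=lambda frame: frame[1]):
        run = list(run)
        segments.append(Segment(label, run[0][0], run[-1][0]))
    return segments


# Hypothetical five-frame log sampled at 10 Hz.
frames = [(0.0, "straight"), (0.1, "straight"), (0.2, "left_turn"),
          (0.3, "left_turn"), (0.4, "straight")]
segments = frames_to_segments(frames)
```

Here the five frames collapse into three segments, and the two separate stretches of "straight" driving stay distinct because they are not adjacent in time.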

Now we have a pretty good idea of what exactly occurred within our scene, and when. The final step is to convert this into scores we can use to evaluate the quality of autonomous driving.

Figure: scenario labelled over time with interventions

In our autonomous driving trials, our Safety Drivers will intervene whenever the system is not behaving correctly. When we layer this intervention data on top of our scenario labels, we now get a clear picture of the state of the world at the point of failure (as well as prior- and post-failure).

If there is no intervention during the time segment, we consider this a successfully completed attempt at the scenario. Conversely, if an intervention falls within the time segment, we consider this a failure of that particular ODD item.
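
The pass/fail rule above can be sketched as a simple interval check. The tuple layout, the label names and the inclusive boundaries are assumptions made for illustration only.

```python
def attempt_succeeded(segment_start_s, segment_end_s, intervention_times_s):
    """A scenario attempt succeeds if no intervention falls inside its time window."""
    return not any(segment_start_s <= t <= segment_end_s
                   for t in intervention_times_s)


# Hypothetical log: three labelled segments and one intervention at t = 12.5 s.
segments = [("left_turn", 0.0, 8.0),
            ("roundabout", 10.0, 15.0),
            ("left_turn", 20.0, 27.0)]
interventions = [12.5]

outcomes = [(label, attempt_succeeded(start, end, interventions))
            for label, start, end in segments]
```

In this toy log only the roundabout segment contains the intervention, so it is the only attempt recorded as a failure.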

Figure: route with heatmap

Aggregating this data across many thousands of hours of driving, we can start to build a higher-level view of the performance of our autonomy, in a more robust and statistically meaningful way. For instance, on the route above, we know that we have a 95% success rate in left-turns at give-way intersections on single-lane roads.
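
As a rough sketch of this aggregation step, the snippet below counts successes per scenario and attaches a Wilson score interval to each rate. The Wilson interval is our choice for the illustration, not something the original framework is stated to use, and the scenario names are hypothetical.

```python
import math
from collections import defaultdict


def success_counts(outcomes):
    """Aggregate (scenario, succeeded) pairs into {scenario: (successes, attempts)}."""
    counts = defaultdict(lambda: [0, 0])
    for scenario, succeeded in outcomes:
        counts[scenario][0] += int(succeeded)
        counts[scenario][1] += 1
    return {scenario: tuple(c) for scenario, c in counts.items()}


def wilson_interval(successes, attempts, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if attempts == 0:
        return (0.0, 1.0)
    p = successes / attempts
    denom = 1 + z ** 2 / attempts
    centre = (p + z ** 2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z ** 2 / (4 * attempts ** 2)) / denom
    return (centre - half, centre + half)


# Hypothetical data: 95 successful left turns out of 100 attempts.
outcomes = [("left_turn", True)] * 95 + [("left_turn", False)] * 5
successes, attempts = success_counts(outcomes)["left_turn"]
low, high = wilson_interval(successes, attempts)
```

Reporting the interval alongside the headline rate makes it visible when a 95% figure rests on too few attempts to be trusted.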

Beyond fixed-route evaluation

In the above example we showed how we can use fixed route evaluation to turn our labelled data into an autonomy performance metric. However, limiting ourselves to predefined routes or geo-fenced areas is not entirely representative of the variety and complexity of real-world driving. It will also not give us enough data to prove our Safety Case, long-term. In addition, if you consider the requirements of a productionised fleet of autonomous vehicles, we will need to measure autonomy performance over any arbitrary driving data, regardless of geography or scenario. With this goal in mind, we need to set up our infrastructure to support this, at scale.

Figure: heatmap of arbitrary driving

Scaling up scenario and intervention labelling

The framework we described above is flexible and general enough to cater for changing definitions of our ODD. However, it all hinges on our ability to label the driving data accurately, and at scale, at the lowest level. At Wayve, we try to build things with very little generalisation debt where possible, so let’s take a deeper look at how we can produce scenario labels for millions of hours of driving.

Fully-automated scenario labelling


We can extract a lot of information about the roads and neighbourhoods that we drive in, from mapping providers such as OpenStreetMap, Mapbox, HERE Maps, Google Maps and others. This includes:

  • Lane information
  • Speed limits & restrictions
  • Traffic signs & signals
  • Navigation

Note: we do not rely on HD mapping data, which is much more costly to collect, update and manipulate across millions of kilometres of roads.


Similarly, weather data is readily available through third-party API providers such as OpenWeatherMap and Dark Sky. Lighting conditions can also be easily inferred (at least at a basic level) from time of day and weather data.
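
A basic lighting inference of this kind might look like the sketch below. The label names, the 30-minute twilight window and the 80% cloud-cover threshold are illustrative assumptions, and the sunrise, sunset and cloud-cover inputs are assumed to come from whichever provider is used.

```python
from datetime import datetime, timedelta


def lighting_label(timestamp, sunrise, sunset, cloud_cover_pct,
                   twilight=timedelta(minutes=30)):
    """Coarse lighting label from time of day and cloud cover (illustrative thresholds)."""
    if timestamp < sunrise - twilight or timestamp > sunset + twilight:
        return "night"
    if timestamp < sunrise + twilight or timestamp > sunset - twilight:
        return "twilight"
    return "overcast_daylight" if cloud_cover_pct >= 80 else "daylight"


# Hypothetical sunrise and sunset times for one day of driving.
sunrise = datetime(2020, 10, 12, 7, 20)
sunset = datetime(2020, 10, 12, 18, 15)
noon_label = lighting_label(datetime(2020, 10, 12, 12, 0), sunrise, sunset, 20)
dusk_label = lighting_label(datetime(2020, 10, 12, 18, 30), sunrise, sunset, 20)
```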


This is where automatic labelling starts to get more interesting. To detect other agents in the scene, such as drivers and pedestrians, we must rely on our own sensor data at the time of recording. From this data we need to extract the scenario labels described earlier. Luckily, we are already building a deep-learning based computer vision system which is capable of driving! We can re-purpose some parts of our perception stack to help us here.

Our perception system can detect up to 26 classes of scene elements, including:

  • Vehicles
  • Pedestrians
  • Traffic signals
  • Lane markings and so on...

Figure: perception based labelling


Interventions are inherently a human decision and action. In most cases, we can infer the reason for intervention based on a combination of vehicle telemetry and ODD labels at that time. For example, if we hit the brakes while coming up to a red traffic light, this is a clear indication that the AV did not correctly prepare to stop at the intersection. In these cases we can apply a rules-based algorithm to auto-label these interventions.

However, other cases may be less obvious, so we can also use the opportunity to learn from the expertise of our Safety Drivers, and add extra manual annotation to clarify the intent of the action.
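
A rules-based auto-labeller along these lines could be sketched as follows. Every field name, the reason string and the 30 m threshold are hypothetical. Returning `None` stands in for the less obvious cases that are routed to manual annotation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterventionContext:
    # Hypothetical telemetry and ODD labels captured at the moment of intervention.
    brake_pressed: bool
    steering_override: bool
    traffic_light_state: Optional[str]        # "red", "amber", "green" or None
    distance_to_stop_line_m: Optional[float]  # None if no stop line is nearby


def auto_label(ctx: InterventionContext) -> Optional[str]:
    """Return an inferred intervention reason, or None to request manual annotation."""
    if (ctx.brake_pressed
            and ctx.traffic_light_state == "red"
            and ctx.distance_to_stop_line_m is not None
            and ctx.distance_to_stop_line_m < 30.0):  # illustrative threshold
        return "failed_to_stop_at_red_traffic_light"
    return None


red_light = auto_label(InterventionContext(True, False, "red", 12.0))
unclear = auto_label(InterventionContext(False, True, None, None))
```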

Accuracy & Robustness

We have essentially reduced our evaluation task to a machine learning problem of computer vision / object classification. However, since we are using the results of this classification to evaluate the overall performance of our driving, we must be pretty confident in its accuracy, i.e. we wish to minimise the possibility of false positives and false negatives. One way to reduce the uncertainty in our measurements is to try and combine multiple sources of information.

Traffic light example:

  • Third-party mapping providers may give us the location of traffic lights in geo-spatial coordinates; combined with the GPS position of the vehicle, this lets us estimate proximity
  • Camera data + perception labels may yield a similar proximity estimate
  • Intervention annotation may include a “reason = failed to stop at red traffic light”

These data-points all converge on the idea that there was indeed a traffic signal in the scene, which applies to our lane, and our autonomy failed to handle the situation correctly.
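
To illustrate how these sources might be combined, the sketch below counts how many of the three agree. The haversine formula is standard, but the 50 m radius, the function names and the simple vote count are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def traffic_light_votes(vehicle_pos, mapped_light_pos, perception_detected,
                        annotation_reason, max_distance_m=50.0):
    """Count how many independent sources agree that a traffic light was involved."""
    votes = 0
    if haversine_m(*vehicle_pos, *mapped_light_pos) <= max_distance_m:
        votes += 1  # map + GPS proximity
    if perception_detected:
        votes += 1  # camera + perception labels
    if annotation_reason is not None and "traffic light" in annotation_reason:
        votes += 1  # Safety Driver annotation
    return votes


# Hypothetical coordinates roughly 22 m apart.
votes = traffic_light_votes((51.5300, -0.1230), (51.5302, -0.1230), True,
                            "failed to stop at red traffic light")
```

A downstream rule could then require, say, at least two agreeing sources before a failure is attributed to the traffic-light scenario.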

In addition, like any well-engineered data-pipeline, we aim to clean and filter the input data as much as possible before using it for analytics. Ultimately, for any model to make it to one of our leaderboards, we require a minimum number of autonomous driving kilometres, and a minimum number of attempts per scenario, in order to try and reduce the effects of noise in the measurements.
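
A minimal eligibility check of this kind is sketched below. The threshold values are placeholders, since the actual minimums are not stated here, and the function name is hypothetical.

```python
def leaderboard_eligibility(autonomous_km, attempts_per_scenario,
                            min_km=1000.0, min_attempts=30):
    """Return (eligible, under_sampled_scenarios) for one model's evaluation data."""
    under_sampled = sorted(
        scenario for scenario, attempts in attempts_per_scenario.items()
        if attempts < min_attempts
    )
    eligible = autonomous_km >= min_km and not under_sampled
    return eligible, under_sampled


# Hypothetical model with plenty of distance but too few roundabout attempts.
eligible, missing = leaderboard_eligibility(
    1500.0, {"left_turn": 120, "roundabout": 12})
```

Returning the under-sampled scenarios, rather than just a boolean, shows where more driving data is needed.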

Dealing with a moving target

In Part 1 we discussed how influential the role of ImageNet and standardised measurement has been in the field of machine learning. What we are aiming to do is build a similar measurement framework for autonomous driving. One key difference, however, is the fact that we are not operating in a static domain.

The world we operate in is constantly changing, so we must ensure that our performance metrics reflect this.

For reference, consider even simple road maintenance tasks. In 2019/20, 1.48 million potholes were filled in across the UK. For the vast majority of these, temporary lane closures and traffic diversions would have been put in place. Similarly, the recent COVID-19 response from local authorities resulted in rapid expansion of bike lanes, pedestrian footpaths and one-way systems across the whole country.

This constant state of flux is why we rely on statistical sampling to converge on an accurate score within each dimension of our ODD, rather than using a fixed dataset. Using Sequential Testing, we can try to minimise the time it takes to evaluate a feature of our system.
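
One classic form of sequential testing is Wald's sequential probability ratio test (SPRT), sketched below for a stream of pass/fail scenario attempts. This is offered as an illustration of the general idea, not necessarily the variant used in practice. The hypothesised success rates `p0` and `p1` and the error rates `alpha` and `beta` are assumed values.

```python
import math


def sprt(outcomes, p0=0.90, p1=0.99, alpha=0.05, beta=0.05):
    """Wald's SPRT on pass/fail outcomes.

    H0: the true success rate is p0; H1: the true success rate is p1.
    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1 - beta) / alpha)  # reaching this accepts H1
    lower = math.log(beta / (1 - alpha))  # reaching this accepts H0
    llr = 0.0                             # running log-likelihood ratio
    n = 0
    for n, success in enumerate(outcomes, start=1):
        if success:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n  # not enough evidence yet; keep sampling


decision, samples_used = sprt([True] * 100)
```

The test stops as soon as the evidence is strong enough in either direction, so a clearly good or clearly bad feature needs far fewer attempts than a fixed-size sample would.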

Figure: Sequential Testing

HORIBA MIRA, Test Methods for Interrogating Autonomous Vehicle Behaviour – Findings from the HumanDrive Project, 2019, 21

Through evaluative autonomous driving, we will be able to gather enough data points against each scenario in our ODD, in order to confidently prove that the system is safe and robust to change. Each scenario will require a different amount of data, depending on how frequently it is likely to be encountered on the road. We can also use this framework to measure performance in entirely new environments in the same way as against known routes, with very little prior knowledge required.

Operationalising these metrics

How to actually collect enough driving data on the road to achieve the desired number of samples is the topic of Part 3 in this series.

If you have any questions or comments about our approach to autonomy performance measurement, connect with us on Twitter or LinkedIn. Also, we're hiring!
