Measuring Autonomy Performance, Part 4: Introducing Driver Score 2.0
This is Part 4 of a series of posts, in which we dig into the details of how we’ve progressed our autonomy metrics with the latest in data science. Specifically, we explore how we’re using machine learning to accelerate the speed and robustness of the data insights we are gathering from every kilometre of real-world testing. Having better performance metrics allows our R&D teams to iterate quickly and conduct more focused experiments. By applying the latest methodologies in data science, we can learn more from a smaller sample of real-world driving test runs, which enables us to train and test models more quickly.
Accelerating our autonomy performance with data science
Data Science at Wayve faces a complex and fascinating opportunity to pioneer new methodologies for measuring the intelligence of our autonomy system. Our aim is to implement cutting-edge methods that enable our machine learning (ML) research and engineering teams to learn faster and gather more robust insight from our real-world testing. Doing so enables us to create new metrics, like Driver Score 2.0, which captures the complexity of urban driving and allows us to compare model performance across a diverse array of driving scenarios.
Applying the latest data science methods like MLRATE (Meta, 2022) enables us to iterate faster because we can halve the amount of real-world driving required to achieve statistically significant results robustly.
What makes measuring driving performance so difficult?
One of the challenges when comparing differences in driving performance between two models is that driving conditions vary massively day-to-day and minute-to-minute. With each test run, models encounter vastly different variables such as weather, traffic conditions, roadworks and encounters with other road users, pedestrians and cyclists. No two test runs are the same, even on the same route.
That’s why the standardised industry metric, disengagements per 1,000 kilometres driven, provides limited value in assessing the capabilities of an autonomous driving system. This metric does not factor in the difficulty of the kilometres driven and can be gamed by driving on simpler routes. It also doesn’t give our R&D teams the level of insight required to know whether the ‘driving intelligence’ of our autonomous system has improved its learned driving behaviours. To train models to drive in complex urban environments like London, we need robust metrics that we can gather quickly to assess performance in very specific driving scenarios, such as making an unprotected right turn in a diverse array of traffic conditions.
How do you create a better metric that fairly tracks the progress of our driving models?
To create a more insightful measure of autonomy performance, our Data Science team set out to achieve four goals with Driver Score 2.0:
- To make more robust conclusions when comparing models
- To reach those conclusions faster based on a smaller sample of real-world driving data
- To tune model comparisons to better suit different driving scenarios such as ‘typical London driving’
- To extract as much insight as possible from every real-world kilometre driven
When developing models, a key bottleneck that many AV companies face is that it takes a long time to collect enough real-world testing data for robust performance measurement, which can slow down iteration speed. To learn faster, Wayve has applied the latest machine learning techniques to gather richer insight from a smaller sample of testing data. Achieving this enables our R&D teams to iterate quickly and conduct more focused experiments. This is a key lever for improving autonomy performance efficiently.
Moreover, since there is never an identical distribution of real-world driving conditions across all experiments, we needed to find a solution that could account for the complexity of urban driving while reducing variability between runs. Rather than treat variable driving conditions as noise that averages out when experimenting, we instead use the signal to learn from every interaction.
Some context: simple A/B tests can be highly inefficient for measuring driving performance
Randomised controlled trials (RCTs) are traditionally the gold standard for measuring improvements. Data scientists often use RCTs to test machine learning (ML) models online, typically by randomly splitting a user base into a control group (users treated with the old ML model) and a variant group (users treated with the new ML model). They then observe the difference in the means of a specific metric between the two groups (an A/B test, deriving the ‘difference-in-means estimator’).
These A/B experiments often capture millions of data points for this process. To ensure a representative sample for testing, large scale randomisation is important. Randomisation also facilitates fairer comparisons and gives greater statistical confidence that a measured improvement is genuine rather than just noise.
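To make the ‘difference-in-means estimator’ concrete, here is a minimal sketch in Python. The per-run scores are synthetic, and the function name `difference_in_means` is our own choice for illustration, not part of any Wayve tooling. It reports the estimate with an approximate 95% confidence interval based on a normal approximation.

```python
import math
import random
import statistics


def difference_in_means(control, variant):
    """Return (estimate, standard_error) for mean(variant) - mean(control)."""
    estimate = statistics.mean(variant) - statistics.mean(control)
    # Standard error for two independent samples with unequal variances.
    standard_error = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(variant) / len(variant)
    )
    return estimate, standard_error


# Synthetic per-run scores (hypothetical): the variant is truly 0.5 better,
# but each individual run is very noisy.
rng = random.Random(0)
control_scores = [rng.gauss(10.0, 4.0) for _ in range(200)]
variant_scores = [rng.gauss(10.5, 4.0) for _ in range(200)]

estimate, standard_error = difference_in_means(control_scores, variant_scores)
lower = estimate - 1.96 * standard_error
upper = estimate + 1.96 * standard_error
print(f"estimate={estimate:.2f}, approx. 95% CI=({lower:.2f}, {upper:.2f})")
```

With noisy runs and a few hundred samples, the interval is typically wide relative to the true effect, which is why such tests usually rely on very large samples.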
But unfortunately, large-scale randomisation doesn’t work for real world driving.
If we tested on randomly selected routes, our data sample would contain huge differences in the scenery that a driving model has to navigate, from traffic islands to construction zones to one-way traffic. Without the benefit of a huge sample size, this variance would leave us uncertain in our conclusions. With real-world testing, it’s not feasible to collect that much driving data for every new model, especially when our researchers and engineers need to iterate quickly.
Even testing two driving models on the same route (i.e. paired tests) doesn’t solve this because no two runs of a route will ever be the same. For example, we cannot control for dynamic agents such as cars or cyclists, or even scenarios where one model has seen only green traffic lights while the other has seen only red traffic lights.
Moreover, restricting to specific routes could mean overfitting the data to these specific routes, which prevents us from building models that can drive anywhere. Instead, we need to be able to compare models that have driven on a multitude of routes to gauge how well our cars can perform in areas they have never driven in before.
If A/B tests are insufficient, then which variance reduction techniques can we use to attain statistical significance faster from real-world driving data?
Data scientists often employ variance-reduction techniques to improve the detection of genuine improvements that would be too small to parse with statistical confidence at a lower sample size.
For example, you might find that drivers perform better at roundabouts when the traffic density is low and worse at higher traffic densities. So, if you measured roundabout performance across all traffic densities, you would see a larger variance in performance. This would make you less sure whether an observed performance improvement is genuine or due to chance. But if we expect drivers to perform better when traffic density is low, rather than treat the variance as noise, we should incorporate this pre-existing knowledge into our experiment to reach statistical significance more quickly.
[Images: roundabout with no traffic; roundabout with high traffic density; stopping for a truck; yielding to a car]
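The roundabout example can be illustrated with a small sketch. The scores below are synthetic and the numbers are assumptions chosen only to show the effect: pooling runs across traffic densities gives a much larger variance than looking within each density level.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical roundabout scores: higher in low traffic, lower in high traffic.
low_traffic_scores = [rng.gauss(8.0, 1.0) for _ in range(500)]
high_traffic_scores = [rng.gauss(5.0, 1.0) for _ in range(500)]

# Variance when traffic density is ignored (all runs pooled together).
pooled_variance = statistics.variance(low_traffic_scores + high_traffic_scores)

# Average variance once we condition on traffic density.
within_variance = (
    statistics.variance(low_traffic_scores)
    + statistics.variance(high_traffic_scores)
) / 2

print(f"pooled variance: {pooled_variance:.2f}")
print(f"within-density variance: {within_variance:.2f}")
```

The gap between the two figures is the part of the variance that is explained by traffic density, and it is this explainable part that variance-reduction techniques aim to remove.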
A common variance-reduction method is Controlled-experiment Using Pre-Experiment Data (CUPED, Microsoft 2013), where some linear covariates (such as traffic density) are used to adjust the simple difference-in-means estimator. CUPED is highly effective at variance reduction for a few variables that are linearly related to the metric.
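As a rough illustration of a CUPED-style adjustment with a single linear covariate, the sketch below subtracts the part of each score that is linearly explained by traffic density. The data are synthetic, and `cuped_adjust` is a name we introduce for illustration. The coefficient is estimated on the pooled data and the adjusted scores keep the same overall mean.

```python
import random
import statistics


def cuped_adjust(outcomes, covariate):
    """Remove the part of each outcome that is linearly explained by the covariate."""
    mean_x = statistics.mean(covariate)
    mean_y = statistics.mean(outcomes)
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, outcomes)
    ) / (len(outcomes) - 1)
    theta = covariance / statistics.variance(covariate)
    # Centring the covariate keeps the overall mean of the outcomes unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, outcomes)]


rng = random.Random(2)
n = 400
treated = [i % 2 for i in range(n)]                # alternate control / variant
density = [rng.uniform(0.0, 1.0) for _ in range(n)]  # hypothetical traffic density
score = [
    10.0 - 3.0 * d + 0.3 * t + rng.gauss(0.0, 0.5)
    for d, t in zip(density, treated)
]

adjusted = cuped_adjust(score, density)
print(f"variance before: {statistics.variance(score):.2f}")
print(f"variance after:  {statistics.variance(adjusted):.2f}")
```

The difference in means between variant and control can then be computed on the adjusted scores instead of the raw ones, with the same expected value but a narrower confidence interval.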
In the graphic below, the three plots on the left-hand side show how it can be hard to achieve statistical significance by simply looking at the difference in the unconditional means of two groups (the difference along the X axis) when the variance of each group is large. The histograms are wide, and many of the points between the variant and control overlap, so we are less sure if there is a genuine difference in the distribution of means, or whether it is just noise from sampling. But in the graphs on the right-hand side, we can control for a linearly-related covariate in the modelling. This reduces the variance in measurement. In other words, the dispersion around the conditioned mean is smaller so statistical power is higher.
But we have found that the CUPED method doesn’t work well when you are adjusting for many covariates with complex non-linear relationships, such as those encountered in driving.
When measuring driving performance, we need to incorporate many confounding variables into our testing such as static scenery differences (e.g. bus lanes, pedestrian crossings), dynamic agents (e.g. traffic density, presence of cyclists), environmental factors (e.g. weather, luminosity) and even human biases from drivers, which often have complex non-linear interactions between them too. Building a more complex machine learning model would provide a more sophisticated proxy to facilitate controls for these covariates.
MLRATE allows us to reduce variance using ML models in a robust way
Using a machine-learning-adjusted treatment effect estimator (MLRATE, Meta 2022), we can exploit the complex non-linear relationships that machine learning models can learn between such confounding variables and performance. MLRATE allows us to implement this in a robust way using a generalised linear model. This means we can achieve statistical significance with a smaller sample. At Wayve, we’ve been able to use MLRATE to cut our sample size in half while still achieving statistical significance. This is huge because it has created efficiencies in our experimentation workload and halved the length of testing time needed to get to robust insight.
MLRATE follows two steps to reduce variance using ML models. The first step involves building a machine learning model to predict our performance metric using relevant covariates. The second step uses these predictions as a control variable in a generalised linear model.
Step 1: Train and calibrate an ML model
To control for all desired covariates, we first build a machine learning model to predict our performance metric using relevant covariates.
We train an artificial neural network (ANN) on a balanced dataset using all of the features we are interested in controlling for, such as scenery, dynamic agents and environmental factors. ANNs tend to be overconfident in their predictions, so we use isotonic regression to ensure our performance predictions are linearly related to actual performance. This calibration becomes important in step 2.
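For readers unfamiliar with isotonic regression, the following is a minimal sketch of the idea using the pool-adjacent-violators algorithm in plain Python. It is not our production calibration code. The function names `fit_isotonic` and `calibrate` are hypothetical, and in practice a library implementation would normally be used.

```python
import bisect


def fit_isotonic(raw_predictions, outcomes):
    """Fit a non-decreasing step function from raw predictions to outcomes.

    Returns (thresholds, values): raw predictions up to thresholds[i] map to values[i].
    """
    # Pool samples that share the same raw prediction.
    grouped = {}
    for x, y in zip(raw_predictions, outcomes):
        total, count = grouped.get(x, (0.0, 0))
        grouped[x] = (total + y, count + 1)

    blocks = []  # each block is [sum_of_outcomes, count, largest_raw_prediction]
    for x in sorted(grouped):
        total, count = grouped[x]
        blocks.append([total, count, x])
        # Merge backwards while neighbouring block means break monotonicity.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            last_total, last_count, last_upper = blocks.pop()
            blocks[-1][0] += last_total
            blocks[-1][1] += last_count
            blocks[-1][2] = last_upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(raw_prediction, thresholds, values):
    """Map a raw prediction to its calibrated value."""
    index = bisect.bisect_left(thresholds, raw_prediction)
    return values[min(index, len(values) - 1)]


# Tiny example: the middle two points are out of order, so they are pooled.
thresholds, values = fit_isotonic([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
print(thresholds, values)
```

The fitted mapping is monotone, so a higher raw prediction never yields a lower calibrated prediction, while the calibrated values match the average observed outcome within each block.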
In practice, we randomly split our training data into two datasets. We train and calibrate two performance-prediction models with the same architecture, one on each dataset. Then we predict performance for each sample in the dataset using the model it was not trained on. This method of ‘cross-fitting’ using out-of-sample prediction is important to avoid attenuation biases that could result from over-fitting if in-sample predictions were used.
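The cross-fitting procedure can be sketched as follows. To keep the example self-contained, a very simple stand-in model (the mean outcome per feature bucket) replaces the neural network. The names `fit_bucket_mean_model` and `cross_fit_predictions` are hypothetical and the data are synthetic.

```python
import random
from collections import defaultdict


def fit_bucket_mean_model(features, outcomes):
    """Stand-in for the neural network: predict the mean training outcome of
    each feature bucket, falling back to the overall mean for unseen buckets."""
    totals = defaultdict(lambda: [0.0, 0])
    for feature, outcome in zip(features, outcomes):
        totals[feature][0] += outcome
        totals[feature][1] += 1
    overall_mean = sum(outcomes) / len(outcomes)
    means = {feature: total / count for feature, (total, count) in totals.items()}
    return lambda feature: means.get(feature, overall_mean)


def cross_fit_predictions(features, outcomes, fit, seed=0):
    """Split the data in two at random, fit one model per half, and predict
    each sample with the model that did NOT see it during training."""
    indices = list(range(len(outcomes)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    folds = [indices[:half], indices[half:]]

    predictions = [None] * len(outcomes)
    for train, held_out in (folds, folds[::-1]):
        model = fit([features[i] for i in train], [outcomes[i] for i in train])
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions


# Synthetic runs: the feature is a traffic-density bucket (hypothetical).
rng = random.Random(4)
buckets = [rng.choice(["low", "high"]) for _ in range(200)]
scores = [(8.0 if b == "low" else 5.0) + rng.gauss(0.0, 1.0) for b in buckets]

out_of_sample = cross_fit_predictions(buckets, scores, fit_bucket_mean_model)
print(out_of_sample[:3])
```

Because every prediction comes from a model that never saw that sample, an over-fitted model cannot make its own training points look artificially well explained.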
Step 2: Estimate the ML-adjusted treatment effect using a generalised linear model
The second step involves running a generalised linear model, but instead of using x (a linear covariate) as in CUPED, we use g(X), the cross-fitted prediction from the ML model in Step 1, as the control variable.
This regression step ensures our testing is robust even if the neural network prediction is poor. In fact, regardless of how bad the outcome predictions are from Step 1, MLRATE has an asymptotic variance no larger than the usual difference-in-means estimator. In other words, MLRATE benefits if the neural network predictions from the previous step are good. But it is also robust to increased variance if the neural network predictions are poor.
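A minimal sketch of this regression step is shown below, under stated assumptions. It follows our reading of the published MLRATE formulation, in which the outcome is regressed on an intercept, the treatment indicator T, the prediction g(X), and the interaction T * (g(X) - mean of g(X)), with the coefficient on T taken as the treatment effect. We use ordinary least squares (the simplest generalised linear model) solved in plain Python. The function names and data are hypothetical, and standard errors are omitted for brevity.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mlrate_effect(outcomes, treated, predictions):
    """Estimate the treatment effect as the coefficient on `treated` in
    outcome ~ 1 + treated + g + treated * (g - mean(g))."""
    g_bar = sum(predictions) / len(predictions)
    rows = [[1.0, t, g, t * (g - g_bar)] for t, g in zip(treated, predictions)]
    k = 4
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(k)]
    return solve_linear_system(xtx, xty)[1]


# Synthetic experiment: the true effect of the new model is 0.3.
rng = random.Random(3)
n_runs = 400
is_variant = [float(i % 2) for i in range(n_runs)]
difficulty = [rng.uniform(0.0, 1.0) for _ in range(n_runs)]  # hypothetical
g_of_x = [10.0 - 4.0 * d for d in difficulty]  # stand-in for cross-fitted g(X)
run_scores = [
    g + 0.3 * t + rng.gauss(0.0, 0.5) for g, t in zip(g_of_x, is_variant)
]

effect = mlrate_effect(run_scores, is_variant, g_of_x)
print(f"ML-adjusted effect estimate: {effect:.3f}")
```

In this sketch the predictions explain most of the run-to-run variation, so the adjusted estimate is far less noisy than a raw difference in means on the same data. If the predictions carried no information, the regression would simply assign them a coefficient near zero.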
Tailoring model comparisons using weights
At Wayve, we need to take the MLRATE methodology a step further. We tailor our model evaluation to enable comparisons that better suit the specific driving scenario attributes of the routes.
For example, our models may have been tested mainly on 20mph, high traffic density roads near Kings Cross in London. But on the typical routes used by our grocery pilot partners, we have more 30mph, multi-lane roads with lower traffic density. Therefore we need to normalise our metric by weighting our 30mph, multi-lane scenarios more highly when doing comparisons to better match the target distribution of our grocery delivery routes. We use a re-weighting framework to tailor model comparisons to better match a target distribution rather than treat each scenario as equally important.
[Images: typical London route (20mph, single lane roads); grocery routes (30mph, multi-lane roads)]
Our re-weighting framework was inspired by how decision scientists run generalised linear models on survey data. We create weights for every trial (analogous to design weights in survey data) using features we know are predictive of performance. Weighting is designed to ensure that the joint distribution of the features of interest matches that of our target distribution. Weights are cropped to reduce variance and normalised to ensure the degrees of freedom remain the same for the second regression step of MLRATE.
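The weighting idea can be sketched in a few lines. This is an illustrative simplification, not our internal framework. It uses a single scenario label per trial rather than a joint distribution of features, the cap `max_weight` is an arbitrary assumption, and we interpret ‘normalised’ as rescaling the weights so they sum to the number of trials.

```python
from collections import Counter


def scenario_weights(trial_scenarios, target_shares, max_weight=5.0):
    """Weight each trial by target share / observed share of its scenario,
    cap large weights, then rescale so the weights sum to the number of trials."""
    counts = Counter(trial_scenarios)
    n = len(trial_scenarios)
    raw = [
        min(target_shares.get(scenario, 0.0) / (counts[scenario] / n), max_weight)
        for scenario in trial_scenarios
    ]
    scale = n / sum(raw)
    return [weight * scale for weight in raw]


def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Testing was mostly on 20mph roads, but the target routes are half 30mph roads.
trials = ["20mph"] * 8 + ["30mph"] * 2
trial_scores = [1.0] * 8 + [0.0] * 2  # hypothetical per-trial scores
weights = scenario_weights(trials, {"20mph": 0.5, "30mph": 0.5})

print(f"unweighted mean: {sum(trial_scores) / len(trial_scores):.2f}")
print(f"re-weighted mean: {weighted_mean(trial_scores, weights):.2f}")
```

To weight on several features at once, the scenario label could be a tuple such as (speed limit, lane count, traffic density bucket), at the cost of sparser strata and therefore more reliance on the weight cap.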
Automating this analysis to accelerate our learning
We have automated all of this analysis in an internal tool at Wayve, shown in the image below. Our ML researchers can provide specific filters and select the routes they wish to tailor for, before automatically training our neural networks and running the generalised linear model on-the-fly. This data science tool produces detailed reports in minutes, which facilitates rapid and robust comparisons between any two models, enabling our teams to streamline their efforts and ensure that we are deriving insights from every kilometre driven.
Benefits of Driver Score 2.0
By normalising our model performance metric, we now have a common "yardstick" that factors out differences between easier or harder routes when comparing model performance. The new re-weighting framework provides a more accurate and insightful method for tailoring model comparisons to different driving domains and tracking baseline model performance improvement over time against ‘typical London driving.’
Moreover, we’re able to gather robust insights faster because we only need half the amount of real-world driving to achieve statistically significant results. These improvements enable us to iterate faster by conducting more focused experiments in less time, which accelerates our efforts to improve performance of AV2.0.
Thank you for following along. We hope this detailed discussion gives you a better idea of how we are applying the latest thinking in data science to accelerate our autonomy performance. The unique complexity of driving presents really interesting challenges for data scientists to solve. If these concepts interest you and you would like to work on these problems with us, we have a number of open roles at Wayve.