Measuring Autonomy Performance, Part 3: Evaluating Autonomy On The Road
In the final part of this blog series, we will discuss how we can take the metrics framework we have defined previously and generate meaningful data from real-world testing through our autonomous driving operations. If you are new, here are Part 1 and Part 2 of the series.
Evaluating performance before deployment
Whenever we deploy new models and software on the road, our first priority is safety. Deploying a bad model to the vehicle endangers not only our Safety Drivers but potentially other road users as well. This is why we always run our models through a rigorous Driver Licensing pipeline to eliminate as much of this risk as possible.
Deploying a poorly performing model on the road is also expensive: it wastes our Safety Drivers' time and occupies valuable vehicle time that could otherwise be spent collecting data and evaluating other models.
There are several different types of evaluation that we perform before issuing a license:
- Offline evaluation
- Deterministic simulation scenarios
- Large-scale randomised simulation scenarios
- Re-simulation and regression testing
Offline evaluation is the first validation data point we can get after training has finished. It involves analysing our models against a held-out test dataset unseen during training to measure both perception and motion planning performance metrics.
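As a rough illustration, offline evaluation of the planning side might look like the sketch below. The function names, the `(observation, expert_trajectory)` data layout, and the displacement-error metric are all hypothetical stand-ins, not Wayve's actual API; perception metrics would be scored analogously against labelled ground truth.

```python
from statistics import mean

def mean_displacement_error(predicted, expert):
    """Average absolute offset between predicted and expert trajectory points."""
    return mean(abs(p - e) for p, e in zip(predicted, expert))

def offline_evaluate(model_fn, test_set):
    """Score a trained model on a held-out test set unseen during training.

    `test_set` is a list of (observation, expert_trajectory) pairs and
    `model_fn` maps an observation to a predicted trajectory (illustrative).
    """
    errors = [
        mean_displacement_error(model_fn(observation), expert_trajectory)
        for observation, expert_trajectory in test_set
    ]
    return {"mean_displacement_error": mean(errors)}
```

Because this runs against recorded data with no simulator or vehicle in the loop, it is cheap enough to run on every training job.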
Deterministic simulation scenarios are a fixed set of simulated worlds which can be used by our Driving Intelligence teams to rapidly iterate on driving policies in a virtual environment. Creating a deterministic gym of scenarios allows us to directly compare ideas and models. Simulation is the fastest way we can test closed-loop driving.
Large-scale randomised tests are used to build quantitative understanding of closed-loop model performance, by modulating various parameters in the simulated world and running thousands of simulations per scenario. We aim to cover all static, dynamic and environmental conditions to match our Operational Design Domain. This is enabled by our fully procedural approach to virtual world generation.
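A minimal sketch of this kind of procedural scenario sampling is shown below. The parameter space here is invented for illustration (the real ODD has far more dimensions); the key properties are that parameters are drawn randomly at scale, and that a seeded batch is reproducible so two candidate models can be compared on identical worlds.

```python
import random

# Illustrative parameter space only; the real ODD covers many more
# static, dynamic and environmental dimensions.
SCENARIO_SPACE = {
    "weather": ["clear", "rain", "fog"],
    "time_of_day": ["day", "dusk", "night"],
    "traffic_density": (0.0, 1.0),    # fraction of lane capacity occupied
    "pedestrian_count": (0, 20),
}

def sample_scenario(rng):
    """Draw one randomised world configuration from the parameter space."""
    return {
        "weather": rng.choice(SCENARIO_SPACE["weather"]),
        "time_of_day": rng.choice(SCENARIO_SPACE["time_of_day"]),
        "traffic_density": rng.uniform(*SCENARIO_SPACE["traffic_density"]),
        "pedestrian_count": rng.randint(*SCENARIO_SPACE["pedestrian_count"]),
    }

def generate_batch(seed, n):
    """Seeded batches are reproducible, so two models see identical worlds."""
    rng = random.Random(seed)
    return [sample_scenario(rng) for _ in range(n)]
```

Running thousands of such configurations per scenario is what turns closed-loop simulation results into quantitative coverage of the ODD rather than anecdotes.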
Re-simulation and regression tests come from our catalogue of real-world failure cases. We can process and re-simulate these recorded driving logs to reproduce the real-world failure in simulation. Once we can re-simulate the failure correctly, we can add this case to our Regression Suite, to verify that no new model will make the same mistake again.
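Conceptually, the Regression Suite then reduces to a gate like the sketch below. The `resimulate` callable and the case/outcome dictionaries are hypothetical placeholders for the real re-simulation pipeline; the point is simply that every catalogued failure must pass before a new model is licensed.

```python
# Hypothetical regression gate: each catalogued real-world failure is
# re-simulated in closed loop, and the candidate policy must not repeat it.
def run_regression_suite(resimulate, policy, failure_cases):
    """Return the ids of catalogued failures the policy still reproduces.

    `resimulate(case, policy)` is assumed to replay the recorded driving log
    against the candidate policy and report {"passed": bool}.
    """
    regressions = []
    for case in failure_cases:
        outcome = resimulate(case, policy)
        if not outcome["passed"]:
            regressions.append(case["id"])
    return regressions
```

An empty return value means the candidate has cleared every historical failure; a non-empty one blocks the license and names exactly which logged incidents regressed.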
Autonomy Feature Tests
Once we do get out on the road, the first type of testing that we perform is what we call an Autonomy Feature Test (AFT). This type of testing is designed to aid our ML Engineers in the development of a new feature.
In this case a “feature” may be anything from a new sensor input to a model architecture change or a new training dataset. These tests typically cover only a short total driving distance and focus on verifying behaviour within a heavily constrained operational domain, to minimise the influence of external factors as much as possible.
AFTs are a necessary first step in risk management for on-road deployment. We aim to answer a few simple questions:
Qualitatively, does the model behaviour match what we saw in simulation?
Before we commit to many hours of quantitative evaluation across a wider domain of scenarios, we should validate that the overall driving behaviour does not exhibit any unexpected side-effects, e.g. drifting in the lane or erratic steering. These tests also often uncover new edge cases and failure scenarios that we have not yet added to our simulations.
Is our AV Operations team happy to continue with further testing?
Do they deem it to be safe and beneficial to continue real-world evaluation at scale?
Evaluation Set Tests
Once a model has passed the first two stages, we can then license the model for a full quantitative evaluation. This will be a combination of:
- Testing over predefined routes,
- Testing over arbitrary driving exploration,
- Targeted testing to actively evaluate specific scenarios or various weather and lighting conditions.
The goal of this real-world evaluation is to:
- Gather enough driving data for meaningful statistical inferences when comparing models
- Test over a breadth of scenarios to gain a full view of performance across our Operational Design Domain (ODD)
- Target edge-cases for learning and identify failures where necessary
Since we already know what data our performance metrics require, we can direct our Safety Drivers to iteratively gather data where necessary until we hit the required sample sizes.
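One hedged sketch of how "required sample sizes" might be planned: a standard normal-approximation bound on estimating a per-scenario success rate, plus a helper that reports which ODD segments still need more driving. The segment names, the target precision, and the exact statistical model are illustrative assumptions, not Wayve's actual methodology.

```python
import math
from collections import Counter

def required_samples(p, margin, z=1.96):
    """Samples needed to estimate a success rate near p to within
    +/- margin at ~95% confidence (normal approximation to the binomial)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def collection_gaps(counts, target):
    """Which ODD segments still need more samples, and how many more.

    `counts` maps an (illustrative) ODD segment label to the number of
    scenario samples collected so far.
    """
    return {seg: target - n for seg, n in counts.items() if n < target}
```

For example, estimating a ~90% success rate to within five percentage points needs on the order of 140 samples per segment, so drivers can be routed toward whichever segments the gap report flags.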
This is where our ODD-level metrics really let us target edge cases and failure modes. If we only had a per-route or overall metric such as distance per intervention, then deliberately seeking out failure modes would make our metrics suffer, creating a management incentive to avoid exactly the testing we need most.
Because we operate with per-ODD metrics, we can distinguish metric changes due to:
- Improving performance
- Expanding ODD
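The distinction above can be made concrete with a small aggregation sketch. The record layout and segment labels are hypothetical; the idea is simply that distance per intervention is computed per ODD segment rather than pooled.

```python
from collections import defaultdict

def distance_per_intervention_by_odd(drives):
    """Aggregate distance per intervention separately for each ODD segment.

    `drives` is a list of (odd_segment, distance_km, interventions) records;
    the segment labels are illustrative placeholders.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for segment, km, interventions in drives:
        totals[segment][0] += km
        totals[segment][1] += interventions
    return {
        seg: (km / n if n else float("inf"))
        for seg, (km, n) in totals.items()
    }
```

With this view, a drop in the pooled distance-per-intervention number after a week of deliberately driving in heavy rain reads correctly as ODD expansion, because the per-segment figures for the existing conditions stay flat while a new, harder segment appears.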
A model may spend weeks or even months in this phase. It is ultimately this Evaluation Set data that will help us populate our metrics, track progress on our graphs of autonomy, and conduct experiments with scientific rigour.
This is the final post in our series on Measuring Autonomy Performance. You should hopefully now have a much better idea of how Wayve is thinking about the problem of measuring performance and safety for autonomous driving. However, the concepts and implementation details here are always evolving as we learn more about what it takes to build this technology. Look out for further blog posts from us, with deep dives on simulation, data engineering and how we are setting ourselves up for scale.
As always, please come chat to us and let us know if you have any feedback. We’d love to hear from you.