12 October 2020  |  Engineering

Measuring autonomy performance. Part 1: going from zero to one

This is part one of a series of posts in which we go into detail about what it takes to build a scalable evaluation framework for autonomous driving.

Why do we care about measuring autonomy performance?

The field of machine learning in computer vision has been building models since at least the early 1960s. However, it’s only in the last decade or so that we have started to see many breakthroughs in performance and capability. There are many contributing factors here, but one that is often overlooked is the role of standardised benchmarking.

Enter ImageNet in 2009. This project has been pivotal in the deep learning breakthroughs of recent years. It’s incredible that in the span of 10 years we have gone from roughly 50% top-1 accuracy to 88.5% and climbing.

“That which is measured, improves”

Chart: Classification Accuracy Leaderboard on ImageNet

ImageNet for Autonomous Driving

Wayve’s philosophy is that the autonomous driving problem is inherently a machine learning problem, and we want to transition away from the traditional rules-based approaches of the past to a more scalable learning-based approach. To achieve this, we need our own ImageNet-equivalent benchmark.

State of the industry

The industry standard metric for measuring progress in autonomous driving is the California DMV’s “miles per intervention” metric. Many companies will also track internal scores by measuring interventions over fixed routes, which can be repeated daily. We believe this is nowhere near sufficient for the following reasons:

  • Not all roads and cities are of equal complexity. An urban driving environment will naturally lead to a higher rate of intervention as opposed to an empty highway.
  • Not all interventions are of equal severity. A slight adjustment in trajectory through a turn may not have the same consequences as hitting the brakes to prevent a collision.
  • A single high-level metric such as miles per intervention does not give us sufficient confidence for a safety argument, given the diversity and complexity of driving tasks.
  • A single metric such as “number of interventions” does not give us enough feedback internally on where we should spend our time in order to improve the performance of the system.
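To make the first point concrete, here is a minimal sketch of how one aggregate figure can hide very different behaviour per environment. The `Segment` schema and all of the numbers are invented for illustration; they are not Wayve data or Wayve tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One logged stretch of driving (hypothetical schema)."""
    environment: str      # e.g. "urban" or "highway"
    miles: float
    interventions: int


def miles_per_intervention(segments):
    """Aggregate miles per intervention; infinite if there were none."""
    miles = sum(s.miles for s in segments)
    interventions = sum(s.interventions for s in segments)
    return miles / interventions if interventions else float("inf")


def by_environment(segments):
    """The same metric, computed separately for each environment."""
    groups = defaultdict(list)
    for s in segments:
        groups[s.environment].append(s)
    return {env: miles_per_intervention(group) for env, group in groups.items()}


# Made-up numbers, for illustration only.
log = [
    Segment("highway", miles=90.0, interventions=1),
    Segment("urban", miles=10.0, interventions=4),
]
overall = miles_per_intervention(log)   # 100 miles / 5 interventions
per_env = by_environment(log)           # one figure per environment
```

With these numbers the overall figure is 20 miles per intervention, while the urban figure is 2.5 and the highway figure is 90. A fleet could improve its headline number simply by driving more highway miles.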

These metrics have long been criticised — for example, see posts by Waymo, Aurora, Cruise and Oxbotica — but no solution has been presented.

How does Wayve think about this problem?

We believe that our performance metrics:

  1. Must be granular
  2. Must be consistent, yet flexible
  3. Must be measured automatically in order to scale

Granular

We want to be able to answer questions such as “under which conditions does our system fail most often?” This gives us a clear signal about where to invest more effort. We similarly want to celebrate the successes and prove our safety case with data. Metrics should be broken down into granular scores across scenario, environment and dynamic dimensions. These are the dimensions of our Operational Design Domain (ODD), which defines the scope of our system’s designed capabilities.
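A rough sketch of what such a breakdown could look like follows. The `Attempt` record, its field values and the sample data are assumptions made for this example only; the post does not describe the real schema.

```python
from collections import Counter
from typing import NamedTuple


class Attempt(NamedTuple):
    """One evaluated driving attempt, tagged along three ODD dimensions."""
    scenario: str     # e.g. "roundabout"
    environment: str  # e.g. "rain"
    dynamic: str      # e.g. "dense_traffic"
    success: bool


def failure_rates(attempts, dimension):
    """Failure rate for each value of one ODD dimension."""
    total = Counter()
    failed = Counter()
    for a in attempts:
        key = getattr(a, dimension)
        total[key] += 1
        failed[key] += 0 if a.success else 1
    return {key: failed[key] / total[key] for key in total}


def worst_condition(attempts, dimension):
    """The value of `dimension` under which the system fails most often."""
    rates = failure_rates(attempts, dimension)
    return max(rates, key=rates.get)


# Invented sample data.
attempts = [
    Attempt("roundabout", "dry", "light_traffic", True),
    Attempt("roundabout", "rain", "light_traffic", False),
    Attempt("left_turn", "dry", "dense_traffic", True),
    Attempt("left_turn", "rain", "dense_traffic", False),
    Attempt("left_turn", "dry", "light_traffic", True),
    Attempt("roundabout", "rain", "dense_traffic", True),
]
env_rates = failure_rates(attempts, "environment")
worst_env = worst_condition(attempts, "environment")
```

In this toy data the scenario dimension shows no difference at all (one failure in three for each scenario), whereas the environment dimension points clearly at rain. That is the kind of signal a single aggregate score cannot give.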

Consistent, yet flexible

We want this framework to apply to any underlying hardware and software system, so that we don’t have to change our metrics whenever we change the solution. However, we also acknowledge that new scenarios and conditions will be added to our ODD over time, so the framework must be easy to extend. We must not encourage short-term thinking and generalisation debt.
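One way to picture this, purely as an illustrative sketch: the evaluation code depends only on a narrow interface, and the list of scenarios is plain data that can grow. The `DrivingSystem` interface, `ToySystem` and the scenario names below are all hypothetical.

```python
from typing import Dict, List, Protocol


class DrivingSystem(Protocol):
    """Anything that can attempt a named scenario (hypothetical interface)."""

    def attempt(self, scenario: str) -> bool: ...


def evaluate(system: DrivingSystem, scenarios: List[str], runs: int = 1) -> Dict[str, float]:
    """Success rate per scenario; knows nothing about the system's internals."""
    return {
        name: sum(system.attempt(name) for _ in range(runs)) / runs
        for name in scenarios
    }


class ToySystem:
    """Stand-in for a real stack: succeeds only on scenarios it 'knows'."""

    def __init__(self, known):
        self.known = set(known)

    def attempt(self, scenario: str) -> bool:
        return scenario in self.known


odd = ["lane_follow", "left_turn"]
v1 = evaluate(ToySystem({"lane_follow"}), odd)

odd.append("roundabout")  # extending the ODD needs no change to evaluate()
v2 = evaluate(ToySystem({"lane_follow", "left_turn"}), odd)
```

The same `evaluate` function scores both toy systems, and the new scenario shows up in the results as soon as it is added to the list.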

Automated

In order to apply this framework from mile one to mile one million, it must be designed for scale from the beginning. This means no human labelling and annotation.

The fleet learning loop

Let’s back up a bit and discuss at a high level how our Autonomous Driving system will reach maturity. Since we are a machine learning company, we strongly believe in setting up our company, technology and processes in a way that will learn and improve over time. We call this process the Fleet Learning Loop.

The stages of Fleet Learning at Wayve:

  1. Collect data
  2. Ingest data / build datasets and learning curriculum
  3. Train deep learning model
  4. Re-simulate failure scenarios
  5. Evaluate model in simulation
  6. Evaluate model in real-world driving
  7. Measure performance & identify areas for improvement
  8. Repeat

The important thing to note here is that iterations through the loop should yield improvements in autonomy performance. Everyone in the company plays a part in moving our products through this loop. We should be able to run through this process for every item in our ODD, and demonstrate improvements over time.
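As a toy illustration of the loop's shape, each stage can be modelled as a function that takes the current state and returns an updated one, with a score recorded after every pass. The three stages and the "learning curve" formula below are invented for the example; they stand in for the real stages listed above and say nothing about how actual models improve.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Stage = Callable[[State], State]


def run_loop(stages: List[Stage], state: State, iterations: int) -> List[float]:
    """Run every stage in order, `iterations` times, recording the score after each pass."""
    history = []
    for _ in range(iterations):
        for stage in stages:
            state = stage(state)
        history.append(state["score"])
    return history


def collect(state: State) -> State:
    """Pretend each pass adds ten hours of driving data."""
    return {**state, "hours": state["hours"] + 10.0}


def train(state: State) -> State:
    """Made-up saturating curve: more data gives more skill, with diminishing returns."""
    return {**state, "skill": state["hours"] / (state["hours"] + 50.0)}


def evaluate(state: State) -> State:
    """In this toy, the measured score is simply the current skill."""
    return {**state, "score": state["skill"]}


history = run_loop(
    [collect, train, evaluate],
    {"hours": 0.0, "skill": 0.0, "score": 0.0},
    iterations=3,
)
```

The point of the sketch is only the structure: once every stage exists, even in minimal form, the whole loop can be run end to end and the score tracked from one iteration to the next.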

It’s also worth noticing that the ML part of this loop is only a small fraction of the total effort required. This is consistent with the view that ML is 90% an engineering challenge. We want to be able to iterate around this loop as quickly as possible, and that requires solving some of the engineering problems along the way.

So how do we build a Fleet Learning Loop?

In the beginning, when there is no product, progress is not measured in incremental percentage point improvements, but rather in a set of binary steps. Can we collect data for training? Can we train a model? When we train a model, how do we know if it’s any good? And so on.

We prioritise completing an MVP of each stage of the entire loop for each driving scenario of our ODD, before we attempt to iterate on overall model performance. This gives us the confidence to build the entire pipeline for scale, avoiding metrics which push our team into generalisation debt and short-term thinking. For example, it would be easy to build a bespoke dataset to quickly address a problem in isolation, such as road sign detection. However, with this framework, we push ourselves to develop the data pipelines, metrics and trainers to learn at scale in our overall machine learning driving product.

This allows us to unblock some fundamental technical barriers at the beginning such as:

  • Integration with the vehicle
  • Data collection and ingestion from real-world driving, at scale
  • Data collection in simulation (including virtual world assets & expert driving policy)
  • Data loading / caching for training
  • Performance evaluation of scenarios in simulation
  • Performance evaluation of scenarios in the real world

As we mentioned earlier, our ODD will expand rapidly and with increasing complexity, so our approach in each scenario must be able to scale appropriately. The first iteration will be slow, since we have to build a lot of tooling. However, subsequent iterations should get progressively quicker, to the point where, in the long term, adding new features requires no human intervention.

Case Study: Teaching our Driving Intelligence to use its indicators

Video: embedded content, available on YouTube.

To put this into context, let us give a quick example:

Autonomy Goal: AV should automatically use its indicators when approaching a turn

  1. Collect data: Expert drivers gather driving data to serve as training examples for our models (this should happen naturally, since expert driving will use indicators very often).
  2. Data curation: Driving data (with indicators) is passed through our cloud ingest pipeline, to be made available for training.
  3. Training: Our Research Engineers train an indicator-aware end-to-end ML model using supervised imitation learning.
  4. Re-simulation: We implement indicator controls in simulation under various procedural scenarios, and collect examples of indicator control from real data to serve as test cases.
  5. Evaluation in simulation: This gives us a first understanding of whether the baseline model works at all. We can use this to “license” our models for real-world testing.
  6. Evaluation on the road: This gives us a baseline score for model performance on the road (supervised by our Safety Drivers).
  7. Measure performance: Results from our simulation & real-world evaluation appear automatically on internal dashboards.
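To give a feel for steps 5 to 7, here is a hedged sketch of one possible indicator metric and a simulation gate. The frame-by-frame agreement metric, the `Indicator` enum, the `licensed_for_road` helper and its 0.9 threshold are all assumptions for illustration; the post does not say how the real metric or licensing rule is defined.

```python
from enum import Enum
from typing import Dict, Sequence


class Indicator(Enum):
    OFF = 0
    LEFT = 1
    RIGHT = 2


def indicator_agreement(predicted: Sequence[Indicator], expert: Sequence[Indicator]) -> float:
    """Fraction of frames where the model's indicator state matches the expert's."""
    if len(predicted) != len(expert):
        raise ValueError("sequences must be the same length")
    if not expert:
        raise ValueError("need at least one frame")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


def licensed_for_road(sim_scores: Dict[str, float], threshold: float = 0.9) -> bool:
    """Hypothetical gate: there must be test cases, and all must meet the threshold."""
    return bool(sim_scores) and all(s >= threshold for s in sim_scores.values())


# Invented frames from one simulated roundabout approach.
expert = [Indicator.OFF, Indicator.OFF, Indicator.LEFT, Indicator.LEFT, Indicator.OFF]
predicted = [Indicator.OFF, Indicator.LEFT, Indicator.LEFT, Indicator.LEFT, Indicator.OFF]

agreement = indicator_agreement(predicted, expert)      # 4 of 5 frames match
ready = licensed_for_road({"roundabout": agreement})    # below the assumed threshold
```

Here the model indicates one frame too early, so agreement is 0.8 and the model would not yet pass the assumed gate. A real metric would probably need to tolerate small timing differences rather than compare frames exactly.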

This is a real example of a feature we implemented at Wayve. In this case, we were able to get a model to drive with indicator lights in under 2 weeks from idea to autonomous driving demo. This is in large part thanks to our approach of using end-to-end machine learning, which enables this fast iteration cycle. The video above shows an example of our model autonomously navigating through a roundabout scenario, on a road unseen during training, having learnt to actuate its indicator lights correctly.

Moving the needle – Autonomy performance matrix

Once we have our foundations in place, we can unleash our Research Engineers to iterate over this loop and push the performance of the feature. We track the results from simulation and real-world evaluation in what we call the Graphs of Autonomy.

Chart: Graphs of Autonomy, a scatter chart of tracked results from simulation and real-world evaluation

Each iteration cycle yields greater autonomy performance, which we curate into leaderboards for our Research Engineers as a form of healthy competition internally. The best score to date, per scenario, makes it to our Autonomy Performance Matrix, giving a performance metric for each individual driving domain.
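The relationship between leaderboards and the matrix can be sketched in a few lines. The `Result` record, the model names and the scores are placeholders, and "higher is better" is an assumption about the scoring convention.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    scenario: str
    score: float  # assumed convention: higher is better


def leaderboard(results, scenario):
    """Models ranked best-first for one scenario."""
    rows = [r for r in results if r.scenario == scenario]
    return sorted(rows, key=lambda r: r.score, reverse=True)


def performance_matrix(results):
    """Best score to date per scenario, with the model that achieved it."""
    best = {}
    for r in results:
        if r.scenario not in best or r.score > best[r.scenario].score:
            best[r.scenario] = r
    return {scenario: (r.model, r.score) for scenario, r in best.items()}


# Placeholder results, for illustration only.
results = [
    Result("model_a", "roundabout", 0.70),
    Result("model_b", "roundabout", 0.85),
    Result("model_a", "left_turn", 0.90),
    Result("model_b", "left_turn", 0.80),
]
roundabout_board = leaderboard(results, "roundabout")
matrix = performance_matrix(results)
```

In this example no single model tops every scenario: `model_b` holds the roundabout entry and `model_a` holds the left-turn entry, which is exactly the per-domain view the matrix is meant to give.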

Table: Autonomy Performance Matrix. Note: placeholder numbers for illustration purposes.

We can also run our evaluation suite on expert driver data, to estimate what human-level performance might look like. Once we reach greater than human-level performance in each metric, we can be confident of our system’s ability to drive safely on public roads.
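A minimal sketch of that comparison, assuming per-scenario scores on a 0 to 100 scale where higher is better (the function name, scale and numbers are all made up for the example):

```python
def gaps_to_human(model_scores, human_scores):
    """Scenarios where the model does not yet exceed the human baseline, with the shortfall."""
    gaps = {}
    for scenario, human in human_scores.items():
        model = model_scores.get(scenario, 0)  # a missing scenario counts as a score of zero
        if model <= human:
            gaps[scenario] = human - model
    return gaps


# Invented scores out of 100.
human = {"roundabout": 90, "left_turn": 90, "lane_follow": 98}
model = {"roundabout": 85, "left_turn": 95}

gaps = gaps_to_human(model, human)
exceeds_human_everywhere = not gaps
```

Here the model beats the baseline on left turns, trails by 5 points on roundabouts, and has not been evaluated on lane following at all, so it is not yet above human level on every metric.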

In the next post we will dive into more details about how we define our Autonomy Metrics, how we handle the complexity of evaluating real-world driving at a granular level, and how we can do this at scale.